Per-Book Language Profile Reports

The profiles module generates standardized statistical summaries for any Bible book:

Word count & vocabulary richness — type-token ratio (TTR), hapax legomena
Part-of-speech distribution — compared to the full OT or NT corpus average
Verb analysis — stem (binyan) distribution for Hebrew OT, tense/voice for Greek NT
Top 20 lexical lemmas — most frequent Strong's numbers in the book

Reports can be printed to the console, returned as a dict for further analysis, or saved as markdown files.
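As a rough illustration of the vocabulary metrics listed above, the sketch below computes a type-token ratio and a hapax count from a plain list of lemma IDs. It assumes TTR is unique lemmas divided by total words, which agrees with the figures printed for Genesis further down (2,041 / 20,161 is about 0.101). The helper name `vocab_stats` and the toy input are hypothetical and are not part of the profiles module.

```python
from collections import Counter


def vocab_stats(lemmas):
    """Return word count, unique lemma count, TTR, and hapax count.

    `lemmas` holds one lemma ID (e.g. a Strong's number) per word.
    Hypothetical helper for illustration, not the module's implementation.
    """
    counts = Counter(lemmas)
    total = len(lemmas)
    unique = len(counts)
    # Hapax legomena: lemmas that occur exactly once in the text
    hapax = sum(1 for n in counts.values() if n == 1)
    # Assumed definition: TTR = unique lemmas / total words
    ttr = unique / total if total else 0.0
    return {
        "total_words": total,
        "unique_lemmas": unique,
        "ttr": ttr,
        "hapax_count": hapax,
    }


# Toy input: 6 words, 4 distinct lemmas, 3 of which occur only once
stats = vocab_stats(["H0853", "H0559", "H0853", "H0413", "H0853", "H3605"])
print(stats)
```

A low TTR on a long book is expected, since common function words repeat more as the text grows, so TTR values are best compared between books of similar length.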

In [ ]:

# @title Colab setup (runs only on Google Colab)
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    import subprocess, os
    # Clone the repo so all source and data paths work
    if not os.path.isdir("/content/berean-bible-bots"):
        subprocess.run(
            ["git", "clone", "--depth", "1",
             "https://github.com/dnovick/berean-bible-bots.git",
             "/content/berean-bible-bots"],
            check=True,
        )
    os.chdir("/content/berean-bible-bots")
    sys.path.insert(0, "/content/berean-bible-bots/src")
    # Install Python dependencies
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "-r",
         "binder/requirements.txt"],
        check=True,
    )
    # Download processed data files (~295 MB, one-time)
    subprocess.run(["bash", "binder/postBuild"], check=True)
    print("Colab environment ready.")

In [1]:

import sys
sys.path.insert(0, '../../../src')

import pandas as pd
pd.set_option('display.max_colwidth', 100)

from bible_grammar.reporting.profiles import book_profile, print_profile, save_profile_report, batch_profiles

1. OT Profile: Genesis

In [2]:

print_profile('Gen')

============================================================
  Genesis (Gen)  —  Old Testament
============================================================
  Words: 20,161  |  Unique lemmas: 2,041  |  TTR: 0.101  |  Hapax: 811
  Chapters: 50

  Part-of-Speech Distribution (top 8):
    Noun              36.4%  (-4.7% vs corpus)  ██████████████████
    Verb              22.4%  (++0.9% vs corpus)  ███████████
    Suffix            17.1%  (++2.0% vs corpus)  ████████
    Particle          11.4%  (++0.7% vs corpus)  █████
    Preposition        3.6%  (-0.1% vs corpus)  █
    Adjective          3.5%  (++0.6% vs corpus)  █
    Pronoun            2.3%  (++0.3% vs corpus)  █
    Adverb             1.7%  (++0.3% vs corpus)  
    c                  0.9%  (++0.4% vs corpus)  
    a                  0.6%  (-0.1% vs corpus)  

  Verbs: 4,511 total
  Stem Distribution:
    Qal             78.3%  (++7.2% vs corpus)
    Hiphil           8.8%  (-2.6% vs corpus)
    Piel             6.2%  (-1.9% vs corpus)
    Niphal           4.3%  (-1.8% vs corpus)
    Hithpael         1.0%  (-0.4% vs corpus)
    Pual             0.6%  (-0.2% vs corpus)
    Hophal           0.5%  (-0.1% vs corpus)
    Haphel           0.5%  (++0.3% vs corpus)

  Top 10 Lexical Lemmas (Strong's):
    H0853                 985 occurrences
    H0559                 582 occurrences
    H0413                 465 occurrences
    H0834A                360 occurrences
    H3605                 339 occurrences
    H1961                 313 occurrences
    H1121A                308 occurrences
    H5921A                297 occurrences
    H0776G                286 occurrences
    H3588A                267 occurrences

2. OT Profile: Isaiah (a verb-rich prophetic book)

In [3]:

print_profile('Isa')

============================================================
  Isaiah (Isa)  —  Old Testament
============================================================
  Words: 16,509  |  Unique lemmas: 2,753  |  TTR: 0.167  |  Hapax: 1,178
  Chapters: 66

  Part-of-Speech Distribution (top 8):
    Noun              38.2%  (-2.9% vs corpus)  ███████████████████
    Verb              25.9%  (++4.4% vs corpus)  ████████████
    Suffix            15.7%  (++0.6% vs corpus)  ███████
    Particle           9.4%  (-1.3% vs corpus)  ████
    Preposition        2.8%  (-0.9% vs corpus)  █
    Adjective          2.6%  (-0.3% vs corpus)  █
    Pronoun            2.3%  (++0.3% vs corpus)  █
    Adverb             1.8%  (++0.4% vs corpus)  
    a                  0.9%  (++0.2% vs corpus)  
    c                  0.3%  (-0.2% vs corpus)  

  Verbs: 4,274 total
  Stem Distribution:
    Qal             66.2%  (-4.9% vs corpus)
    Hiphil          12.2%  (++0.8% vs corpus)
    Niphal           8.8%  (++2.7% vs corpus)
    Piel             8.1%  (+0.0% vs corpus)
    Hithpael         1.8%  (++0.4% vs corpus)
    Pual             1.8%  (++1.0% vs corpus)
    Hophal           0.9%  (++0.3% vs corpus)
    Haphel           0.2%  (+0.0% vs corpus)
    u                0.0%  (-0.1% vs corpus)

  Top 10 Lexical Lemmas (Strong's):
    H3808                 429 occurrences
    H3068G                416 occurrences
    H5921A                327 occurrences
    H3588A                318 occurrences
    H0559                 246 occurrences
    H3605                 239 occurrences
    H0853                 211 occurrences
    H1961                 202 occurrences
    H0413                 180 occurrences
    H0776G                176 occurrences

3. NT Profile: Romans

In [4]:

print_profile('Rom')

============================================================
  Romans (Rom)  —  New Testament
============================================================
  Words: 7,175  |  Unique lemmas: 1,070  |  TTR: 0.149  |  Hapax: 583
  Chapters: 16

  Part-of-Speech Distribution (top 8):
    Noun              23.7%  (++3.0% vs corpus)  ███████████
    Verb              16.4%  (-3.7% vs corpus)  ████████
    Article           15.6%  (++1.0% vs corpus)  ███████
    Conjunction       12.9%  (-0.1% vs corpus)  ██████
    Preposition        9.1%  (++1.0% vs corpus)  ████
    Pronoun            6.0%  (-2.1% vs corpus)  ███
    Adjective          5.8%  (-0.2% vs corpus)  ██
    Particle           4.0%  (++1.2% vs corpus)  ██
    R                  1.4%  (++0.3% vs corpus)  
    Adverb             1.3%  (-0.4% vs corpus)  

  Verbs: 1,177 total
  Tense Distribution:
    Present         52.6%
    Aorist          20.6%
    2nd Aorist      10.2%
    Future           8.4%
    R                5.9%
    Imperfect        1.3%
    2R               0.8%
    2L               0.1%
  Voice Distribution:
    Active          68.0%
    Passive         16.0%
    Deponent         6.5%
    N                6.0%
    Middle           3.1%
    Middle Deponent   0.3%

  Top 10 Lexical Lemmas (Strong's):
    G3588                1121 occurrences
    G2532                 292 occurrences
    G1722                 175 occurrences
    G0846                 159 occurrences
    G2316                 151 occurrences
    G1161                 147 occurrences
    G1063                 146 occurrences
    G4771                 139 occurrences
    G3756                 133 occurrences
    G1519                 116 occurrences

4. NT Profile: Revelation

In [5]:

print_profile('Rev')

============================================================
  Revelation (Rev)  —  New Testament
============================================================
  Words: 10,167  |  Unique lemmas: 943  |  TTR: 0.093  |  Hapax: 322
  Chapters: 22

  Part-of-Speech Distribution (top 8):
    Noun              23.6%  (++2.9% vs corpus)  ███████████
    Article           19.3%  (++4.7% vs corpus)  █████████
    Verb              15.8%  (-4.3% vs corpus)  ███████
    Conjunction       13.7%  (++0.7% vs corpus)  ██████
    Adjective          8.2%  (++2.2% vs corpus)  ████
    Preposition        7.4%  (-0.7% vs corpus)  ███
    Pronoun            6.3%  (-1.8% vs corpus)  ███
    Particle           2.0%  (-0.8% vs corpus)  █
    R                  0.8%  (-0.3% vs corpus)  
    Adverb             0.8%  (-0.9% vs corpus)  

  Verbs: 1,610 total
  Tense Distribution:
    Present         39.3%
    Aorist          25.8%
    2nd Aorist      16.6%
    R                7.5%
    Future           7.3%
    Imperfect        2.7%
    2R               0.4%
    2nd Future       0.1%
  Voice Distribution:
    Active          75.1%
    Passive         13.9%
    N                5.8%
    Deponent         3.8%
    Middle           1.1%
    Middle or Passive   0.2%

  Top 10 Lexical Lemmas (Strong's):
    G3588                1958 occurrences
    G2532                1176 occurrences
    G0846                 450 occurrences
    G1722                 165 occurrences
    G1909                 145 occurrences
    G1537                 140 occurrences
    G1510                 121 occurrences
    G2192                 103 occurrences
    G2316                  99 occurrences
    G3004G                 92 occurrences

5. Programmatic Access: book_profile() returns a dict

In [6]:

p = book_profile('Jhn')

print(f"Book: {p['book_name']}")
print(f"Words: {p['total_words']:,}  |  Unique lemmas: {p['unique_strongs']:,}  |  TTR: {p['ttr']:.3f}")
print(f"Hapax legomena: {p['hapax_count']:,} ({p['hapax_count']/p['unique_strongs']*100:.1f}% of vocab)")
print()

# POS as a DataFrame
pos_df = pd.Series(p['pos_distribution'], name='pct').reset_index()
pos_df.columns = ['pos', 'pct']
pos_df['corpus_avg'] = pos_df['pos'].map(p['baseline']['pos_pct'])
pos_df['delta'] = (pos_df['pct'] - pos_df['corpus_avg']).round(1)
print(pos_df.sort_values('pct', ascending=False).to_string(index=False))

Book: John
Words: 16,069  |  Unique lemmas: 1,051  |  TTR: 0.065
Hapax legomena: 397 (37.8% of vocab)

        pos  pct  corpus_avg  delta
       Verb 22.7        20.1    2.6
       Noun 17.2        20.7   -3.5
    Article 14.4        14.6   -0.2
Conjunction 12.8        13.0   -0.2
    Pronoun 11.0         8.1    2.9
Preposition  7.0         8.1   -1.1
  Adjective  3.8         6.0   -2.2
   Particle  3.2         2.8    0.4
          D  2.0         1.2    0.8
     Adverb  2.0         1.7    0.3
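To see where a book departs most from its corpus, the deltas can be ranked by magnitude. The sketch below hard-codes a few rows copied from the John table above so that it runs without the data files; with a real profile, the `pos_df` built in the cell above could be sorted the same way.

```python
import pandas as pd

# Values copied from the John table above (percent of words)
john = pd.DataFrame({
    "pos": ["Verb", "Noun", "Article", "Conjunction",
            "Pronoun", "Preposition", "Adjective", "Particle"],
    "pct": [22.7, 17.2, 14.4, 12.8, 11.0, 7.0, 3.8, 3.2],
    "corpus_avg": [20.1, 20.7, 14.6, 13.0, 8.1, 8.1, 6.0, 2.8],
})
john["delta"] = (john["pct"] - john["corpus_avg"]).round(1)

# Sort by the absolute size of the deviation, largest first
ranked = john.sort_values("delta", key=lambda s: s.abs(), ascending=False)
print(ranked.head(3).to_string(index=False))
```

For John this puts Noun (-3.5), Pronoun (+2.9), and Verb (+2.6) first: fewer nouns and more pronouns and verbs than the NT average.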

6. Cross-Book Comparison: Verb Stem Distribution across Torah

In [7]:

torah = ['Gen', 'Exo', 'Lev', 'Num', 'Deu']
rows = []
for book_id in torah:
    p = book_profile(book_id)
    stems = p['verb_detail']['stem_distribution']
    row = {'book': p['book_name']}
    row.update(stems)
    rows.append(row)

stem_cmp = pd.DataFrame(rows).set_index('book').fillna(0).round(1)
# Show only major stems
major_stems = ['Qal', 'Niphal', 'Piel', 'Hiphil', 'Pual', 'Hophal', 'Hithpael']
stem_cmp = stem_cmp[[c for c in major_stems if c in stem_cmp.columns]]
print("Verb Stem Distribution (%) across Torah:")
stem_cmp

Verb Stem Distribution (%) across Torah:

Out[7]:

	Qal	Niphal	Piel	Hiphil	Pual	Hophal	Hithpael
book
Genesis	78.3	4.3	6.2	8.8	0.6	0.5	1.0
Exodus	71.6	4.9	10.8	9.2	1.1	1.6	0.6
Leviticus	60.9	7.3	13.6	14.9	0.4	1.5	1.2
Numbers	72.8	5.1	10.1	9.2	0.4	0.7	1.6
Deuteronomy	74.5	5.1	8.6	9.8	0.3	0.3	1.2
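One way to read a comparison table like this is to look at each book's deviation from the group mean. The sketch below re-enters three columns from the table above by hand and uses an unweighted mean of the five books, which is an assumption made for illustration and is not the same as the corpus baseline used by the profiles.

```python
import pandas as pd

# Three columns copied from the Torah table above (percent of verbs)
stems = pd.DataFrame(
    {
        "Qal": [78.3, 71.6, 60.9, 72.8, 74.5],
        "Piel": [6.2, 10.8, 13.6, 10.1, 8.6],
        "Hiphil": [8.8, 9.2, 14.9, 9.2, 9.8],
    },
    index=["Genesis", "Exodus", "Leviticus", "Numbers", "Deuteronomy"],
)

# Deviation of each book from the unweighted five-book mean
raw_dev = stems - stems.mean()
print(raw_dev.round(1))

# Book furthest from the mean for each stem (computed before rounding,
# so near-ties are not decided by row order)
outlier = raw_dev.abs().idxmax()
print(outlier)
```

With these figures Leviticus is the furthest from the mean in all three columns: lower in Qal, higher in Piel and Hiphil.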

7. Cross-Book Comparison: Tense Distribution across Paul's Letters

In [8]:

pauline = ['Rom', '1Co', '2Co', 'Gal', 'Eph', 'Php', 'Col', 'Heb']
rows = []
for book_id in pauline:
    p = book_profile(book_id)
    tenses = p['verb_detail'].get('tense_distribution', {})
    row = {'book': p['book_name']}
    row.update(tenses)
    rows.append(row)

tense_cmp = pd.DataFrame(rows).set_index('book').fillna(0).round(1)
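# Keep only the major tenses that actually appear as column labels.
# Note: the Romans output above labels one tense 'R' rather than 'Perfect',
# so 'Perfect' matches no column and is filtered out of the table.
major_tenses = ['Present', 'Aorist', '2nd Aorist', 'Perfect', 'Future', 'Imperfect']
tense_cmp = tense_cmp[[c for c in major_tenses if c in tense_cmp.columns]]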
print("Verb Tense Distribution (%) across Paul's Letters:")
tense_cmp

Verb Tense Distribution (%) across Paul's Letters:

Out[8]:

	Present	Aorist	2nd Aorist	Future	Imperfect
book
Romans	52.6	20.6	10.2	8.4	1.3
1 Corinthians	61.2	14.8	9.1	6.0	1.5
2 Corinthians	53.4	23.2	9.0	5.3	1.2
Galatians	49.6	19.4	13.9	4.8	5.3
Ephesians	53.0	28.7	9.1	2.4	1.2
Philippians	57.4	16.0	11.7	5.9	2.0
Colossians	53.8	25.4	8.9	2.5	1.3
Hebrews	46.0	23.6	12.5	5.5	2.9
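The table above has no Perfect column, apparently because the tense distribution labels it `R` (see the Romans output in section 3), and it lists first and second aorists separately. A minimal sketch of collapsing such labels before comparing books follows. The mapping assumes the short codes follow Robinson-style morphology codes (`R` perfect, `L` pluperfect, a leading `2` for second forms); that reading should be checked against the data source, and `TENSE_GROUPS` and `group_tenses` are hypothetical names.

```python
# Assumed mapping from raw labels to grouped tense names (Robinson-style codes).
# Labels not listed here are passed through unchanged.
TENSE_GROUPS = {
    "2nd Aorist": "Aorist",
    "2nd Future": "Future",
    "R": "Perfect",
    "2R": "Perfect",
    "L": "Pluperfect",
    "2L": "Pluperfect",
}


def group_tenses(dist):
    """Sum percentages of labels that map to the same grouped tense."""
    grouped = {}
    for label, pct in dist.items():
        name = TENSE_GROUPS.get(label, label)
        grouped[name] = grouped.get(name, 0.0) + pct
    # Round once at the end to avoid accumulating rounding error
    return {name: round(pct, 1) for name, pct in grouped.items()}


# Tense percentages copied from the Romans output in section 3
romans = {
    "Present": 52.6, "Aorist": 20.6, "2nd Aorist": 10.2, "Future": 8.4,
    "R": 5.9, "Imperfect": 1.3, "2R": 0.8, "2L": 0.1,
}
grouped = group_tenses(romans)
print(grouped)
```

Applied to each book's `tense_distribution` before building the DataFrame, this would give a single Aorist column and a visible Perfect column.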

8. Saving Markdown Reports

In [9]:

# Save a single book
path = save_profile_report('Dan')
print(f"Saved: {path}")

# Save all NT books
# paths = batch_profiles(testament='NT')
# print(f"Generated {len(paths)} NT profiles")

# Save all 66 books
# paths = batch_profiles()  # saves to output/reports/ot/survey/

Saved: /Users/dnovick/gitrepos/projects/bible/bible-grammar-stats/reports/profiles/Dan_profile.md

Quick Reference

from bible_grammar.reporting.profiles import book_profile, print_profile, save_profile_report, batch_profiles

# Print to console
print_profile('Gen')         # OT Hebrew
print_profile('Rom')         # NT Greek

# Get as dict for analysis
p = book_profile('Isa')
p['total_words']             # int
p['ttr']                     # type-token ratio
p['hapax_count']             # words appearing only once
p['pos_distribution']        # dict of POS → %
p['verb_detail']             # stems (OT) or tense/voice (NT)
p['top_lemmas']              # {strongs: count, ...}
p['baseline']                # corpus average for comparison
p['baseline_delta']          # this book minus corpus average

# Save markdown report
save_profile_report('Gen')              # → output/reports/ot/survey/Gen_profile.md
save_profile_report('Gen', 'out.md')    # custom path

# Batch
batch_profiles(testament='OT')          # all 39 OT books
batch_profiles(testament='NT')          # all 27 NT books
batch_profiles()                        # all 66 books
batch_profiles(book_ids=['Gen','Isa'])  # explicit list