Per-Book Language Profile Reports
The profiles module generates standardized statistical summaries for any Bible book:
- Word count & vocabulary richness — type-token ratio (TTR), hapax legomena
- Part-of-speech distribution — compared to the full OT or NT corpus average
- Verb analysis — stem (binyan) distribution for Hebrew OT, tense/voice for Greek NT
- Top lexical lemmas — most frequent Strong's numbers in the book (the console report lists the top 10)
Reports can be printed to the console, returned as a dict for further analysis, or saved as markdown files.
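The vocabulary figures in the reports below are consistent with TTR being unique lemmas divided by total words (for Genesis, 2,041 / 20,161 ≈ 0.101) and hapax legomena being lemmas that occur exactly once. The following is a minimal sketch of those two measures on a toy list of Strong's numbers. The function name `vocabulary_stats` and its return keys are hypothetical illustrations, not the module's actual implementation.

```python
from collections import Counter

def vocabulary_stats(lemmas):
    """Return token count, unique-lemma count, TTR, and hapax count for a list of lemma IDs.

    Hypothetical helper for illustration; not part of bible_grammar.profiles.
    """
    counts = Counter(lemmas)
    total = len(lemmas)
    unique = len(counts)
    # A hapax legomenon is a lemma that occurs exactly once in the text.
    hapax = sum(1 for n in counts.values() if n == 1)
    # Type-token ratio: distinct lemmas divided by total tokens.
    ttr = unique / total if total else 0.0
    return {
        "total_words": total,
        "unique_lemmas": unique,
        "ttr": round(ttr, 3),
        "hapax_count": hapax,
    }

# Toy input: six tokens, four distinct lemmas, two of which occur once.
sample = ["H0853", "H0559", "H0853", "H0413", "H0559", "H3605"]
stats = vocabulary_stats(sample)
print(stats)
```

Because TTR falls as a text gets longer, comparing it across books of very different lengths should be done with care.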
import sys
sys.path.insert(0, '../../../src')
import pandas as pd
pd.set_option('display.max_colwidth', 100)
from bible_grammar.profiles import book_profile, print_profile, save_profile_report, batch_profiles
1. OT Profile: Genesis
print_profile('Gen')
============================================================
Genesis (Gen) — Old Testament
============================================================
Words: 20,161 | Unique lemmas: 2,041 | TTR: 0.101 | Hapax: 811
Chapters: 50
Part-of-Speech Distribution (top 8):
Noun 36.4% (-4.7% vs corpus) ██████████████████
Verb 22.4% (++0.9% vs corpus) ███████████
Suffix 17.1% (++2.0% vs corpus) ████████
Particle 11.4% (++0.7% vs corpus) █████
Preposition 3.6% (-0.1% vs corpus) █
Adjective 3.5% (++0.6% vs corpus) █
Pronoun 2.3% (++0.3% vs corpus) █
Adverb 1.7% (++0.3% vs corpus)
c 0.9% (++0.4% vs corpus)
a 0.6% (-0.1% vs corpus)
Verbs: 4,511 total
Stem Distribution:
Qal 78.3% (++7.2% vs corpus)
Hiphil 8.8% (-2.6% vs corpus)
Piel 6.2% (-1.9% vs corpus)
Niphal 4.3% (-1.8% vs corpus)
Hithpael 1.0% (-0.4% vs corpus)
Pual 0.6% (-0.2% vs corpus)
Hophal 0.5% (-0.1% vs corpus)
Haphel 0.5% (++0.3% vs corpus)
Top 10 Lexical Lemmas (Strong's):
H0853 985 occurrences
H0559 582 occurrences
H0413 465 occurrences
H0834A 360 occurrences
H3605 339 occurrences
H1961 313 occurrences
H1121A 308 occurrences
H5921A 297 occurrences
H0776G 286 occurrences
H3588A 267 occurrences
2. OT Profile: Isaiah (verb-rich: 25.9% verbs, 4.4 points above the OT average)
print_profile('Isa')
============================================================
Isaiah (Isa) — Old Testament
============================================================
Words: 16,509 | Unique lemmas: 2,753 | TTR: 0.167 | Hapax: 1,178
Chapters: 66
Part-of-Speech Distribution (top 8):
Noun 38.2% (-2.9% vs corpus) ███████████████████
Verb 25.9% (++4.4% vs corpus) ████████████
Suffix 15.7% (++0.6% vs corpus) ███████
Particle 9.4% (-1.3% vs corpus) ████
Preposition 2.8% (-0.9% vs corpus) █
Adjective 2.6% (-0.3% vs corpus) █
Pronoun 2.3% (++0.3% vs corpus) █
Adverb 1.8% (++0.4% vs corpus)
a 0.9% (++0.2% vs corpus)
c 0.3% (-0.2% vs corpus)
Verbs: 4,274 total
Stem Distribution:
Qal 66.2% (-4.9% vs corpus)
Hiphil 12.2% (++0.8% vs corpus)
Niphal 8.8% (++2.7% vs corpus)
Piel 8.1% (+0.0% vs corpus)
Hithpael 1.8% (++0.4% vs corpus)
Pual 1.8% (++1.0% vs corpus)
Hophal 0.9% (++0.3% vs corpus)
Haphel 0.2% (+0.0% vs corpus)
u 0.0% (-0.1% vs corpus)
Top 10 Lexical Lemmas (Strong's):
H3808 429 occurrences
H3068G 416 occurrences
H5921A 327 occurrences
H3588A 318 occurrences
H0559 246 occurrences
H3605 239 occurrences
H0853 211 occurrences
H1961 202 occurrences
H0413 180 occurrences
H0776G 176 occurrences
3. NT Profile: Romans
print_profile('Rom')
============================================================
Romans (Rom) — New Testament
============================================================
Words: 7,175 | Unique lemmas: 1,070 | TTR: 0.149 | Hapax: 583
Chapters: 16
Part-of-Speech Distribution (top 8):
Noun 23.7% (++3.0% vs corpus) ███████████
Verb 16.4% (-3.7% vs corpus) ████████
Article 15.6% (++1.0% vs corpus) ███████
Conjunction 12.9% (-0.1% vs corpus) ██████
Preposition 9.1% (++1.0% vs corpus) ████
Pronoun 6.0% (-2.1% vs corpus) ███
Adjective 5.8% (-0.2% vs corpus) ██
Particle 4.0% (++1.2% vs corpus) ██
R 1.4% (++0.3% vs corpus)
Adverb 1.3% (-0.4% vs corpus)
Verbs: 1,177 total
Tense Distribution:
Present 52.6%
Aorist 20.6%
2nd Aorist 10.2%
Future 8.4%
R 5.9%
Imperfect 1.3%
2R 0.8%
2L 0.1%
Voice Distribution:
Active 68.0%
Passive 16.0%
Deponent 6.5%
N 6.0%
Middle 3.1%
Middle Deponent 0.3%
Top 10 Lexical Lemmas (Strong's):
G3588 1121 occurrences
G2532 292 occurrences
G1722 175 occurrences
G0846 159 occurrences
G2316 151 occurrences
G1161 147 occurrences
G1063 146 occurrences
G4771 139 occurrences
G3756 133 occurrences
G1519 116 occurrences
4. NT Profile: Revelation
print_profile('Rev')
============================================================
Revelation (Rev) — New Testament
============================================================
Words: 10,167 | Unique lemmas: 943 | TTR: 0.093 | Hapax: 322
Chapters: 22
Part-of-Speech Distribution (top 8):
Noun 23.6% (++2.9% vs corpus) ███████████
Article 19.3% (++4.7% vs corpus) █████████
Verb 15.8% (-4.3% vs corpus) ███████
Conjunction 13.7% (++0.7% vs corpus) ██████
Adjective 8.2% (++2.2% vs corpus) ████
Preposition 7.4% (-0.7% vs corpus) ███
Pronoun 6.3% (-1.8% vs corpus) ███
Particle 2.0% (-0.8% vs corpus) █
R 0.8% (-0.3% vs corpus)
Adverb 0.8% (-0.9% vs corpus)
Verbs: 1,610 total
Tense Distribution:
Present 39.3%
Aorist 25.8%
2nd Aorist 16.6%
R 7.5%
Future 7.3%
Imperfect 2.7%
2R 0.4%
2nd Future 0.1%
Voice Distribution:
Active 75.1%
Passive 13.9%
N 5.8%
Deponent 3.8%
Middle 1.1%
Middle or Passive 0.2%
Top 10 Lexical Lemmas (Strong's):
G3588 1958 occurrences
G2532 1176 occurrences
G0846 450 occurrences
G1722 165 occurrences
G1909 145 occurrences
G1537 140 occurrences
G1510 121 occurrences
G2192 103 occurrences
G2316 99 occurrences
G3004G 92 occurrences
5. Programmatic Access: book_profile() returns a dict
p = book_profile('Jhn')
print(f"Book: {p['book_name']}")
print(f"Words: {p['total_words']:,} | Unique lemmas: {p['unique_strongs']:,} | TTR: {p['ttr']:.3f}")
print(f"Hapax legomena: {p['hapax_count']:,} ({p['hapax_count']/p['unique_strongs']*100:.1f}% of vocab)")
print()
# POS as a DataFrame
pos_df = pd.Series(p['pos_distribution'], name='pct').reset_index()
pos_df.columns = ['pos', 'pct']
pos_df['corpus_avg'] = pos_df['pos'].map(p['baseline']['pos_pct'])
pos_df['delta'] = (pos_df['pct'] - pos_df['corpus_avg']).round(1)
print(pos_df.sort_values('pct', ascending=False).to_string(index=False))
Book: John
Words: 16,069 | Unique lemmas: 1,051 | TTR: 0.065
Hapax legomena: 397 (37.8% of vocab)
pos pct corpus_avg delta
Verb 22.7 20.1 2.6
Noun 17.2 20.7 -3.5
Article 14.4 14.6 -0.2
Conjunction 12.8 13.0 -0.2
Pronoun 11.0 8.1 2.9
Preposition 7.0 8.1 -1.1
Adjective 3.8 6.0 -2.2
Particle 3.2 2.8 0.4
D 2.0 1.2 0.8
Adverb 2.0 1.7 0.3
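The `delta` column is a percentage-point difference: the book's share minus the corpus share. As a small sketch of how to rank categories by how far they deviate in either direction, the snippet below uses plain dicts with four rows copied from the John output above. The helper name `signed_deltas` is hypothetical.

```python
def signed_deltas(book_pct, corpus_pct):
    """Percentage-point difference (book minus corpus) for each shared category.

    Hypothetical helper for illustration; not part of bible_grammar.profiles.
    """
    return {
        k: round(book_pct[k] - corpus_pct[k], 1)
        for k in book_pct
        if k in corpus_pct
    }

# Values copied from the John output above (percent of all words).
john = {"Verb": 22.7, "Noun": 17.2, "Pronoun": 11.0, "Adjective": 3.8}
corpus = {"Verb": 20.1, "Noun": 20.7, "Pronoun": 8.1, "Adjective": 6.0}

deltas = signed_deltas(john, corpus)

# Rank by absolute deviation so large negative gaps are not buried at the bottom.
ranked = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)
for pos, d in ranked:
    # The '+' format flag prints exactly one sign for positive and negative values.
    print(f"{pos:<10} {d:+.1f}")
```

Sorting by absolute value puts Noun (below the corpus average) ahead of Pronoun and Verb (above it), which a plain sort on `pct` would not show.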
6. Cross-Book Comparison: Verb Stem Distribution across Torah
torah = ['Gen', 'Exo', 'Lev', 'Num', 'Deu']
rows = []
for book_id in torah:
    p = book_profile(book_id)
    stems = p['verb_detail']['stem_distribution']
    row = {'book': p['book_name']}
    row.update(stems)
    rows.append(row)
stem_cmp = pd.DataFrame(rows).set_index('book').fillna(0).round(1)
# Show only major stems
major_stems = ['Qal', 'Niphal', 'Piel', 'Hiphil', 'Pual', 'Hophal', 'Hithpael']
stem_cmp = stem_cmp[[c for c in major_stems if c in stem_cmp.columns]]
print("Verb Stem Distribution (%) across Torah:")
stem_cmp
Verb Stem Distribution (%) across Torah:
| book | Qal | Niphal | Piel | Hiphil | Pual | Hophal | Hithpael |
|---|---|---|---|---|---|---|---|
| Genesis | 78.3 | 4.3 | 6.2 | 8.8 | 0.6 | 0.5 | 1.0 |
| Exodus | 71.6 | 4.9 | 10.8 | 9.2 | 1.1 | 1.6 | 0.6 |
| Leviticus | 60.9 | 7.3 | 13.6 | 14.9 | 0.4 | 1.5 | 1.2 |
| Numbers | 72.8 | 5.1 | 10.1 | 9.2 | 0.4 | 0.7 | 1.6 |
| Deuteronomy | 74.5 | 5.1 | 8.6 | 9.8 | 0.3 | 0.3 | 1.2 |
7. Cross-Book Comparison: Tense Distribution across Paul's Letters and Hebrews
# Seven Pauline letters plus Hebrews
pauline = ['Rom', '1Co', '2Co', 'Gal', 'Eph', 'Php', 'Col', 'Heb']
rows = []
for book_id in pauline:
    p = book_profile(book_id)
    tenses = p['verb_detail'].get('tense_distribution', {})
    row = {'book': p['book_name']}
    row.update(tenses)
    rows.append(row)
tense_cmp = pd.DataFrame(rows).set_index('book').fillna(0).round(1)
# A tense is shown only if the profiles report a column with exactly this name
major_tenses = ['Present', 'Aorist', '2nd Aorist', 'Perfect', 'Future', 'Imperfect']
tense_cmp = tense_cmp[[c for c in major_tenses if c in tense_cmp.columns]]
print("Verb Tense Distribution (%) across Paul's Letters:")
tense_cmp
Verb Tense Distribution (%) across Paul's Letters:
| book | Present | Aorist | 2nd Aorist | Future | Imperfect |
|---|---|---|---|---|---|
| Romans | 52.6 | 20.6 | 10.2 | 8.4 | 1.3 |
| 1 Corinthians | 61.2 | 14.8 | 9.1 | 6.0 | 1.5 |
| 2 Corinthians | 53.4 | 23.2 | 9.0 | 5.3 | 1.2 |
| Galatians | 49.6 | 19.4 | 13.9 | 4.8 | 5.3 |
| Ephesians | 53.0 | 28.7 | 9.1 | 2.4 | 1.2 |
| Philippians | 57.4 | 16.0 | 11.7 | 5.9 | 2.0 |
| Colossians | 53.8 | 25.4 | 8.9 | 2.5 | 1.3 |
| Hebrews | 46.0 | 23.6 | 12.5 | 5.5 | 2.9 |
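The table has no Perfect column even though `major_tenses` asks for one. In the Romans report earlier on this page, the tense rows include the raw codes `R`, `2R`, and `2L`, which suggests that some tense codes are not mapped to readable names. Below is a minimal sketch of relabeling such codes before selecting columns. It assumes Robinson-style morphology codes (R = perfect, L = pluperfect, with a leading 2 marking a "second" form); that assumption, the `TENSE_LABELS` mapping, and the `relabel` helper are all hypothetical and should be checked against the dataset's own documentation.

```python
# ASSUMPTION: Robinson-style tense codes. The profiles module may use a different scheme.
TENSE_LABELS = {
    "R": "Perfect",
    "2R": "2nd Perfect",
    "L": "Pluperfect",
    "2L": "2nd Pluperfect",
}

def relabel(distribution, labels):
    """Rename raw codes to readable labels, summing values if two keys share a label.

    Hypothetical helper for illustration; not part of bible_grammar.profiles.
    """
    out = {}
    for key, pct in distribution.items():
        name = labels.get(key, key)  # keys without a mapping pass through unchanged
        out[name] = round(out.get(name, 0.0) + pct, 1)
    return out

# Tense percentages copied from the Romans report earlier on this page.
romans = {
    "Present": 52.6, "Aorist": 20.6, "2nd Aorist": 10.2, "Future": 8.4,
    "R": 5.9, "Imperfect": 1.3, "2R": 0.8, "2L": 0.1,
}
romans_labeled = relabel(romans, TENSE_LABELS)
print(romans_labeled)
```

Applying a step like this to each `tense_distribution` dict before building the DataFrame would let the `'Perfect'` entry in `major_tenses` match a column.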
8. Saving Markdown Reports
# Save a single book
path = save_profile_report('Dan')
print(f"Saved: {path}")
# Save all NT books
# paths = batch_profiles(testament='NT')
# print(f"Generated {len(paths)} NT profiles")
# Save all 66 books
# paths = batch_profiles() # default output directory (reports/profiles/ in the run shown here)
Saved: /Users/dnovick/gitrepos/projects/bible/bible-grammar-stats/reports/profiles/Dan_profile.md
Quick Reference
from bible_grammar.profiles import book_profile, print_profile, save_profile_report, batch_profiles
# Print to console
print_profile('Gen') # OT Hebrew
print_profile('Rom') # NT Greek
# Get as dict for analysis
p = book_profile('Isa')
p['total_words'] # int
p['ttr'] # type-token ratio
p['hapax_count'] # lemmas occurring only once
p['pos_distribution'] # dict of POS → %
p['verb_detail'] # stems (OT) or tense/voice (NT)
p['top_lemmas'] # {strongs: count, ...}
p['baseline'] # corpus average for comparison
p['baseline_delta'] # this book minus corpus average
# Save markdown report
save_profile_report('Gen') # → default path (reports/profiles/Gen_profile.md, going by the saved path shown in section 8)
save_profile_report('Gen', 'out.md') # custom path
# Batch
batch_profiles(testament='OT') # all 39 OT books
batch_profiles(testament='NT') # all 27 NT books
batch_profiles() # all 66 books
batch_profiles(book_ids=['Gen','Isa']) # explicit list