Hebrew OT Register and Style Analysis
Quantitative stylometric profiling of Biblical Hebrew books, using metrics that capture vocabulary richness, syntactic register, and morphological fingerprints.
Hebrew style metrics:
| Metric | What it captures |
|---|---|
| TTR | Type-token ratio — raw vocabulary richness |
| MSTTR (1k window) | Mean segmental TTR — fair cross-length comparison |
| Hapax density % | Rare/unique vocabulary |
| Wayyiqtol density % | Narrative momentum (high in narrative, low in poetry/law) |
| Inf. construct/1k | Subordination and clause-chaining |
| אֲשֶׁר /1k | Relative clause density |
| Particle density/1k | Discourse connectivity (כִּי, הִנֵּה, לָכֵן, etc.) |
Classic authorship question: Isaiah 1–39 vs. 40–66 — do they cluster separately?
References: Andersen & Forbes, The Computer Bible; Longacre, The Grammar of Discourse.
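TTR and hapax density from the table above reduce to a few lines of counting. The following is a minimal sketch on an invented toy lemma list, not the `bible_grammar` implementation; in particular, dividing hapaxes by total tokens (rather than by distinct types) is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Distinct types divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hapax_density_pct(tokens):
    """Percentage of tokens whose type occurs exactly once in the text.

    Assumption: the denominator is the token count, not the type count.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return 100.0 * hapaxes / len(tokens)


# Toy transliterated lemma sequence, invented purely for illustration
toy_lemmas = ["bara", "elohim", "et", "shamayim", "et", "erets", "erets", "haya"]

print(type_token_ratio(toy_lemmas))   # 6 types over 8 tokens
print(hapax_density_pct(toy_lemmas))  # 4 once-only lemmas over 8 tokens
```

Both numbers depend heavily on text length, which is why the notebook leans on MSTTR for cross-book comparison.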
import sys
sys.path.insert(0, '../../../src')
from bible_grammar import (
    msttr, book_style_profile, style_comparison,
    print_style_profile, print_style_comparison,
    style_radar_chart, style_heatmap,
)
import pandas as pd
1. Overview: Style Metrics
Each call below prints a single-book profile covering all the style metrics.
# Genesis
print_style_profile('Gen', lang='H')
# Psalms — polar opposite of narrative
print_style_profile('Psa', lang='H')
# Deuteronomy — legal/homiletical register
print_style_profile('Deu', lang='H')
2. Wayyiqtol Density: Narrative Fingerprint
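Wayyiqtol density is expected to be the strongest single discriminator among these metrics. The form carries the main storyline in narrative prose, so its density should be high there and low in poetry, law, and prophecy.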
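As a rough illustration of the metric itself, the density is the share of all tokens tagged as wayyiqtol, expressed as a percentage. The sketch below assumes one verb-form tag per token (with `None` for non-verbal tokens); the tag labels, the toy sequences, and the function name are hypothetical and not taken from the library.

```python
def form_density_pct(verb_forms, target="wayyiqtol"):
    """Share of tokens tagged with `target`, as a percentage of all tokens."""
    if not verb_forms:
        return 0.0
    hits = sum(1 for form in verb_forms if form == target)
    return round(100.0 * hits / len(verb_forms), 2)


# One tag per token; None marks non-verbal tokens. Both sequences are invented.
toy_narrative = ["wayyiqtol", None, None, "wayyiqtol", None, "qatal", None, None]
toy_poetry = ["yiqtol", None, None, "qatal", None, None, "yiqtol", None]

print(form_density_pct(toy_narrative))  # 2 of 8 tokens
print(form_density_pct(toy_poetry))     # no wayyiqtol tokens
```

Dividing by all tokens rather than by verbs only keeps the figure comparable with the other per-token metrics in the profile.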
# Torah narrative vs. law comparison
torah_books = ['Gen', 'Exo', 'Lev', 'Num', 'Deu']
print_style_comparison(torah_books, lang='H')
# Historical books — should all have high wayyiqtol density
hist_books = ['Jos', 'Jdg', 'Rut', '1Sa', '2Sa', '1Ki', '2Ki']
print_style_comparison(hist_books, lang='H')
# Wayyiqtol vs. vocabulary richness — narrative vs. poetry/wisdom
diverse = ['Gen', 'Psa', 'Pro', 'Job', 'Deu', 'Isa', 'Jon']
df = style_comparison(diverse, lang='H')
df[['total_tokens', 'msttr_1k', 'wayyiqtol_density_pct', 'hapax_density_pct']].sort_values(
    'wayyiqtol_density_pct', ascending=False
)
3. Vocabulary Richness: MSTTR and Hapax Density
MSTTR (mean segmental TTR) corrects for book length by averaging TTR over non-overlapping 1,000-token windows. Raw TTR falls as a text grows, so a short book such as Ruth scores high on TTR largely because of its size. Job and Ezekiel are known for rare vocabulary.
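The windowing idea can be sketched in plain Python. This is an illustrative version only: dropping the trailing partial window and falling back to plain TTR for texts shorter than one window are assumptions, and `msttr_sketch` is a hypothetical name that may not match how `bible_grammar` handles those edge cases.

```python
def msttr_sketch(tokens, window=1000):
    """Mean TTR over consecutive non-overlapping windows of `window` tokens.

    Assumptions: the trailing partial window is dropped, and a text shorter
    than one window falls back to its plain TTR.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not tokens:
        return 0.0
    n_full = len(tokens) // window
    if n_full == 0:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i * window:(i + 1) * window])) / window
        for i in range(n_full)
    ]
    return sum(ratios) / n_full


# Toy sequence with a tiny window so the arithmetic is easy to follow:
# windows [a, b, a], [c, d, d], [e, f, g] give TTRs 2/3, 2/3, 3/3
toy_tokens = ["a", "b", "a", "c", "d", "d", "e", "f", "g"]
print(msttr_sketch(toy_tokens, window=3))
```

Because every window has the same size, a long book and a short book are scored on equal footing, which raw TTR cannot do.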
# All major OT books — vocabulary richness ranking
all_books = [
    'Gen', 'Exo', 'Lev', 'Num', 'Deu', 'Jos', 'Jdg',
    'Rut', '1Sa', '2Sa', '1Ki', '2Ki', 'Job', 'Psa',
    'Pro', 'Ecc', 'Isa', 'Jer', 'Eze', 'Dan', 'Amo', 'Jon', 'Mal'
]
df_all = style_comparison(all_books, lang='H')
df_all[['total_tokens', 'ttr', 'msttr_1k', 'hapax_density_pct']].sort_values(
    'msttr_1k', ascending=False
).head(15)
# MSTTR for a single book at different window sizes
for window in [500, 1000, 2000]:
    val = msttr('Job', lang='H', window=window)
    print(f"Job MSTTR (window={window:>5}): {val}")
4. Particle Density: Discourse Connectivity
# Particle density across genres
genre_books = ['Gen', 'Deu', 'Psa', 'Pro', 'Job', 'Isa', 'Jer', 'Eze']
df = style_comparison(genre_books, lang='H')
df[['asher_per1k', 'particle_per1k', 'inf_construct_per1k']].sort_values(
    'particle_per1k', ascending=False
)
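The per-1k columns above are normalized rates: occurrences of a lemma, or of any lemma in a set, per 1,000 tokens. A minimal sketch of that normalization follows. The particle set contains only the three lemmas named in the metrics table, so whether the library counts exactly this set is an assumption, and `rate_per_1k` and the toy sequence are hypothetical.

```python
def rate_per_1k(lemmas, targets):
    """Occurrences of any lemma in `targets` per 1,000 tokens."""
    if not lemmas:
        return 0.0
    target_set = set(targets)
    hits = sum(1 for lemma in lemmas if lemma in target_set)
    return round(1000.0 * hits / len(lemmas), 2)


# Particles named in the metrics table; the library's full list may differ
KI, HINNEH, LAKEN = "כִּי", "הִנֵּה", "לָכֵן"
PARTICLES = {KI, HINNEH, LAKEN}

# Invented 10-token sequence; "x", "y", "z" stand in for other lemmas
toy_sequence = [KI, "x", "y", HINNEH, "z", "x", "y", "z", "x", "y"]
print(rate_per_1k(toy_sequence, PARTICLES))  # 2 hits in 10 tokens
```

Exact string matching on pointed Hebrew is sensitive to Unicode normalization, so real data should be normalized consistently before counting.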
5. Isaiah 1–39 vs. 40–66
This is the classic authorship question in OT studies. Proto-Isaiah (1–39) is set in 8th-century Judah, while Deutero-Isaiah (40–66) addresses the Babylonian exile. Do the stylometric metrics discriminate between the two sections?
# Full Isaiah profile vs. neighboring prophets
print_style_comparison(['Isa', 'Jer', 'Eze', 'Amo', 'Mic', 'Nah', 'Hab'], lang='H')
# Manually split Isaiah by chapter range using the raw token data
from bible_grammar._utils import load_ot_data
ot_df = load_ot_data()  # full OT token table, filtered to Isaiah below
isa_h = ot_df[(ot_df['book'] == 'Isa') & (ot_df['lang'] == 'H')]
isa1_39 = isa_h[isa_h['chapter'] <= 39]
isa40_66 = isa_h[isa_h['chapter'] >= 40]
def _profile_chunk(df, label):
    """Compute a reduced style profile for an arbitrary slice of tokens."""
    from bible_grammar.stylometrics import _compute_msttr, _hapax_density_pct
    total = len(df)
    lemmas = df['lemma'].tolist()
    return {
        'label': label,
        'tokens': total,
        'ttr': round(len(set(lemmas)) / total, 4),
        'msttr_1k': _compute_msttr(lemmas, 1000),
        'hapax_%': _hapax_density_pct(df),
        'wayyiqtol_%': round((df['type_'] == 'wayyiqtol').sum() / total * 100, 2),
        'asher_1k': round((df['lemma'] == 'אֲשֶׁר').sum() / total * 1000, 2),
    }

# One row per half of Isaiah, indexed by label for side-by-side reading
comparison = pd.DataFrame([
    _profile_chunk(isa1_39, 'Isa 1–39'),
    _profile_chunk(isa40_66, 'Isa 40–66'),
]).set_index('label')
comparison
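One way to turn "do they cluster separately?" into a number is to standardize each metric across a set of texts and measure the distance between profiles. The sketch below uses z-scores and Euclidean distance, a technique chosen here for illustration and not something the notebook or `bible_grammar` is known to use. The profile values are placeholders invented for the example, not measurements of any book, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(profiles):
    """Standardize each metric across texts so no single scale dominates."""
    if not profiles:
        return {}
    metrics = sorted(next(iter(profiles.values())))
    scaled = {name: {} for name in profiles}
    for m in metrics:
        values = [p[m] for p in profiles.values()]
        mu, sigma = mean(values), pstdev(values)
        for name, p in profiles.items():
            # A metric with no spread carries no information, so map it to 0
            scaled[name][m] = (p[m] - mu) / sigma if sigma else 0.0
    return scaled


def profile_distance(a, b):
    """Euclidean distance between two standardized profiles."""
    return sqrt(sum((a[m] - b[m]) ** 2 for m in a))


# Placeholder numbers, invented for illustration; not measurements of any book
toy_profiles = {
    "text_A": {"msttr_1k": 0.50, "wayyiqtol_%": 1.0, "asher_1k": 5.0},
    "text_B": {"msttr_1k": 0.52, "wayyiqtol_%": 1.2, "asher_1k": 5.5},
    "text_C": {"msttr_1k": 0.40, "wayyiqtol_%": 9.0, "asher_1k": 12.0},
}

z = zscore_columns(toy_profiles)
print(profile_distance(z["text_A"], z["text_B"]))
print(profile_distance(z["text_A"], z["text_C"]))
```

For the Isaiah question, the two halves would need to be standardized against a wider reference set such as the neighboring prophets, since z-scores over only two rows say nothing about how large a difference is.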
6. Pentateuch Style Comparison: Law vs. Narrative Sections
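Leviticus and Deuteronomy are primarily legal and homiletical, whereas Genesis, Exodus 1–18, and Numbers contain extended narrative. Do the metrics reflect that split? The comparison below works on whole books, so Exodus is profiled as a mix of both registers.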
torah_books = ['Gen', 'Exo', 'Lev', 'Num', 'Deu']
print_style_comparison(torah_books, lang='H')
style_radar_chart(torah_books, lang='H')
# Full OT genre heatmap
ot_sample = [
    'Gen', 'Deu', 'Jos', '1Sa', '2Ki',
    'Job', 'Psa', 'Pro', 'Ecc',
    'Isa', 'Jer', 'Eze', 'Amo', 'Jon'
]
style_heatmap(ot_sample, lang='H')
7. Ad-hoc Queries
# Minor prophets style comparison
minor_prophets = ['Hos', 'Joe', 'Amo', 'Oba', 'Jon', 'Mic', 'Nah', 'Hab', 'Zep', 'Hag', 'Zec', 'Mal']
df_mp = style_comparison(minor_prophets, lang='H')
df_mp[['total_tokens', 'msttr_1k', 'wayyiqtol_density_pct', 'particle_per1k']].sort_values(
    'msttr_1k', ascending=False
)
# Radar: wisdom literature
style_radar_chart(['Job', 'Psa', 'Pro', 'Ecc', 'Sol'], lang='H')