Hebrew OT Register and Style Analysis

Quantitative stylometric profiling of Biblical Hebrew books, using metrics that capture vocabulary richness, syntactic register, and morphological fingerprints.

Hebrew style metrics:

Metric	What it captures
TTR	Type-token ratio — raw vocabulary richness
MSTTR (1k window)	Mean segmental TTR — fair cross-length comparison
Hapax density %	Rare/unique vocabulary
Wayyiqtol density %	Narrative momentum (high in narrative, low in poetry/law)
Inf. construct/1k	Subordination and clause-chaining
אֲשֶׁר /1k	Relative clause density
Particle density/1k	Discourse connectivity (כִּי, הִנֵּה, לָכֵן, etc.)
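
The per-token rates in the table reduce to simple counting. A minimal sketch on a toy token list, assuming hapax density is measured as a percentage of all tokens and the per-1k metrics as hits per 1,000 tokens. The function names are hypothetical, and the package's own definitions may differ:

```python
from collections import Counter

def hapax_density_pct(tokens):
    """Percent of tokens whose form occurs exactly once in `tokens`."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return 100.0 * hapaxes / len(tokens)

def rate_per_1k(tokens, targets):
    """Occurrences of any item in `targets` per 1,000 tokens."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in targets)
    return 1000.0 * hits / len(tokens)

# Toy lemma sequence (placeholder letters, not real Hebrew data)
toy = ["a", "b", "a", "c", "d", "a", "b", "e", "a", "f"]
print(hapax_density_pct(toy))    # 4 hapaxes (c, d, e, f) out of 10 tokens
print(rate_per_1k(toy, {"a"}))   # 4 hits out of 10 tokens, scaled to 1,000
```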

Classic authorship question: Isaiah 1–39 vs. 40–66 — do they cluster separately?

References: Andersen & Forbes, The Computer Bible; Longacre, The Grammar of Discourse.

In [ ]:

# @title Colab setup (runs only on Google Colab)
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    import subprocess, os
    # Clone the repo so all source and data paths work
    if not os.path.isdir("/content/berean-bible-bots"):
        subprocess.run(
            ["git", "clone", "--depth", "1",
             "https://github.com/dnovick/berean-bible-bots.git",
             "/content/berean-bible-bots"],
            check=True,
        )
    os.chdir("/content/berean-bible-bots")
    sys.path.insert(0, "/content/berean-bible-bots/src")
    # Install Python dependencies
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "-r",
         "binder/requirements.txt"],
        check=True,
    )
    # Download processed data files (~295 MB, one-time)
    subprocess.run(["bash", "binder/postBuild"], check=True)
    print("Colab environment ready.")

In [ ]:

import sys
sys.path.insert(0, '../../../src')

from bible_grammar import (
    msttr, book_style_profile, style_comparison,
    print_style_profile, print_style_comparison,
    style_radar_chart, style_heatmap,
)
import pandas as pd

1. Overview: Style Metrics

Single-book profile showing all style metrics.

In [ ]:

# Genesis
print_style_profile('Gen', lang='H')

In [ ]:

# Psalms — polar opposite of narrative
print_style_profile('Psa', lang='H')

In [ ]:

# Deuteronomy — legal/homiletical register
print_style_profile('Deu', lang='H')

2. Wayyiqtol Density: Narrative Fingerprint

Among the metrics here, wayyiqtol density is usually the clearest single discriminator between narrative prose (high) and poetry, law, and prophecy (low).
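
As a rough illustration of the metric, the sketch below counts wayyiqtol forms as a share of all tokens (not only verbs), which mirrors the denominator used in the Isaiah cell in section 5. The tag list and the function name are made up for the example:

```python
def wayyiqtol_density_pct(verb_tags):
    """Percent of tokens tagged 'wayyiqtol'; non-verbs are given as None."""
    if not verb_tags:
        return 0.0
    hits = sum(1 for tag in verb_tags if tag == "wayyiqtol")
    return 100.0 * hits / len(verb_tags)

# One entry per token: a verb-form tag, or None for non-verbal tokens
toy_tags = ["wayyiqtol", None, None, "qatal", "wayyiqtol", None, "yiqtol", None]
print(wayyiqtol_density_pct(toy_tags))  # 2 wayyiqtol forms out of 8 tokens
```

Because the denominator is the full token count, the figure depends on how densely verbs occur at all, not only on which verb forms the text prefers.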

In [ ]:

# Torah narrative vs. law comparison
torah_books = ['Gen', 'Exo', 'Lev', 'Num', 'Deu']
print_style_comparison(torah_books, lang='H')

In [ ]:

# Historical books — should all have high wayyiqtol density
hist_books = ['Jos', 'Jdg', 'Rut', '1Sa', '2Sa', '1Ki', '2Ki']
print_style_comparison(hist_books, lang='H')

In [ ]:

# Wayyiqtol vs. vocabulary richness — narrative vs. poetry/wisdom
diverse = ['Gen', 'Psa', 'Pro', 'Job', 'Deu', 'Isa', 'Jon']
df = style_comparison(diverse, lang='H')
df[['total_tokens', 'msttr_1k', 'wayyiqtol_density_pct', 'hapax_density_pct']].sort_values(
    'wayyiqtol_density_pct', ascending=False
)

3. Vocabulary Richness: MSTTR and Hapax Density

MSTTR (mean segmental TTR) corrects for book length by averaging TTR over non-overlapping 1,000-token windows. Job and Ezekiel are known for rare vocabulary. Ruth has a high raw TTR largely because it is short: raw TTR falls as a text grows, which is the length effect MSTTR is meant to remove.
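
The windowing idea can be sketched in a few lines. This is an illustrative reimplementation, not the package's `msttr`. In particular, dropping the trailing partial window and falling back to plain TTR for texts shorter than one window are assumptions:

```python
def ttr(tokens):
    """Type-token ratio: distinct forms divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def msttr_sketch(tokens, window=1000):
    """Mean TTR over consecutive non-overlapping windows of `window` tokens."""
    full_windows = len(tokens) // window
    if full_windows == 0:
        # Text shorter than one window: fall back to plain TTR
        return ttr(tokens)
    scores = [
        ttr(tokens[i * window:(i + 1) * window])
        for i in range(full_windows)
    ]
    return sum(scores) / full_windows

# Toy text: the same four types repeated three times
toy = ["a", "b", "c", "d"] * 3
print(ttr(toy))                      # 4 types over 12 tokens
print(msttr_sketch(toy, window=4))   # every window has 4 types in 4 tokens
```

Repeating the same four types three times drives raw TTR down to 1/3 while every window still scores 1.0, so the windowed average is unaffected by the added length.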

In [ ]:

# Selected OT books ranked by vocabulary richness (top 15 by MSTTR)
all_books = [
    'Gen', 'Exo', 'Lev', 'Num', 'Deu', 'Jos', 'Jdg',
    'Rut', '1Sa', '2Sa', '1Ki', '2Ki', 'Job', 'Psa',
    'Pro', 'Ecc', 'Isa', 'Jer', 'Eze', 'Dan', 'Amo', 'Jon', 'Mal'
]
df_all = style_comparison(all_books, lang='H')
df_all[['total_tokens', 'ttr', 'msttr_1k', 'hapax_density_pct']].sort_values(
    'msttr_1k', ascending=False
).head(15)

In [ ]:

# MSTTR for a single book at different window sizes
for window in [500, 1000, 2000]:
    val = msttr('Job', lang='H', window=window)
    print(f"Job MSTTR (window={window:>5}): {val}")

4. Particle Density: Discourse Connectivity

In [ ]:

# Particle density across genres
genre_books = ['Gen', 'Deu', 'Psa', 'Pro', 'Job', 'Isa', 'Jer', 'Eze']
df = style_comparison(genre_books, lang='H')
df[['asher_per1k', 'particle_per1k', 'inf_construct_per1k']].sort_values(
    'particle_per1k', ascending=False
)

5. Isaiah 1–39 vs. 40–66

This is the classic authorship question in OT studies. Proto-Isaiah (1–39) is set in 8th-century Judah; Deutero-Isaiah (40–66) addresses the Babylonian exile. Do the stylometric metrics discriminate between them?

In [ ]:

# Full Isaiah profile vs. neighboring prophets
print_style_comparison(['Isa', 'Jer', 'Eze', 'Amo', 'Mic', 'Nah', 'Hab'], lang='H')

In [ ]:

# Split Isaiah by chapter range, working directly on the token-level data
from bible_grammar import load_syntax_ot as load_ot_data
# Private helpers (leading underscore): not part of the public API
from bible_grammar.discourse.stylometrics import _compute_msttr, _hapax_density_pct

ot_df = load_ot_data()
# Keep only the Hebrew tokens of Isaiah
isa_h = ot_df[(ot_df['book'] == 'Isa') & (ot_df['lang'] == 'H')]

isa1_39 = isa_h[isa_h['chapter'] <= 39]
isa40_66 = isa_h[isa_h['chapter'] >= 40]

def _profile_chunk(df, label):
    """Compute a reduced style profile for an arbitrary slice of tokens."""
    total = len(df)
    if total == 0:
        raise ValueError(f"No tokens found for {label}")
    lemmas = df['lemma'].tolist()
    return {
        'label': label,
        'tokens': total,
        'ttr': round(len(set(lemmas)) / total, 4),
        'msttr_1k': _compute_msttr(lemmas, 1000),
        'hapax_%': _hapax_density_pct(df),
        # Share of all tokens (not only verbs) that are wayyiqtol forms
        'wayyiqtol_%': round((df['type_'] == 'wayyiqtol').sum() / total * 100, 2),
        # Occurrences of the relative particle per 1,000 tokens
        'asher_1k': round((df['lemma'] == 'אֲשֶׁר').sum() / total * 1000, 2),
    }

comparison = pd.DataFrame([
    _profile_chunk(isa1_39, 'Isa 1–39'),
    _profile_chunk(isa40_66, 'Isa 40–66'),
]).set_index('label')
comparison
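
A two-row table shows differences, but whether they are large depends on how much the same metrics vary across other books. One hedged way to frame that is to standardize each metric over a reference set and measure the distance between two profiles. The sketch below uses invented placeholder numbers (not measurements from the corpus) and hypothetical names, and it is only one of several reasonable ways to ask whether two chunks "cluster separately":

```python
import math
from statistics import mean, pstdev

def zscore_columns(profiles):
    """Standardize each metric (column) across all profiles (rows)."""
    labels = list(profiles)
    columns = list(zip(*(profiles[k] for k in labels)))
    scaled = {k: [] for k in labels}
    for col in columns:
        mu, sd = mean(col), pstdev(col)
        for k, value in zip(labels, col):
            # A metric with no spread carries no information; map it to 0
            scaled[k].append((value - mu) / sd if sd else 0.0)
    return scaled

# Placeholder profiles: [msttr, wayyiqtol %]. Invented values, illustration only.
toy_profiles = {
    "chunk_A": [0.60, 2.0],
    "chunk_B": [0.62, 2.5],
    "narrative_ref": [0.70, 9.0],
}

z = zscore_columns(toy_profiles)
d_ab = math.dist(z["chunk_A"], z["chunk_B"])
d_a_ref = math.dist(z["chunk_A"], z["narrative_ref"])
print(f"A vs B: {d_ab:.2f}, A vs reference: {d_a_ref:.2f}")
```

Standardizing first keeps a metric on a large scale, such as a per-1,000 rate, from dominating one that lives between 0 and 1.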

6. Pentateuch Style Comparison: Law vs. Narrative Sections

Leviticus and Deuteronomy are primarily legal/homiletical; Genesis, Exodus 1–18, and large parts of Numbers are narrative. The comparison here profiles whole books, so Exodus and Numbers mix both registers. Do the metrics reflect that?

In [ ]:

torah_books = ['Gen', 'Exo', 'Lev', 'Num', 'Deu']
print_style_comparison(torah_books, lang='H')

In [ ]:

style_radar_chart(torah_books, lang='H')

In [ ]:

# Genre heatmap over a sample of OT books
ot_sample = [
    'Gen', 'Deu', 'Jos', '1Sa', '2Ki',
    'Job', 'Psa', 'Pro', 'Ecc',
    'Isa', 'Jer', 'Eze', 'Amo', 'Jon'
]
style_heatmap(ot_sample, lang='H')

7. Ad-hoc Queries

In [ ]:

# Minor prophets style comparison
minor_prophets = ['Hos', 'Joe', 'Amo', 'Oba', 'Jon', 'Mic', 'Nah', 'Hab', 'Zep', 'Hag', 'Zec', 'Mal']
df_mp = style_comparison(minor_prophets, lang='H')
df_mp[['total_tokens', 'msttr_1k', 'wayyiqtol_density_pct', 'particle_per1k']].sort_values(
    'msttr_1k', ascending=False
)

In [ ]:

# Radar: wisdom and poetic books
style_radar_chart(['Job', 'Psa', 'Pro', 'Ecc', 'Sol'], lang='H')