Data Pipeline — Ingestion, Database, Translations, and Alignment¶

This notebook documents the full pipeline that builds the data assets the rest of the project runs on.

A developer needs this when:

Setting up the project from scratch
Regenerating the SQLite/Parquet database after a source data update
Rebuilding the IBM Model 1 word alignment
Adding new translation texts

In [ ]:

Copied!





# @title Colab setup (runs only on Google Colab)
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    import subprocess, os
    # Clone the repo so all source and data paths work
    if not os.path.isdir("/content/berean-bible-bots"):
        subprocess.run(
            ["git", "clone", "--depth", "1",
             "https://github.com/dnovick/berean-bible-bots.git",
             "/content/berean-bible-bots"],
            check=True,
        )
    os.chdir("/content/berean-bible-bots")
    sys.path.insert(0, "/content/berean-bible-bots/src")
    # Install Python dependencies
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "-r",
         "binder/requirements.txt"],
        check=True,
    )
    # Download processed data files (~295 MB, one-time)
    subprocess.run(["bash", "binder/postBuild"], check=True)
    print("Colab environment ready.")
# @title Colab setup (runs only on Google Colab)
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    import subprocess, os
    # Clone the repo so all source and data paths work
    if not os.path.isdir("/content/berean-bible-bots"):
        subprocess.run(
            ["git", "clone", "--depth", "1",
             "https://github.com/dnovick/berean-bible-bots.git",
             "/content/berean-bible-bots"],
            check=True,
        )
    os.chdir("/content/berean-bible-bots")
    sys.path.insert(0, "/content/berean-bible-bots/src")
    # Install Python dependencies
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "-r",
         "binder/requirements.txt"],
        check=True,
    )
    # Download processed data files (~295 MB, one-time)
    subprocess.run(["bash", "binder/postBuild"], check=True)
    print("Colab environment ready.")

1. Overview — How the Data Pipeline Works¶

The pipeline has four stages:

Ingest (ingest.py) — Parse the STEPBible TAHOT (Hebrew OT) and TAGNT (Greek NT) TSV source files into pandas DataFrames. Each row is one word token with its morphology decoded into structured columns.
Persist (db.py) — Save those DataFrames as both SQLite (for ad-hoc queries) and Parquet (for fast bulk loading). All subsequent modules load from Parquet; SQLite is provided as a fallback and for external tools.
Translations (translations.py) — Load English (KJV) and Latin (Vulgate Clementine) verse text from scrollmapper JSON files into the translations table in the DB.
Alignment (ibm_align.py / alignment.py) — Build Hebrew word ↔ LXX Greek word translation probabilities. Two approaches are available: fast verse-level co-occurrence (alignment.py) and a slower but more precise IBM Model 1 EM algorithm (ibm_align.py).

STEPBible TSV files
      |
      v
  ingest.py  -->  db.py  -->  SQLite + Parquet
                                   |
                                   v
                          translations.py  -->  translations table
                                   |
                                   v
                     ibm_align.py  -->  word alignment cache
                                   |
                                   v
                       alignment.py  -->  translation_equivalents()

Running the full pipeline once is enough; subsequent calls load from cache. Re-run only when source data changes.

In [ ]:

Copied!

import sys
sys.path.insert(0, '../../src')
import sys
sys.path.insert(0, '../../src')

2. Ingestion — Parsing STEPBible Source Files¶

ingest.py reads the TAHOT and TAGNT TSV files distributed by STEPBible (Tyndale House Cambridge, CC BY 4.0). These are multi-column tab-delimited files where each row is one word token.

load_tahot() — Reads 4 TAHOT files (Genesis–Deuteronomy, Joshua–Esther, Job–Song of Songs, Isaiah–Malachi). Returns one row per Hebrew/Aramaic word with decoded morphology.
load_tagnt() — Reads 2 TAGNT files (Matthew–John, Acts–Revelation). Returns one row per Greek word with decoded morphology.
load_all() — Convenience wrapper: calls both and concatenates the result.

Key columns produced:

Column	Content
`book_id`	Three-letter book code (e.g. `Gen`, `Mat`)
`chapter`, `verse`, `word_num`	Position within the text
`word`	Surface form (Unicode)
`strongs`	Strong's number(s); TAHOT uses braces: `{H1234}`
`morph_code`	Raw grammar code (e.g. `HVqp3ms`, `V-AAI-3S`)
`part_of_speech`	Decoded POS (`Verb`, `Noun`, etc.)
`stem`	Hebrew binyan (`Qal`, `Niphal`, etc.) — OT only
`conjugation`	Hebrew conjugation (`Perfect`, `Imperfect`, etc.) — OT only
`tense`, `voice`, `mood`	Greek verbal categories — NT only
`translation`	Per-word gloss from the source file

These calls are slow (~60–90s combined) because they parse the full corpus from raw TSV. In normal use you call db.load() instead, which loads the pre-built Parquet.

In [ ]:

Copied!

from bible_grammar.core.ingest import load_tahot, load_tagnt, load_all
from bible_grammar.core.ingest import load_tahot, load_tagnt, load_all

In [ ]:

Copied!





# Load Hebrew OT only (slow — parses raw TSV)
# df_heb = load_tahot()
# print('TAHOT shape:', df_heb.shape)
# print('Columns:', list(df_heb.columns))
# df_heb.head(3)
# Load Hebrew OT only (slow — parses raw TSV)
# df_heb = load_tahot()
# print('TAHOT shape:', df_heb.shape)
# print('Columns:', list(df_heb.columns))
# df_heb.head(3)

In [ ]:

Copied!





# Load Greek NT only (slow — parses raw TSV)
# df_grk = load_tagnt()
# print('TAGNT shape:', df_grk.shape)
# print('Columns:', list(df_grk.columns))
# df_grk.head(3)
# Load Greek NT only (slow — parses raw TSV)
# df_grk = load_tagnt()
# print('TAGNT shape:', df_grk.shape)
# print('Columns:', list(df_grk.columns))
# df_grk.head(3)

In [ ]:

Copied!





# Load both corpora combined (very slow — full corpus parse)
# df_all = load_all()
# print('Combined shape:', df_all.shape)
# df_all['source'].value_counts()
# Load both corpora combined (very slow — full corpus parse)
# df_all = load_all()
# print('Combined shape:', df_all.shape)
# df_all['source'].value_counts()

When to re-run ingestion: Only when the STEPBible source TSV files are updated. After re-parsing, call db.save() to rebuild the Parquet/SQLite cache, then db.invalidate_cache() to clear the in-memory cache.

3. Database — Persisting and Loading¶

db.py provides the canonical load/save interface for all DataFrames in the project. Every analysis module calls db.load() rather than re-parsing TSV files.

Two storage backends:

Parquet (data/processed/words.parquet) — default for bulk loading; fast columnar reads. This is what load() uses when available.
SQLite (data/processed/bible_grammar.db) — used for ad-hoc SQL queries from external tools (e.g. DB Browser for SQLite). Also a fallback if Parquet is missing.

Both are written together by save(). You only need one call to populate both.

Module-level caching: db.py maintains a module-level cache for each DataFrame (_words_cache, _lxx_cache, _tr_cache). The first load() call in any session reads from disk; subsequent calls return the same object. Call invalidate_cache() after rebuilding the database so the next load() re-reads from disk.

Key paths (defined as module-level constants):

db.DB_PATH — SQLite database
db.PARQUET_PATH — main words Parquet
db.TRANSLATIONS_PARQUET — translations Parquet
db.LXX_PARQUET — LXX Parquet

In [ ]:

Copied!





from bible_grammar.core.db import load, load_lxx, load_translations, invalidate_cache, save, is_built
from bible_grammar import db as _db

print('Database already built:', is_built())
print('Parquet path:', _db.PARQUET_PATH)
print('SQLite path: ', _db.DB_PATH)
from bible_grammar.core.db import load, load_lxx, load_translations, invalidate_cache, save, is_built
from bible_grammar import db as _db

print('Database already built:', is_built())
print('Parquet path:', _db.PARQUET_PATH)
print('SQLite path: ', _db.DB_PATH)

In [ ]:

Copied!





# Load the main words DataFrame (TAHOT + TAGNT combined)
df = load()
print('Words DataFrame shape:', df.shape)
print()
df.info()
# Load the main words DataFrame (TAHOT + TAGNT combined)
df = load()
print('Words DataFrame shape:', df.shape)
print()
df.info()

In [ ]:

Copied!

# Source breakdown: TAHOT (Hebrew OT) vs TAGNT (Greek NT)
df['source'].value_counts()
# Source breakdown: TAHOT (Hebrew OT) vs TAGNT (Greek NT)
df['source'].value_counts()

In [ ]:

Copied!





# Load the LXX corpus (Septuagint)
df_lxx = load_lxx()
print('LXX DataFrame shape:', df_lxx.shape)
print('Columns:', list(df_lxx.columns))
df_lxx.head(3)
# Load the LXX corpus (Septuagint)
df_lxx = load_lxx()
print('LXX DataFrame shape:', df_lxx.shape)
print('Columns:', list(df_lxx.columns))
df_lxx.head(3)

In [ ]:

Copied!





# Load the translations table (KJV + Vulgate verse text)
df_tr = load_translations()
print('Translations shape:', df_tr.shape)
print('Columns:', list(df_tr.columns))
df_tr.sample(3)
# Load the translations table (KJV + Vulgate verse text)
df_tr = load_translations()
print('Translations shape:', df_tr.shape)
print('Columns:', list(df_tr.columns))
df_tr.sample(3)

In [ ]:

Copied!

# invalidate_cache() clears all in-memory caches.
# Call this after rebuilding the database so the next load() re-reads from disk.
# invalidate_cache()
# invalidate_cache() clears all in-memory caches.
# Call this after rebuilding the database so the next load() re-reads from disk.
# invalidate_cache()

In [ ]:

Copied!





# save() writes a words DataFrame to both Parquet and SQLite.
# Only needed after re-running ingestion. Example:
#
#   from bible_grammar.core.ingest import load_all
#   df_new = load_all()
#   save(df_new)
#   invalidate_cache()
# save() writes a words DataFrame to both Parquet and SQLite.
# Only needed after re-running ingestion. Example:
#
#   from bible_grammar.core.ingest import load_all
#   df_new = load_all()
#   save(df_new)
#   invalidate_cache()

4. Translations — Loading English/Latin Texts¶

translations.py loads verse-level text from scrollmapper bible_databases JSON files into the translations table. The module supports:

KJV (English) — scrollmapper-data/sources/en/KJV/KJV.json
Vulgate Clementine (Latin) — scrollmapper-data/sources/la/VulgClementine/VulgClementine.json

The returned DataFrame has columns: translation, language, book_id, chapter, verse, text.

Each row is one Bible verse. Deuterocanonical/apocryphal books present in the Vulgate JSON are silently skipped — only books in our canonical 66-book list are loaded.

After calling load_translations(), persist with db.save_translations() to write to Parquet and SQLite.

When to re-run: Only if the scrollmapper source JSON files are updated or if you add a new translation.

In [ ]:

Copied!





from bible_grammar.core.translations import load_translations as _load_translations_from_source

# load_translations() in db.py reads from the pre-built Parquet.
# _load_translations_from_source() in translations.py re-parses the JSON source files.
# Only call the latter when rebuilding.

df_tr = load_translations()  # from db.py — fast, cached
print('Shape:', df_tr.shape)
print('Translations present:', df_tr['translation'].unique().tolist())
print()
df_tr.head(4)
from bible_grammar.core.translations import load_translations as _load_translations_from_source

# load_translations() in db.py reads from the pre-built Parquet.
# _load_translations_from_source() in translations.py re-parses the JSON source files.
# Only call the latter when rebuilding.

df_tr = load_translations()  # from db.py — fast, cached
print('Shape:', df_tr.shape)
print('Translations present:', df_tr['translation'].unique().tolist())
print()
df_tr.head(4)

In [ ]:

Copied!

# Example: look up Genesis 1:1 in both translations
df_tr[(df_tr['book_id'] == 'Gen') & (df_tr['chapter'] == 1) & (df_tr['verse'] == 1)]
# Example: look up Genesis 1:1 in both translations
df_tr[(df_tr['book_id'] == 'Gen') & (df_tr['chapter'] == 1) & (df_tr['verse'] == 1)]

In [ ]:

Copied!





# To rebuild from source JSON:
#
#   from bible_grammar.core.translations import load_translations as load_from_json
#   from bible_grammar.core.db import save_translations, invalidate_cache
#   df_new = load_from_json()
#   save_translations(df_new)
#   invalidate_cache()
# To rebuild from source JSON:
#
#   from bible_grammar.core.translations import load_translations as load_from_json
#   from bible_grammar.core.db import save_translations, invalidate_cache
#   df_new = load_from_json()
#   save_translations(df_new)
#   invalidate_cache()

5. IBM Model 1 Word Alignment — Building Translation Probabilities¶

ibm_align.py trains a statistical word-level translation model on the parallel Hebrew MT / LXX Greek corpus (~23,000 verse pairs). It uses the EM (Expectation-Maximization) algorithm to estimate P(Greek word | Hebrew word) and P(Hebrew word | Greek word) independently, then applies the intersection heuristic: only alignments where both directions agree are kept. This gives high-precision alignments at the cost of some recall.

The output is saved to data/processed/word_alignment.parquet.

Output columns:

| Column | Content | |---|---| | book_id, chapter, verse | Verse location | | heb_word_num, heb_strongs, heb_word | Hebrew token | | heb_pos, heb_stem | Hebrew morphology | | lxx_word_num, lxx_strongs, lxx_lemma | LXX Greek token | | lxx_pos | LXX part of speech | | p_h2g | P(Greek | Hebrew) from forward model | | p_g2h | P(Hebrew | Greek) from reverse model |

Performance: Training takes approximately 30 seconds on a modern laptop with 5 EM iterations (the default).

Query functions (in ibm_align.py, not alignment.py):

translation_equivalents_w() — keyword-argument API: find LXX lemmas aligned to a Hebrew strongs/stem/pos
hebrew_sources_w() — reverse lookup: given a Greek lemma, find Hebrew sources

These are the higher-precision counterparts to translation_equivalents() and hebrew_sources() in alignment.py.

In [ ]:

Copied!

from bible_grammar.core.ibm_align import load_word_alignment, build_word_alignment
from bible_grammar.core.ibm_align import translation_equivalents_w, hebrew_sources_w
from bible_grammar.core.ibm_align import load_word_alignment, build_word_alignment
from bible_grammar.core.ibm_align import translation_equivalents_w, hebrew_sources_w

In [ ]:

Copied!





# Load the pre-built word alignment
df_align = load_word_alignment()
print('Word alignment shape:', df_align.shape)
print('Columns:', list(df_align.columns))
df_align.head(5)
# Load the pre-built word alignment
df_align = load_word_alignment()
print('Word alignment shape:', df_align.shape)
print('Columns:', list(df_align.columns))
df_align.head(5)

In [ ]:

Copied!





# Summary statistics on the alignment
print('Unique Hebrew strongs:', df_align['heb_strongs'].nunique())
print('Unique LXX strongs:   ', df_align['lxx_strongs'].nunique())
print('Books covered:        ', df_align['book_id'].nunique())
print()
print('p_h2g range:', df_align['p_h2g'].min(), '—', df_align['p_h2g'].max())
print('p_g2h range:', df_align['p_g2h'].min(), '—', df_align['p_g2h'].max())
# Summary statistics on the alignment
print('Unique Hebrew strongs:', df_align['heb_strongs'].nunique())
print('Unique LXX strongs:   ', df_align['lxx_strongs'].nunique())
print('Books covered:        ', df_align['book_id'].nunique())
print()
print('p_h2g range:', df_align['p_h2g'].min(), '—', df_align['p_h2g'].max())
print('p_g2h range:', df_align['p_g2h'].min(), '—', df_align['p_g2h'].max())

In [ ]:

Copied!

# Find LXX equivalents for H7965 (shalom) using word-level alignment
translation_equivalents_w(heb_strongs='H7965', top_n=10)
# Find LXX equivalents for H7965 (shalom) using word-level alignment
translation_equivalents_w(heb_strongs='H7965', top_n=10)

In [ ]:

Copied!

# Reverse lookup: what Hebrew roots map to G1515 (eirene)?
hebrew_sources_w(lxx_strongs='G1515', top_n=10)
# Reverse lookup: what Hebrew roots map to G1515 (eirene)?
hebrew_sources_w(lxx_strongs='G1515', top_n=10)

In [ ]:

Copied!

# To rebuild the alignment from scratch (~30 seconds):
# build_word_alignment()  # Uncomment to regenerate (~30s)
# To rebuild the alignment from scratch (~30 seconds):
# build_word_alignment()  # Uncomment to regenerate (~30s)

6. Verse-Level Alignment — Translation Equivalents¶

alignment.py provides a faster but less precise alternative to ibm_align.py. Rather than training a statistical model, it cross-joins every Hebrew word with every Greek word in the same verse to produce co-occurrence counts. This is computationally cheap and requires no training step, but it introduces noise in longer verses (a Greek word may co-occur with many Hebrew words in the same verse even if it only translates one of them).

When to use which:

	`alignment.py`	`ibm_align.py`
Speed	Instant (pre-built Parquet)	~30s to build
Precision	Moderate (verse-level noise)	High (word-level)
Best for	Quick exploration, high-frequency words	Serious equivalence analysis

Both modules use the same keyword-argument interface, making them interchangeable for most queries.

The alignment Parquet lives at data/processed/alignment.parquet. It is built by scripts/build_db.py alongside the main words Parquet.

Key functions in alignment.py:

translation_equivalents() — find LXX lemmas that co-occur with a Hebrew word/stem/conjugation
hebrew_sources() — reverse: given a Greek lemma, find which Hebrew roots it translates
build_alignment() — re-build the cross-join table from current TAHOT + LXX data
save_alignment() — build and persist to Parquet
load_alignment() — load the pre-built Parquet

In [ ]:

Copied!

from bible_grammar.core.alignment import translation_equivalents, hebrew_sources
from bible_grammar.core.alignment import translation_equivalents, hebrew_sources

In [ ]:

Copied!

# Top Greek equivalents for H7965 (shalom) — verse-level co-occurrence
translation_equivalents(heb_strongs='H7965', top_n=10)
# Top Greek equivalents for H7965 (shalom) — verse-level co-occurrence
translation_equivalents(heb_strongs='H7965', top_n=10)

In [ ]:

Copied!

# Top Greek equivalents for H2617 (hesed / steadfast love)
translation_equivalents(heb_strongs='H2617', top_n=10)
# Top Greek equivalents for H2617 (hesed / steadfast love)
translation_equivalents(heb_strongs='H2617', top_n=10)

In [ ]:

Copied!

# Restrict to Niphal stem — what Greek verbs render the Niphal of any root?
translation_equivalents(heb_stem='Niphal', heb_pos='Verb', top_n=10)
# Restrict to Niphal stem — what Greek verbs render the Niphal of any root?
translation_equivalents(heb_stem='Niphal', heb_pos='Verb', top_n=10)

In [ ]:

Copied!

# Reverse lookup: which Hebrew roots map to G1515 (eirene)?
hebrew_sources(lxx_strongs='G1515', top_n=10)
# Reverse lookup: which Hebrew roots map to G1515 (eirene)?
hebrew_sources(lxx_strongs='G1515', top_n=10)

In [ ]:

Copied!

# What Hebrew roots does the LXX use poieo (G4160) to translate?
hebrew_sources(lxx_strongs='G4160', top_n=10)
# What Hebrew roots does the LXX use poieo (G4160) to translate?
hebrew_sources(lxx_strongs='G4160', top_n=10)

In [ ]:

Copied!





# translation_equivalents() accepts several filters that can be combined:
#
#   translation_equivalents(
#       heb_strongs='H5414',      # natan — to give
#       heb_stem='Niphal',
#       book_group='prophets',
#       top_n=8,
#   )
# translation_equivalents() accepts several filters that can be combined:
#
#   translation_equivalents(
#       heb_strongs='H5414',      # natan — to give
#       heb_stem='Niphal',
#       book_group='prophets',
#       top_n=8,
#   )

7. Quick Reference¶

# ── ingest.py ─────────────────────────────────────────────────────────────────
from bible_grammar.ingest import load_tahot, load_tagnt, load_all
load_tahot()           # -> DataFrame: Hebrew OT word tokens (~425,000 rows)
load_tagnt()           # -> DataFrame: Greek NT word tokens (~138,000 rows)
load_all()             # -> DataFrame: TAHOT + TAGNT combined

# ── db.py ─────────────────────────────────────────────────────────────────────
from bible_grammar.db import (
    load, load_lxx, load_translations,
    save, save_translations, save_lxx,
    invalidate_cache, is_built,
    DB_PATH, PARQUET_PATH, TRANSLATIONS_PARQUET, LXX_PARQUET,
)
load()                 # -> DataFrame: TAHOT + TAGNT (cached)
load_lxx()             # -> DataFrame: LXX corpus (cached)
load_translations()    # -> DataFrame: KJV + Vulgate verse text (cached)
invalidate_cache()     # Clear all in-memory caches
is_built()             # -> bool: Parquet or SQLite exists
save(df)               # Write words DataFrame to Parquet + SQLite
save_translations(df)  # Write translations DataFrame
save_lxx(df)           # Write LXX DataFrame

# ── translations.py ───────────────────────────────────────────────────────────
from bible_grammar.translations import load_translations as load_from_json
load_from_json()       # -> DataFrame: parse KJV + Vulgate from scrollmapper JSON

# ── ibm_align.py ─────────────────────────────────────────────────────────────
from bible_grammar.ibm_align import (
    build_word_alignment, load_word_alignment,
    translation_equivalents_w, hebrew_sources_w,
)
load_word_alignment()  # -> DataFrame: pre-built IBM Model 1 alignment
build_word_alignment() # Train + save (~30s); n_iter=5, min_prob=0.1
translation_equivalents_w(heb_strongs=..., heb_stem=..., heb_pos=..., book=..., book_group=..., min_count=2, top_n=20)
hebrew_sources_w(lxx_strongs=..., lxx_lemma=..., book=..., min_count=2, top_n=20)

# ── alignment.py ─────────────────────────────────────────────────────────────
from bible_grammar.alignment import (
    translation_equivalents, hebrew_sources,
    build_alignment, save_alignment, load_alignment,
)
translation_equivalents(heb_strongs=..., heb_stem=..., heb_conjugation=...,
                        heb_pos=..., lxx_pos=..., book=..., book_group=...,
                        min_count=2, top_n=20)
hebrew_sources(lxx_lemma=..., lxx_strongs=..., heb_stem=..., heb_pos=...,
               book=..., min_count=2, top_n=20)
build_alignment()      # Build verse-level cross-join table
save_alignment()       # Build + save to Parquet
load_alignment()       # Load pre-built Parquet