Data Pipeline — Ingestion, Database, Translations, and Alignment¶
This notebook documents the full pipeline that builds the data assets the rest of the project runs on.
A developer needs this when:
- Setting up the project from scratch
- Regenerating the SQLite/Parquet database after a source data update
- Rebuilding the IBM Model 1 word alignment
- Adding new translation texts
1. Overview — How the Data Pipeline Works¶
The pipeline has four stages:
Ingest (
ingest.py) — Parse the STEPBible TAHOT (Hebrew OT) and TAGNT (Greek NT) TSV source files into pandas DataFrames. Each row is one word token with its morphology decoded into structured columns.Persist (
db.py) — Save those DataFrames as both SQLite (for ad-hoc queries) and Parquet (for fast bulk loading). All subsequent modules load from Parquet; SQLite is provided as a fallback and for external tools.Translations (
translations.py) — Load English (KJV) and Latin (Vulgate Clementine) verse text from scrollmapper JSON files into thetranslationstable in the DB.Alignment (
ibm_align.py/alignment.py) — Build Hebrew word ↔ LXX Greek word translation probabilities. Two approaches are available: fast verse-level co-occurrence (alignment.py) and a slower but more precise IBM Model 1 EM algorithm (ibm_align.py).
STEPBible TSV files
|
v
ingest.py --> db.py --> SQLite + Parquet
|
v
translations.py --> translations table
|
v
ibm_align.py --> word alignment cache
|
v
alignment.py --> translation_equivalents()
Running the full pipeline once is enough; subsequent calls load from cache. Re-run only when source data changes.
import sys
sys.path.insert(0, '../../src')
2. Ingestion — Parsing STEPBible Source Files¶
ingest.py reads the TAHOT and TAGNT TSV files distributed by STEPBible (Tyndale House Cambridge, CC BY 4.0). These are multi-column tab-delimited files where each row is one word token.
load_tahot()— Reads 4 TAHOT files (Genesis–Deuteronomy, Joshua–Esther, Job–Song of Songs, Isaiah–Malachi). Returns one row per Hebrew/Aramaic word with decoded morphology.load_tagnt()— Reads 2 TAGNT files (Matthew–John, Acts–Revelation). Returns one row per Greek word with decoded morphology.load_all()— Convenience wrapper: calls both and concatenates the result.
Key columns produced:
| Column | Content |
|---|---|
book_id |
Three-letter book code (e.g. Gen, Mat) |
chapter, verse, word_num |
Position within the text |
word |
Surface form (Unicode) |
strongs |
Strong's number(s); TAHOT uses braces: {H1234} |
morph_code |
Raw grammar code (e.g. HVqp3ms, V-AAI-3S) |
part_of_speech |
Decoded POS (Verb, Noun, etc.) |
stem |
Hebrew binyan (Qal, Niphal, etc.) — OT only |
conjugation |
Hebrew conjugation (Perfect, Imperfect, etc.) — OT only |
tense, voice, mood |
Greek verbal categories — NT only |
translation |
Per-word gloss from the source file |
These calls are slow (~60–90s combined) because they parse the full corpus from raw TSV. In normal use you call db.load() instead, which loads the pre-built Parquet.
from bible_grammar.ingest import load_tahot, load_tagnt, load_all
# Load Hebrew OT only (slow — parses raw TSV)
# df_heb = load_tahot()
# print('TAHOT shape:', df_heb.shape)
# print('Columns:', list(df_heb.columns))
# df_heb.head(3)
# Load Greek NT only (slow — parses raw TSV)
# df_grk = load_tagnt()
# print('TAGNT shape:', df_grk.shape)
# print('Columns:', list(df_grk.columns))
# df_grk.head(3)
# Load both corpora combined (very slow — full corpus parse)
# df_all = load_all()
# print('Combined shape:', df_all.shape)
# df_all['source'].value_counts()
When to re-run ingestion: Only when the STEPBible source TSV files are updated. After re-parsing, call db.save() to rebuild the Parquet/SQLite cache, then db.invalidate_cache() to clear the in-memory cache.
3. Database — Persisting and Loading¶
db.py provides the canonical load/save interface for all DataFrames in the project. Every analysis module calls db.load() rather than re-parsing TSV files.
Two storage backends:
- Parquet (
data/processed/words.parquet) — default for bulk loading; fast columnar reads. This is whatload()uses when available. - SQLite (
data/processed/bible_grammar.db) — used for ad-hoc SQL queries from external tools (e.g. DB Browser for SQLite). Also a fallback if Parquet is missing.
Both are written together by save(). You only need one call to populate both.
Module-level caching: db.py maintains a module-level cache for each DataFrame (_words_cache, _lxx_cache, _tr_cache). The first load() call in any session reads from disk; subsequent calls return the same object. Call invalidate_cache() after rebuilding the database so the next load() re-reads from disk.
Key paths (defined as module-level constants):
db.DB_PATH— SQLite databasedb.PARQUET_PATH— main words Parquetdb.TRANSLATIONS_PARQUET— translations Parquetdb.LXX_PARQUET— LXX Parquet
from bible_grammar.db import load, load_lxx, load_translations, invalidate_cache, save, is_built
from bible_grammar import db as _db
print('Database already built:', is_built())
print('Parquet path:', _db.PARQUET_PATH)
print('SQLite path: ', _db.DB_PATH)
# Load the main words DataFrame (TAHOT + TAGNT combined)
df = load()
print('Words DataFrame shape:', df.shape)
print()
df.info()
# Source breakdown: TAHOT (Hebrew OT) vs TAGNT (Greek NT)
df['source'].value_counts()
# Load the LXX corpus (Septuagint)
df_lxx = load_lxx()
print('LXX DataFrame shape:', df_lxx.shape)
print('Columns:', list(df_lxx.columns))
df_lxx.head(3)
# Load the translations table (KJV + Vulgate verse text)
df_tr = load_translations()
print('Translations shape:', df_tr.shape)
print('Columns:', list(df_tr.columns))
df_tr.sample(3)
# invalidate_cache() clears all in-memory caches.
# Call this after rebuilding the database so the next load() re-reads from disk.
# invalidate_cache()
# save() writes a words DataFrame to both Parquet and SQLite.
# Only needed after re-running ingestion. Example:
#
# from bible_grammar.ingest import load_all
# df_new = load_all()
# save(df_new)
# invalidate_cache()
4. Translations — Loading English/Latin Texts¶
translations.py loads verse-level text from scrollmapper bible_databases JSON files into the translations table. The module supports:
- KJV (English) —
scrollmapper-data/sources/en/KJV/KJV.json - Vulgate Clementine (Latin) —
scrollmapper-data/sources/la/VulgClementine/VulgClementine.json
The returned DataFrame has columns: translation, language, book_id, chapter, verse, text.
Each row is one Bible verse. Deuterocanonical/apocryphal books present in the Vulgate JSON are silently skipped — only books in our canonical 66-book list are loaded.
After calling load_translations(), persist with db.save_translations() to write to Parquet and SQLite.
When to re-run: Only if the scrollmapper source JSON files are updated or if you add a new translation.
from bible_grammar.translations import load_translations as _load_translations_from_source
# load_translations() in db.py reads from the pre-built Parquet.
# _load_translations_from_source() in translations.py re-parses the JSON source files.
# Only call the latter when rebuilding.
df_tr = load_translations() # from db.py — fast, cached
print('Shape:', df_tr.shape)
print('Translations present:', df_tr['translation'].unique().tolist())
print()
df_tr.head(4)
# Example: look up Genesis 1:1 in both translations
df_tr[(df_tr['book_id'] == 'Gen') & (df_tr['chapter'] == 1) & (df_tr['verse'] == 1)]
# To rebuild from source JSON:
#
# from bible_grammar.translations import load_translations as load_from_json
# from bible_grammar.db import save_translations, invalidate_cache
# df_new = load_from_json()
# save_translations(df_new)
# invalidate_cache()
5. IBM Model 1 Word Alignment — Building Translation Probabilities¶
ibm_align.py trains a statistical word-level translation model on the parallel Hebrew MT / LXX Greek corpus (~23,000 verse pairs). It uses the EM (Expectation-Maximization) algorithm to estimate P(Greek word | Hebrew word) and P(Hebrew word | Greek word) independently, then applies the intersection heuristic: only alignments where both directions agree are kept. This gives high-precision alignments at the cost of some recall.
The output is saved to data/processed/word_alignment.parquet.
Output columns:
| Column | Content |
|---|---|
| book_id, chapter, verse | Verse location |
| heb_word_num, heb_strongs, heb_word | Hebrew token |
| heb_pos, heb_stem | Hebrew morphology |
| lxx_word_num, lxx_strongs, lxx_lemma | LXX Greek token |
| lxx_pos | LXX part of speech |
| p_h2g | P(Greek | Hebrew) from forward model |
| p_g2h | P(Hebrew | Greek) from reverse model |
Performance: Training takes approximately 30 seconds on a modern laptop with 5 EM iterations (the default).
Query functions (in ibm_align.py, not alignment.py):
translation_equivalents_w()— keyword-argument API: find LXX lemmas aligned to a Hebrew strongs/stem/poshebrew_sources_w()— reverse lookup: given a Greek lemma, find Hebrew sources
These are the higher-precision counterparts to translation_equivalents() and hebrew_sources() in alignment.py.
from bible_grammar.ibm_align import load_word_alignment, build_word_alignment
from bible_grammar.ibm_align import translation_equivalents_w, hebrew_sources_w
# Load the pre-built word alignment
df_align = load_word_alignment()
print('Word alignment shape:', df_align.shape)
print('Columns:', list(df_align.columns))
df_align.head(5)
# Summary statistics on the alignment
print('Unique Hebrew strongs:', df_align['heb_strongs'].nunique())
print('Unique LXX strongs: ', df_align['lxx_strongs'].nunique())
print('Books covered: ', df_align['book_id'].nunique())
print()
print('p_h2g range:', df_align['p_h2g'].min(), '—', df_align['p_h2g'].max())
print('p_g2h range:', df_align['p_g2h'].min(), '—', df_align['p_g2h'].max())
# Find LXX equivalents for H7965 (shalom) using word-level alignment
translation_equivalents_w(heb_strongs='H7965', top_n=10)
# Reverse lookup: what Hebrew roots map to G1515 (eirene)?
hebrew_sources_w(lxx_strongs='G1515', top_n=10)
# To rebuild the alignment from scratch (~30 seconds):
# build_word_alignment() # Uncomment to regenerate (~30s)
6. Verse-Level Alignment — Translation Equivalents¶
alignment.py provides a faster but less precise alternative to ibm_align.py. Rather than training a statistical model, it cross-joins every Hebrew word with every Greek word in the same verse to produce co-occurrence counts. This is computationally cheap and requires no training step, but it introduces noise in longer verses (a Greek word may co-occur with many Hebrew words in the same verse even if it only translates one of them).
When to use which:
alignment.py |
ibm_align.py |
|
|---|---|---|
| Speed | Instant (pre-built Parquet) | ~30s to build |
| Precision | Moderate (verse-level noise) | High (word-level) |
| Best for | Quick exploration, high-frequency words | Serious equivalence analysis |
Both modules use the same keyword-argument interface, making them interchangeable for most queries.
The alignment Parquet lives at data/processed/alignment.parquet. It is built by scripts/build_db.py alongside the main words Parquet.
Key functions in alignment.py:
translation_equivalents()— find LXX lemmas that co-occur with a Hebrew word/stem/conjugationhebrew_sources()— reverse: given a Greek lemma, find which Hebrew roots it translatesbuild_alignment()— re-build the cross-join table from current TAHOT + LXX datasave_alignment()— build and persist to Parquetload_alignment()— load the pre-built Parquet
from bible_grammar.alignment import translation_equivalents, hebrew_sources
# Top Greek equivalents for H7965 (shalom) — verse-level co-occurrence
translation_equivalents(heb_strongs='H7965', top_n=10)
# Top Greek equivalents for H2617 (hesed / steadfast love)
translation_equivalents(heb_strongs='H2617', top_n=10)
# Restrict to Niphal stem — what Greek verbs render the Niphal of any root?
translation_equivalents(heb_stem='Niphal', heb_pos='Verb', top_n=10)
# Reverse lookup: which Hebrew roots map to G1515 (eirene)?
hebrew_sources(lxx_strongs='G1515', top_n=10)
# What Hebrew roots does the LXX use poieo (G4160) to translate?
hebrew_sources(lxx_strongs='G4160', top_n=10)
# translation_equivalents() accepts several filters that can be combined:
#
# translation_equivalents(
# heb_strongs='H5414', # natan — to give
# heb_stem='Niphal',
# book_group='prophets',
# top_n=8,
# )
7. Quick Reference¶
# ── ingest.py ─────────────────────────────────────────────────────────────────
from bible_grammar.ingest import load_tahot, load_tagnt, load_all
load_tahot() # -> DataFrame: Hebrew OT word tokens (~425,000 rows)
load_tagnt() # -> DataFrame: Greek NT word tokens (~138,000 rows)
load_all() # -> DataFrame: TAHOT + TAGNT combined
# ── db.py ─────────────────────────────────────────────────────────────────────
from bible_grammar.db import (
load, load_lxx, load_translations,
save, save_translations, save_lxx,
invalidate_cache, is_built,
DB_PATH, PARQUET_PATH, TRANSLATIONS_PARQUET, LXX_PARQUET,
)
load() # -> DataFrame: TAHOT + TAGNT (cached)
load_lxx() # -> DataFrame: LXX corpus (cached)
load_translations() # -> DataFrame: KJV + Vulgate verse text (cached)
invalidate_cache() # Clear all in-memory caches
is_built() # -> bool: Parquet or SQLite exists
save(df) # Write words DataFrame to Parquet + SQLite
save_translations(df) # Write translations DataFrame
save_lxx(df) # Write LXX DataFrame
# ── translations.py ───────────────────────────────────────────────────────────
from bible_grammar.translations import load_translations as load_from_json
load_from_json() # -> DataFrame: parse KJV + Vulgate from scrollmapper JSON
# ── ibm_align.py ─────────────────────────────────────────────────────────────
from bible_grammar.ibm_align import (
build_word_alignment, load_word_alignment,
translation_equivalents_w, hebrew_sources_w,
)
load_word_alignment() # -> DataFrame: pre-built IBM Model 1 alignment
build_word_alignment() # Train + save (~30s); n_iter=5, min_prob=0.1
translation_equivalents_w(heb_strongs=..., heb_stem=..., heb_pos=..., book=..., book_group=..., min_count=2, top_n=20)
hebrew_sources_w(lxx_strongs=..., lxx_lemma=..., book=..., min_count=2, top_n=20)
# ── alignment.py ─────────────────────────────────────────────────────────────
from bible_grammar.alignment import (
translation_equivalents, hebrew_sources,
build_alignment, save_alignment, load_alignment,
)
translation_equivalents(heb_strongs=..., heb_stem=..., heb_conjugation=...,
heb_pos=..., lxx_pos=..., book=..., book_group=...,
min_count=2, top_n=20)
hebrew_sources(lxx_lemma=..., lxx_strongs=..., heb_stem=..., heb_pos=...,
book=..., min_count=2, top_n=20)
build_alignment() # Build verse-level cross-join table
save_alignment() # Build + save to Parquet
load_alignment() # Load pre-built Parquet