Morphology Code Reference: Decoding STEPBible Grammar Tags

This notebook documents the morphology module, which decodes the raw STEPBible grammar codes in the morph_code column into structured Python dicts.

This is useful for developers who are:

Adding new analysis features that filter on specific morphological forms
Debugging unexpected values in the stem, conjugation, tense, etc. columns
Understanding why a particular word token has the POS or stem it does
Extending the decoder to handle edge cases or new code patterns

# @title Colab setup (runs only on Google Colab)
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    import subprocess, os
    # Clone the repo so all source and data paths work
    if not os.path.isdir("/content/berean-bible-bots"):
        subprocess.run(
            ["git", "clone", "--depth", "1",
             "https://github.com/dnovick/berean-bible-bots.git",
             "/content/berean-bible-bots"],
            check=True,
        )
    os.chdir("/content/berean-bible-bots")
    sys.path.insert(0, "/content/berean-bible-bots/src")
    # Install Python dependencies
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "-r",
         "binder/requirements.txt"],
        check=True,
    )
    # Download processed data files (~295 MB, one-time)
    subprocess.run(["bash", "binder/postBuild"], check=True)
    print("Colab environment ready.")

1. Overview

Why raw codes exist

The STEPBible TAHOT and TAGNT TSV files store morphology as compact grammar codes in a single cell. These codes are parsed at ingest time by morphology.py and expanded into the structured columns in the main DataFrame (part_of_speech, stem, conjugation, tense, voice, mood, etc.). The raw code is preserved in the morph_code column for auditability.

Code format examples

Hebrew: HVqp3ms  = Hebrew Verb Qal Perfect 3ms
        HNcmsa   = Hebrew Noun common masc sing absolute
        HVhp3ms  = Hebrew Verb Hiphil Perfect 3ms
        HVNp3ms  = Hebrew Verb Niphal Perfect 3ms
        HC/Td/Ncfsa  = prefix conjunction + prefix article + Noun (slash-joined)

Greek:  V-AAI-3S = Verb Aorist Active Indicative 3rd Singular
        N-NSF    = Noun Nominative Singular Feminine
        V-PAP-NSM = Verb Present Active Participle Nominative Singular Masculine
        ADV      = Adverb (uninflected)
        CONJ     = Conjunction (uninflected)
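
The two families can usually be told apart by shape alone. The sketch below is a heuristic for illustration only and is not part of morphology.py: `code_family` is a hypothetical name, and the set of uninflected Greek tags is limited to the ones mentioned in this notebook. In the main DataFrame the `source` column already identifies the corpus, so a check like this only matters for codes that arrive from somewhere else.

```python
# Hypothetical helper: classify a raw grammar code by its surface shape only.
GREEK_UNINFLECTED = {"ADV", "CONJ", "PREP", "PRT", "INJ", "COND"}


def code_family(code: str) -> str:
    """Return 'Greek', 'Hebrew', 'Aramaic', or 'Unknown' for a raw code."""
    code = code.strip()
    # Greek tags are either hyphen-separated or one of the uninflected tags.
    # This check must run first: 'ADV' and 'A-GSM' start with 'A',
    # the same letter that marks Aramaic in TAHOT codes.
    if "-" in code or code in GREEK_UNINFLECTED:
        return "Greek"
    if code.startswith("H"):
        return "Hebrew"
    if code.startswith("A"):
        return "Aramaic"
    return "Unknown"


for example in ["HVqp3ms", "HC/Td/Ncfsa", "AVbp3ms", "V-AAI-3S", "ADV"]:
    print(f"{example:<12} {code_family(example)}")
```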

When to use this module

In most cases you work with the already-decoded columns in the main DataFrame. You only need morphology.py directly when:

Decoding a raw code that came from outside the normal load path
Inspecting what a specific code in morph_code means
Writing tests for the decoder
Extending the decoder for a new code pattern

The decoder is also called internally by ingest.py during the initial parse.

import sys
sys.path.insert(0, '../../src')

2. Decoding Hebrew Morphology

decode_hebrew(morph_code) takes a raw TAHOT grammar code and returns a dict. It also handles slash-joined prefix chains such as HC/Td/Ncfsa: the main word is always the last (rightmost) token, and the prefixes are collected into the 'prefixes' key as a '+'-joined string. In a chain like this, only the first token carries the language letter.
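
The chain-splitting rule can be sketched in a few lines. This is an illustration of the rule as described here, not the code in morphology.py: `split_prefix_chain` is a hypothetical name, and it keeps the raw prefix codes, whereas the real 'prefixes' value may be formatted differently.

```python
def split_prefix_chain(morph_code: str) -> dict:
    """Split a slash-joined TAHOT code into language, prefixes and main word."""
    tokens = morph_code.split("/")
    # The language letter (H or A) appears once, on the first token.
    language = tokens[0][0]
    parts = [tokens[0][1:]] + tokens[1:]
    # Everything before the last token is a prefix; the last is the main word.
    *prefixes, main = parts
    return {
        "language": language,
        "prefixes": "+".join(prefixes),  # assumed format: raw codes joined by '+'
        "main": main,
    }


print(split_prefix_chain("HC/Td/Ncfsa"))
print(split_prefix_chain("HVqp3ms"))
```

A code without slashes yields an empty 'prefixes' string, so the same function covers both the chained and the single-token case.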

Hebrew code structure (single token):

Position 0: Language  H=Hebrew  A=Aramaic
Position 1: Function  V=Verb  N=Noun  A=Adjective  R=Preposition
                      C=Conjunction  D=Adverb  T=Particle  P=Pronoun
For Verb (HV...):
  Position 2: Stem     q=Qal  N=Niphal  p=Piel  P=Pual  h=Hiphil  H=Hophal  t=Hithpael
  Position 3: Form     p=Perfect  i=Imperfect  w=ConsecPerf  q=ConsecImpf
                       v=Imperative  r=Participle  c=InfConstruct  a=InfAbsolute
  Position 4: Person   1  2  3
  Position 5: Gender   m=Masc  f=Fem  c=Common
  Position 6: Number   s=Sing  p=Plural  d=Dual
For Noun (HN...):
  Position 2: Type     c=Common  p=Proper  g=Gentilic
  Position 3: Gender   m=Masc  f=Fem  c=Common
  Position 4: Number   s  p  d
  Position 5: State    a=Absolute  c=Construct  d=Definite
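
Read position by position, the layout above amounts to a series of table lookups. The following is a minimal sketch under that reading, not the decoder's actual implementation: `sketch_decode_token` and the 'language' key are assumptions, only verbs and nouns are covered, and forms whose tail differs from the layout above (for example participles that carry gender, number and state) are not handled.

```python
# Lookup tables copied from the position layout above.
LANGUAGE = {"H": "Hebrew", "A": "Aramaic"}
STEM = {"q": "Qal", "N": "Niphal", "p": "Piel", "P": "Pual",
        "h": "Hiphil", "H": "Hophal", "t": "Hithpael"}
FORM = {"p": "Perfect", "i": "Imperfect", "w": "Consecutive Perfect",
        "q": "Consecutive Imperfect", "v": "Imperative", "r": "Participle",
        "c": "Infinitive construct", "a": "Infinitive absolute"}
PERSON = {"1": "1st", "2": "2nd", "3": "3rd"}
GENDER = {"m": "Masculine", "f": "Feminine", "c": "Common"}
NUMBER = {"s": "Singular", "p": "Plural", "d": "Dual"}
NOUN_TYPE = {"c": "Common", "p": "Proper", "g": "Gentilic"}
STATE = {"a": "Absolute", "c": "Construct", "d": "Definite"}

# Field order after the function letter, per part of speech.
VERB_FIELDS = [("stem", STEM), ("conjugation", FORM), ("person", PERSON),
               ("gender", GENDER), ("number", NUMBER)]
NOUN_FIELDS = [("noun_type", NOUN_TYPE), ("gender", GENDER),
               ("number", NUMBER), ("state", STATE)]


def sketch_decode_token(code: str) -> dict:
    """Decode a single-token verb or noun code by position (illustrative)."""
    if len(code) < 2:
        raise ValueError(f"code too short: {code!r}")
    out = {"language": LANGUAGE.get(code[0], "Unknown")}
    if code[1] == "V":
        out["part_of_speech"], fields = "Verb", VERB_FIELDS
    elif code[1] == "N":
        out["part_of_speech"], fields = "Noun", NOUN_FIELDS
    else:
        raise ValueError(f"unsupported function letter in {code!r}")
    # zip stops at the shorter input, so short codes such as 'HVqc'
    # simply leave the trailing fields out of the result.
    for letter, (name, table) in zip(code[2:], fields):
        out[name] = table.get(letter, letter)  # keep unknown letters as-is
    return out


print(sketch_decode_token("HVqp3ms"))
print(sketch_decode_token("HNcmsa"))
```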

from bible_grammar.core.morphology import decode_hebrew

# Verb Qal Perfect 3ms
decode_hebrew('HVqp3ms')

# Noun common masc sing absolute
decode_hebrew('HNcmsa')

# Hiphil Perfect 3ms
decode_hebrew('HVhp3ms')

# Niphal Perfect 3ms
decode_hebrew('HVNp3ms')

# Several more codes to illustrate the full range
codes = [
    ('HVpi3ms', 'Piel Imperfect 3ms'),
    ('HVPp3ms', 'Pual Perfect 3ms'),
    ('HVtp3ms', 'Hithpael Perfect 3ms'),
    ('HVHi3ms', 'Hophal Imperfect 3ms'),
    ('HVqr',    'Qal Participle'),
    ('HVqc',    'Qal Infinitive Construct'),
    ('HVqa',    'Qal Infinitive Absolute'),
    ('HNcfsa',  'Noun common fem sing absolute'),
    ('HNpmsa',  'Noun proper masc sing absolute'),
    ('HAmfsa',  'Adjective masc sing absolute'),
]

import pandas as pd
rows = []
for code, label in codes:
    d = decode_hebrew(code)
    d['_code'] = code
    d['_label'] = label
    rows.append(d)

pd.DataFrame(rows).set_index('_code')[[
    '_label', 'part_of_speech', 'stem', 'conjugation', 'person', 'gender', 'number', 'state'
]]

# Slash-joined prefix chain: conjunction + article + noun
decode_hebrew('HC/Td/Ncfsa')

# Aramaic verb (prefix A instead of H)
decode_hebrew('AVbp3ms')  # Peal Perfect 3ms

3. Decoding Greek Morphology

decode_greek(grammar_field) takes the grammar portion of a TAGNT dStrongs=Grammar cell (i.e., everything after the =) and returns a dict.

Greek code structure:

Uninflected:  ADV  CONJ  PREP  PRT  INJ  COND

Nouns/Adj/Article:  <func>-<case><number><gender>
  N-NSF  = Noun Nominative Singular Feminine
  A-GSM  = Adjective Genitive Singular Masculine
  T-DSN  = Article Dative Singular Neuter

Verbs:  V-<tense><voice><mood>[-<person><number> | -<case><number><gender>]
  V-AAI-3S   = Aorist Active Indicative 3rd Singular
  V-PAP-NSM  = Present Active Participle Nominative Singular Masculine
  V-2AAI-3S  = 2nd Aorist Active Indicative 3rd Singular
  V-FPN      = Future Passive Infinitive

Pronouns:  P-{person or case/number/gender}
  P-1NS  = 1st person Nominative Singular
  P-GSM  = 3rd person Genitive Singular Masculine
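
The verb pattern can be unpacked by splitting on hyphens and reading the middle segment from the right, which keeps the two-character tenses (2A, 2X) intact. This is a sketch of that idea, not the body of decode_greek: `sketch_decode_greek_verb` is a hypothetical name, the tense, voice and mood tables are copied from the Quick Reference section of this notebook, and the Vocative entry is an assumption added from the standard Greek case inventory.

```python
TENSE = {"P": "Present", "I": "Imperfect", "F": "Future", "A": "Aorist",
         "X": "Perfect", "Y": "Pluperfect", "2A": "2nd Aorist", "2X": "2nd Perfect"}
VOICE = {"A": "Active", "M": "Middle", "P": "Passive",
         "E": "Middle or Passive", "D": "Deponent"}
MOOD = {"I": "Indicative", "S": "Subjunctive", "O": "Optative",
        "M": "Imperative", "N": "Infinitive", "P": "Participle"}
PERSON = {"1": "1st", "2": "2nd", "3": "3rd"}
NUMBER = {"S": "Singular", "P": "Plural"}
CASE = {"N": "Nominative", "G": "Genitive", "D": "Dative",
        "A": "Accusative", "V": "Vocative"}  # Vocative: assumption, see lead-in
GENDER = {"M": "Masculine", "F": "Feminine", "N": "Neuter"}


def sketch_decode_greek_verb(code: str) -> dict:
    """Decode a Greek verb tag such as 'V-2AAI-3S' (illustrative only)."""
    parts = code.split("-")
    if len(parts) < 2 or parts[0] != "V" or len(parts[1]) < 3:
        raise ValueError(f"not a verb tag: {code!r}")
    tvm = parts[1]
    # Mood and voice are the last two letters; whatever precedes is the tense.
    out = {
        "part_of_speech": "Verb",
        "tense": TENSE.get(tvm[:-2], tvm[:-2]),
        "voice": VOICE.get(tvm[-2], tvm[-2]),
        "mood": MOOD.get(tvm[-1], tvm[-1]),
    }
    if len(parts) > 2 and parts[2]:
        suffix = parts[2]
        if suffix[0].isdigit():
            # Finite verb: person + number
            out["person"] = PERSON.get(suffix[0], suffix[0])
            out["number"] = NUMBER.get(suffix[1:2], suffix[1:2])
        else:
            # Participle: case + number + gender
            out["case_"] = CASE.get(suffix[0], suffix[0])
            out["number"] = NUMBER.get(suffix[1:2], suffix[1:2])
            out["gender"] = GENDER.get(suffix[2:3], suffix[2:3])
    return out


for tag in ["V-AAI-3S", "V-2AAI-3S", "V-PAP-NSM", "V-FPN"]:
    print(tag, sketch_decode_greek_verb(tag))
```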

from bible_grammar.core.morphology import decode_greek

# Verb Aorist Active Indicative 3rd Singular
decode_greek('V-AAI-3S')

# Noun Nominative Singular Feminine
decode_greek('N-NSF')

# Verb Present Passive Infinitive (mood N = Infinitive; infinitives carry no person/number suffix)
decode_greek('V-PPN')

# Uninflected forms
print('ADV: ', decode_greek('ADV'))
print('CONJ:', decode_greek('CONJ'))
print('PREP:', decode_greek('PREP'))

# Full range of verb examples
verb_codes = [
    'V-PAI-3S',   # Present Active Indicative 3S
    'V-IAI-3S',   # Imperfect Active Indicative 3S
    'V-FAI-3S',   # Future Active Indicative 3S
    'V-AAI-3S',   # Aorist Active Indicative 3S
    'V-XAI-3S',   # Perfect Active Indicative 3S
    'V-2AAI-3S',  # 2nd Aorist Active Indicative 3S
    'V-API-3S',   # Aorist Passive Indicative 3S
    'V-PAS-3S',   # Present Active Subjunctive 3S
    'V-PAM-2S',   # Present Active Imperative 2S
    'V-PAN',      # Present Active Infinitive
    'V-PAP-NSM',  # Present Active Participle NSM
    'V-AAP-NSM',  # Aorist Active Participle NSM
]

rows = []
for code in verb_codes:
    d = decode_greek(code)
    d['_code'] = code
    rows.append(d)

pd.DataFrame(rows).set_index('_code')[[
    'part_of_speech', 'tense', 'voice', 'mood', 'person', 'number', 'case_', 'gender'
]]

# Noun and adjective examples
nominal_codes = [
    ('N-NSM', 'Noun Nominative Singular Masculine'),
    ('N-GSF', 'Noun Genitive Singular Feminine'),
    ('N-DPN', 'Noun Dative Plural Neuter'),
    ('N-ASM', 'Noun Accusative Singular Masculine'),
    ('A-NSM', 'Adjective Nominative Singular Masculine'),
    ('T-NSM', 'Article Nominative Singular Masculine'),
]

for code, label in nominal_codes:
    d = decode_greek(code)
    print(f"{code:<10} -> {label}")
    print(f"           {d}")

4. Extracting Greek Grammar from Raw Data

In the raw TAGNT TSV files, column 3 contains a combined dStrongs=Grammar cell (e.g. G3056=N-NSM). extract_greek_grammar() splits this into a (strongs, grammar_code) tuple.

This function is called internally by ingest.py during parsing, so you will not normally need to call it directly. It is documented here for completeness and for cases where you are working with raw TAGNT data outside the normal load path.
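
The split itself is a single partition on the first '='. The sketch below shows that step in isolation; `sketch_extract_greek_grammar` is a hypothetical stand-in, and raising ValueError when no '=' is present is an assumption, since the behaviour of extract_greek_grammar() on malformed cells is not documented here.

```python
def sketch_extract_greek_grammar(cell: str) -> tuple:
    """Split a 'dStrongs=Grammar' cell into (strongs, grammar_code)."""
    # partition splits on the first '=' only and always returns three parts.
    strongs, separator, grammar = cell.strip().partition("=")
    if not separator:
        # Assumed policy for malformed input; the real function may differ.
        raise ValueError(f"no '=' found in cell: {cell!r}")
    return strongs, grammar


print(sketch_extract_greek_grammar("G3056=N-NSM"))
print(sketch_extract_greek_grammar("G1722=PREP"))
```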

from bible_grammar.core.morphology import extract_greek_grammar

# Split a raw dStrongs=Grammar cell
examples = [
    'G3056=N-NSM',
    'G1722=PREP',
    'G0746=N-DSF',
    'G1510=V-IAI-3S',
]
for raw in examples:
    strongs, grammar = extract_greek_grammar(raw)
    morph = decode_greek(grammar)
    print(f"{raw:<20} -> strongs={strongs}  grammar={grammar}")
    print(f"{'':20}    decoded: {morph}")
    print()

5. Practical Use: Finding All Verbs with a Specific Morphology

The typical developer workflow is to use the pre-decoded columns in the main DataFrame rather than calling decode_hebrew() directly. However, the morph_code column lets you verify any individual token, and you can apply decode_hebrew() to the raw column if you need to filter on a field that the pre-parsed columns do not expose.

This section shows both approaches.

from bible_grammar.core.db import load

df = load()
ot = df[df['source'] == 'TAHOT'].copy()

# Approach 1 (preferred): use pre-decoded columns
# Find all Hiphil Imperfect 3ms forms in the OT
hiphil_impf_3ms = ot[
    (ot['part_of_speech'] == 'Verb') &
    (ot['stem'] == 'Hiphil') &
    (ot['conjugation'] == 'Imperfect') &
    (ot['person'] == '3rd') &
    (ot['gender'] == 'Masculine') &
    (ot['number'] == 'Singular')
]
print(f'Hiphil Imperfect 3ms: {len(hiphil_impf_3ms):,} tokens')
hiphil_impf_3ms[['book_id', 'chapter', 'verse', 'word', 'strongs', 'morph_code']].head(10)

# Verify a raw morph_code from the results
sample_code = hiphil_impf_3ms['morph_code'].iloc[0]
print('Raw morph_code:', sample_code)
print('Decoded:       ', decode_hebrew(sample_code))

# Approach 2: apply decode_hebrew to the raw morph_code column
# Useful when you need a field not stored in the pre-decoded columns
# (e.g., 'prefixes' — whether the word has a prefixed preposition or conjunction)

sample = ot[(ot['part_of_speech'] == 'Verb') & (ot['stem'] == 'Niphal')].head(200).copy()
sample['decoded'] = sample['morph_code'].apply(decode_hebrew)
sample['has_prefix'] = sample['decoded'].apply(lambda d: bool(d.get('prefixes', '')))

print('Niphal verbs with prefix:', sample['has_prefix'].sum())
sample[sample['has_prefix']][['word', 'morph_code', 'decoded']].head(5)
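
Because the same morph_code recurs across many tokens, decoding each distinct code once may save work when Approach 2 is applied to a larger slice. A minimal sketch follows. `cached_decoder` and `fake_decode` are hypothetical names, the codes are illustrative, and `fake_decode` only stands in for decode_hebrew so that the block runs without the package installed. Whether caching helps in practice depends on how costly the real decoder is, which has not been measured here.

```python
from functools import lru_cache


def cached_decoder(decode):
    """Wrap a code -> dict decoder so each distinct code is decoded once."""

    @lru_cache(maxsize=None)
    def _cached(code):
        return decode(code)

    def wrapper(code):
        # Return a shallow copy so callers cannot mutate the cached dict.
        return dict(_cached(code))

    return wrapper


calls = []  # records every time the underlying decoder actually runs


def fake_decode(code):
    """Stand-in decoder (assumption): flags a prefix when the code has a slash."""
    calls.append(code)
    return {"morph_code": code, "prefixes": "C" if "/" in code else ""}


decode_once = cached_decoder(fake_decode)
codes = ["HVNp3ms", "HC/VNp3ms", "HVNp3ms", "HVNp3ms"]
decoded = [decode_once(code) for code in codes]
print(len(codes), "lookups,", len(calls), "decoder calls")
```

With the real module, the wrapper would be passed to apply in place of the bare function, for example `sample['morph_code'].apply(cached_decoder(decode_hebrew))`.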

# Find all Piel Perfect 3ms in Genesis
piel_perf_gen = ot[
    (ot['book_id'] == 'Gen') &
    (ot['stem'] == 'Piel') &
    (ot['conjugation'] == 'Perfect') &
    (ot['person'] == '3rd') &
    (ot['gender'] == 'Masculine') &
    (ot['number'] == 'Singular')
][['chapter', 'verse', 'word', 'strongs', 'morph_code']]

print(f'Piel Perfect 3ms in Genesis: {len(piel_perf_gen)} tokens')
piel_perf_gen.head(10)

# NT: find all Aorist Passive Indicative 3rd Singular verbs
nt = df[df['source'] == 'TAGNT'].copy()

aor_pass = nt[
    (nt['part_of_speech'] == 'Verb') &
    (nt['tense'] == 'Aorist') &
    (nt['voice'] == 'Passive') &
    (nt['mood'] == 'Indicative') &
    (nt['person'] == '3rd') &
    (nt['number'] == 'Singular')
]
print(f'Aorist Passive Indicative 3S in NT: {len(aor_pass):,} tokens')
aor_pass[['book_id', 'chapter', 'verse', 'word', 'strongs', 'morph_code']].head(10)

6. Quick Reference

# ── morphology.py ─────────────────────────────────────────────────────────────
from bible_grammar.core.morphology import decode_hebrew, decode_greek, extract_greek_grammar

decode_hebrew('HVqp3ms')    # -> dict with part_of_speech, stem, conjugation, person, gender, number
decode_hebrew('HNcmsa')     # -> dict with part_of_speech, noun_type, gender, number, state
decode_hebrew('HC/Td/Ncfsa') # -> dict for the main word; prefixes stored in 'prefixes' key

decode_greek('V-AAI-3S')    # -> dict with part_of_speech, tense, voice, mood, person, number
decode_greek('N-NSF')       # -> dict with part_of_speech, case_, number, gender
decode_greek('ADV')         # -> {'part_of_speech': 'Adverb'}

extract_greek_grammar('G3056=N-NSM')  # -> ('G3056', 'N-NSM')

Hebrew Stem Codes

Code	Stem	Code	Stem
`q`	Qal	`H`	Hophal
`N`	Niphal	`t`	Hithpael
`p`	Piel	`o`	Polal
`P`	Pual	`b`	Peal (Aramaic)
`h`	Hiphil	`a`	Pael (Aramaic)
`v`	Haphel (Aramaic)	`A`	Aphel (Aramaic)

Hebrew Conjugation Codes

Code	Conjugation	Code	Conjugation
`p`	Perfect	`r`	Participle
`i`	Imperfect	`s`	Participle passive
`w`	Consecutive Perfect	`c`	Infinitive construct
`q`	Consecutive Imperfect	`a`	Infinitive absolute
`v`	Imperative	`h`	Cohortative
		`j`	Jussive

Greek Tense Codes

Code	Tense	Code	Tense
`P`	Present	`X`	Perfect
`I`	Imperfect	`Y`	Pluperfect
`F`	Future	`2A`	2nd Aorist
`A`	Aorist	`2X`	2nd Perfect

Greek Voice and Mood Codes

Voice Code	Meaning	Mood Code	Meaning
`A`	Active	`I`	Indicative
`M`	Middle	`S`	Subjunctive
`P`	Passive	`O`	Optative
`E`	Middle or Passive	`M`	Imperative
`D`	Deponent	`N`	Infinitive
		`P`	Participle