Morphology Code Reference: Decoding STEPBible Grammar Tags
This notebook documents the morphology module, which decodes the raw STEPBible grammar codes in the morph_code column into structured Python dicts.
This is useful for developers who are:
- Adding new analysis features that filter on specific morphological forms
- Debugging unexpected values in the stem, conjugation, tense, etc. columns
- Understanding why a particular word token has the POS or stem it does
- Extending the decoder to handle edge cases or new code patterns
1. Overview
Why raw codes exist
The STEPBible TAHOT and TAGNT TSV files store morphology as compact grammar codes in a single cell. These codes are parsed at ingest time by morphology.py and expanded into the structured columns in the main DataFrame (part_of_speech, stem, conjugation, tense, voice, mood, etc.). The raw code is preserved in the morph_code column for auditability.
Code format examples
Hebrew: HVqp3ms     = Hebrew Verb Qal Perfect 3ms
        HNcmsa      = Hebrew Noun common masc sing absolute
        HVhp3ms     = Hebrew Verb Hiphil Perfect 3ms
        HVNp3ms     = Hebrew Verb Niphal Perfect 3ms
        HC/Td/Ncfsa = prefix conjunction + prefix article + Noun (slash-joined)
Greek:  V-AAI-3S    = Verb Aorist Active Indicative 3rd Singular
        N-NSF       = Noun Nominative Singular Feminine
        V-PAP-NSM   = Verb Present Active Participle Nominative Singular Masculine
        ADV         = Adverb (uninflected)
        CONJ        = Conjunction (uninflected)
When to use this module
In most cases you work with the already-decoded columns in the main DataFrame. You only need morphology.py directly when:
- Decoding a raw code that came from outside the normal load path
- Inspecting what a specific code in morph_code means
- Writing tests for the decoder
- Extending the decoder for a new code pattern
The decoder is also called internally by ingest.py during the initial parse.
import sys
sys.path.insert(0, '../../src')
2. Decoding Hebrew Morphology
decode_hebrew(morph_code) takes a raw TAHOT grammar code and returns a dict. It handles slash-joined tokens (prefix chains like HC/Td/Ncfsa): the main word is always the last (rightmost) token; prefixes are collected into the 'prefixes' key as a +-joined string.
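The prefix handling described above can be sketched in a few lines. This is only an illustration of the splitting rule, not the code in morphology.py: `split_prefix_chain` is a hypothetical helper, and it returns the raw prefix tokens, whereas the real decoder may store decoded names in 'prefixes'.

```python
def split_prefix_chain(morph_code: str) -> dict:
    """Split a slash-joined grammar code into language, prefixes, and main token.

    Hypothetical helper for illustration; not part of morphology.py.
    """
    tokens = morph_code.split('/')
    # The language letter (H or A) appears once, at the start of the first token.
    language, tokens[0] = tokens[0][0], tokens[0][1:]
    # Everything before the last token is a prefix; the last token is the main word.
    *prefixes, main = tokens
    return {'language': language, 'prefixes': '+'.join(prefixes), 'main': main}


print(split_prefix_chain('HC/Td/Ncfsa'))
# {'language': 'H', 'prefixes': 'C+Td', 'main': 'Ncfsa'}
print(split_prefix_chain('HVqp3ms'))
# {'language': 'H', 'prefixes': '', 'main': 'Vqp3ms'}
```

A code with no slash yields an empty 'prefixes' string, which is why a truthiness check on that key is enough to tell whether a token carries a prefix.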
Hebrew code structure (single token):
  Position 0: Language   H=Hebrew  A=Aramaic
  Position 1: Function   V=Verb  N=Noun  A=Adjective  R=Preposition
                         C=Conjunction  D=Adverb  T=Particle  P=Pronoun
  For Verb (HV...):
    Position 2: Stem     q=Qal  N=Niphal  p=Piel  P=Pual  h=Hiphil  H=Hophal  t=Hithpael
    Position 3: Form     p=Perfect  i=Imperfect  w=ConsecPerf  q=ConsecImpf
                         v=Imperative  r=Participle  c=InfConstruct  a=InfAbsolute
    Position 4: Person   1  2  3
    Position 5: Gender   m=Masc  f=Fem  c=Common
    Position 6: Number   s=Sing  p=Plural  d=Dual
  For Noun (HN...):
    Position 2: Type     c=Common  p=Proper  g=Gentilic
    Position 3: Gender   m=Masc  f=Fem  c=Common
    Position 4: Number   s  p  d
    Position 5: State    a=Absolute  c=Construct  d=Definite
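To make the positional layout concrete, here is a minimal table-driven sketch that walks a single-token verb or noun code one character at a time. It is an assumption-laden illustration, not the implementation in morphology.py: `decode_single_token` and the lookup tables are hypothetical names, only the Verb and Noun layouts listed above are covered, and the output strings are guesses modelled on the values the main DataFrame uses ('3rd', 'Masculine', 'Singular').

```python
# Lookup tables transcribed from the positional layout (illustrative only).
STEMS = {'q': 'Qal', 'N': 'Niphal', 'p': 'Piel', 'P': 'Pual',
         'h': 'Hiphil', 'H': 'Hophal', 't': 'Hithpael'}
FORMS = {'p': 'Perfect', 'i': 'Imperfect', 'w': 'Consecutive Perfect',
         'q': 'Consecutive Imperfect', 'v': 'Imperative', 'r': 'Participle',
         'c': 'Infinitive construct', 'a': 'Infinitive absolute'}
PERSONS = {'1': '1st', '2': '2nd', '3': '3rd'}
GENDERS = {'m': 'Masculine', 'f': 'Feminine', 'c': 'Common'}
NUMBERS = {'s': 'Singular', 'p': 'Plural', 'd': 'Dual'}
NOUN_TYPES = {'c': 'Common', 'p': 'Proper', 'g': 'Gentilic'}
STATES = {'a': 'Absolute', 'c': 'Construct', 'd': 'Definite'}

# For each function letter: the part of speech and the ordered fields
# that occupy positions 2 onward.
LAYOUTS = {
    'V': ('Verb', [('stem', STEMS), ('conjugation', FORMS), ('person', PERSONS),
                   ('gender', GENDERS), ('number', NUMBERS)]),
    'N': ('Noun', [('noun_type', NOUN_TYPES), ('gender', GENDERS),
                   ('number', NUMBERS), ('state', STATES)]),
}


def decode_single_token(code: str) -> dict:
    """Decode one unprefixed code such as 'HVqp3ms' (hypothetical helper)."""
    pos_name, fields = LAYOUTS[code[1]]  # KeyError for functions not sketched here
    decoded = {'part_of_speech': pos_name}
    # zip stops at the shorter input, so short codes like 'HVqc' simply
    # leave the trailing fields out of the result.
    for char, (field, table) in zip(code[2:], fields):
        decoded[field] = table.get(char, char)  # unknown letters pass through raw
    return decoded


print(decode_single_token('HVqp3ms'))
print(decode_single_token('HNcmsa'))
```

Keeping the layout in data rather than in branching code means a new function letter only needs a new entry in `LAYOUTS`.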
from bible_grammar.morphology import decode_hebrew
# Verb Qal Perfect 3ms
decode_hebrew('HVqp3ms')
# Noun common masc sing absolute
decode_hebrew('HNcmsa')
# Hiphil Perfect 3ms
decode_hebrew('HVhp3ms')
# Niphal Perfect 3ms
decode_hebrew('HVNp3ms')
# Several more codes to illustrate the full range
codes = [
    ('HVpi3ms', 'Piel Imperfect 3ms'),
    ('HVPp3ms', 'Pual Perfect 3ms'),
    ('HVtp3ms', 'Hithpael Perfect 3ms'),
    ('HVHi3ms', 'Hophal Imperfect 3ms'),
    ('HVqr', 'Qal Participle'),
    ('HVqc', 'Qal Infinitive Construct'),
    ('HVqa', 'Qal Infinitive Absolute'),
    ('HNcfsa', 'Noun common fem sing absolute'),
    ('HNpmsa', 'Noun proper masc sing absolute'),
    ('HAmfsa', 'Adjective masc sing absolute'),
]
import pandas as pd
rows = []
for code, label in codes:
    d = decode_hebrew(code)
    d['_code'] = code
    d['_label'] = label
    rows.append(d)
pd.DataFrame(rows).set_index('_code')[[
    '_label', 'part_of_speech', 'stem', 'conjugation', 'person', 'gender', 'number', 'state'
]]
# Slash-joined prefix chain: conjunction + article + noun
decode_hebrew('HC/Td/Ncfsa')
# Aramaic verb (prefix A instead of H)
decode_hebrew('AVbp3ms') # Peal Perfect 3ms
3. Decoding Greek Morphology
decode_greek(grammar_field) takes the grammar portion of a TAGNT dStrongs=Grammar cell (i.e., everything after the =) and returns a dict.
Greek code structure:
  Uninflected: ADV  CONJ  PREP  PRT  INJ  COND
  Nouns/Adj/Article: <func>-<case><number><gender>
    N-NSF = Noun Nominative Singular Feminine
    A-GSM = Adjective Genitive Singular Masculine
    T-DSN = Article Dative Singular Neuter
  Verbs: V-<tense><voice><mood>[-<person><number> | -<case><number><gender>]
    V-AAI-3S  = Aorist Active Indicative 3rd Singular
    V-PAP-NSM = Present Active Participle Nominative Singular Masculine
    V-2AAI-3S = 2nd Aorist Active Indicative 3rd Singular
    V-FPN     = Future Passive Infinitive
  Pronouns: P-{person or case/number/gender}
    P-1NS = 1st person Nominative Singular
    P-GSM = 3rd person Genitive Singular Masculine
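The hyphen-delimited layout above lends itself to a small parser. The following is a sketch under stated assumptions rather than the decoder in morphology.py: `parse_greek_code` and its tables are hypothetical, the names for PREP, PRT, INJ and COND are my own expansions, the 'case_' key follows the column name used elsewhere in this notebook, and pronouns are deliberately left out.

```python
UNINFLECTED = {'ADV': 'Adverb', 'CONJ': 'Conjunction', 'PREP': 'Preposition',
               'PRT': 'Particle', 'INJ': 'Interjection', 'COND': 'Conditional'}
NOMINALS = {'N': 'Noun', 'A': 'Adjective', 'T': 'Article'}
TENSES = {'P': 'Present', 'I': 'Imperfect', 'F': 'Future', 'A': 'Aorist',
          'X': 'Perfect', 'Y': 'Pluperfect', '2A': '2nd Aorist', '2X': '2nd Perfect'}
VOICES = {'A': 'Active', 'M': 'Middle', 'P': 'Passive',
          'E': 'Middle or Passive', 'D': 'Deponent'}
MOODS = {'I': 'Indicative', 'S': 'Subjunctive', 'O': 'Optative',
         'M': 'Imperative', 'N': 'Infinitive', 'P': 'Participle'}
CASES = {'N': 'Nominative', 'G': 'Genitive', 'D': 'Dative',
         'A': 'Accusative', 'V': 'Vocative'}
NUMBERS = {'S': 'Singular', 'P': 'Plural'}
GENDERS = {'M': 'Masculine', 'F': 'Feminine', 'N': 'Neuter'}
PERSONS = {'1': '1st', '2': '2nd', '3': '3rd'}


def _case_number_gender(segment: str) -> dict:
    """Decode a three-letter case/number/gender segment such as 'NSM'."""
    case, number, gender = segment
    return {'case_': CASES[case], 'number': NUMBERS[number], 'gender': GENDERS[gender]}


def parse_greek_code(code: str) -> dict:
    """Parse an uninflected, nominal, or verb code (hypothetical helper)."""
    if code in UNINFLECTED:
        return {'part_of_speech': UNINFLECTED[code]}
    func, *segments = code.split('-')
    if func in NOMINALS:
        return {'part_of_speech': NOMINALS[func], **_case_number_gender(segments[0])}
    if func == 'V':
        tvm = segments[0]
        # Voice and mood are the last two letters; whatever precedes them is
        # the tense, which may be two characters long ('2A', '2X').
        decoded = {'part_of_speech': 'Verb', 'tense': TENSES[tvm[:-2]],
                   'voice': VOICES[tvm[-2]], 'mood': MOODS[tvm[-1]]}
        if len(segments) > 1:
            if tvm[-1] == 'P':  # participles carry case/number/gender
                decoded.update(_case_number_gender(segments[1]))
            else:               # finite verbs carry person/number
                person, number = segments[1]
                decoded.update(person=PERSONS[person], number=NUMBERS[number])
        return decoded
    raise ValueError(f'code not covered by this sketch: {code!r}')


for sample in ('V-AAI-3S', 'V-2AAI-3S', 'V-PAP-NSM', 'V-FPN', 'N-NSF', 'ADV'):
    print(f'{sample:<10} -> {parse_greek_code(sample)}')
```

Slicing the tense from the left of the voice and mood letters avoids a special case for the two-character tense codes.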
from bible_grammar.morphology import decode_greek
# Verb Aorist Active Indicative 3rd Singular
decode_greek('V-AAI-3S')
# Noun Nominative Singular Feminine
decode_greek('N-NSF')
# Verb Present Passive Infinitive (infinitives carry no person/number suffix)
decode_greek('V-PPN')
# Uninflected forms
print('ADV: ', decode_greek('ADV'))
print('CONJ:', decode_greek('CONJ'))
print('PREP:', decode_greek('PREP'))
# Full range of verb examples
verb_codes = [
    'V-PAI-3S',   # Present Active Indicative 3S
    'V-IAI-3S',   # Imperfect Active Indicative 3S
    'V-FAI-3S',   # Future Active Indicative 3S
    'V-AAI-3S',   # Aorist Active Indicative 3S
    'V-XAI-3S',   # Perfect Active Indicative 3S
    'V-2AAI-3S',  # 2nd Aorist Active Indicative 3S
    'V-API-3S',   # Aorist Passive Indicative 3S
    'V-PAS-3S',   # Present Active Subjunctive 3S
    'V-PAM-2S',   # Present Active Imperative 2S
    'V-PAN',      # Present Active Infinitive
    'V-PAP-NSM',  # Present Active Participle NSM
    'V-AAP-NSM',  # Aorist Active Participle NSM
]
rows = []
for code in verb_codes:
    d = decode_greek(code)
    d['_code'] = code
    rows.append(d)
pd.DataFrame(rows).set_index('_code')[[
    'part_of_speech', 'tense', 'voice', 'mood', 'person', 'number', 'case_', 'gender'
]]
# Noun and adjective examples
nominal_codes = [
    ('N-NSM', 'Noun Nominative Singular Masculine'),
    ('N-GSF', 'Noun Genitive Singular Feminine'),
    ('N-DPN', 'Noun Dative Plural Neuter'),
    ('N-ASM', 'Noun Accusative Singular Masculine'),
    ('A-NSM', 'Adjective Nominative Singular Masculine'),
    ('T-NSM', 'Article Nominative Singular Masculine'),
]
for code, label in nominal_codes:
    d = decode_greek(code)
    print(f"{code:<10} -> {label}")
    print(f"  {d}")
4. Extracting Greek Grammar from Raw Data
In the raw TAGNT TSV files, column 3 contains a combined dStrongs=Grammar cell (e.g. G3056=N-NSM). extract_greek_grammar() splits this into a (strongs, grammar_code) tuple.
This function is called internally by ingest.py during parsing, so you will not normally need to call it directly. It is documented here for completeness and for cases where you are working with raw TAGNT data outside the normal load path.
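For readers working outside the package, the split itself is easy to reproduce. The sketch below shows one plausible way to do it with `str.partition`; `split_strongs_grammar` is a hypothetical stand-in, and the real extract_greek_grammar() may treat malformed cells differently (here a missing `=` raises ValueError, which is my own choice).

```python
def split_strongs_grammar(cell: str) -> tuple:
    """Split a 'dStrongs=Grammar' cell into (strongs, grammar_code).

    Hypothetical stand-in for illustration; not the package function.
    """
    # partition splits on the first '=' only, leaving the grammar code intact.
    strongs, sep, grammar = cell.partition('=')
    if not sep:
        raise ValueError(f'no "=" separator in {cell!r}')
    return strongs.strip(), grammar.strip()


print(split_strongs_grammar('G3056=N-NSM'))     # ('G3056', 'N-NSM')
print(split_strongs_grammar('G1510=V-IAI-3S'))  # ('G1510', 'V-IAI-3S')
```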
from bible_grammar.morphology import extract_greek_grammar
# Split a raw dStrongs=Grammar cell
examples = [
    'G3056=N-NSM',
    'G1722=PREP',
    'G0746=N-DSF',
    'G1510=V-IAI-3S',
]
for raw in examples:
    strongs, grammar = extract_greek_grammar(raw)
    morph = decode_greek(grammar)
    print(f"{raw:<20} -> strongs={strongs} grammar={grammar}")
    print(f"{'':20} decoded: {morph}")
    print()
5. Practical Use: Finding All Verbs with a Specific Morphology
The typical developer workflow is to use the pre-decoded columns in the main DataFrame rather than calling decode_hebrew() directly. However, the morph_code column lets you verify any individual token, and you can apply decode_hebrew() to the raw column if you need to filter on a field that the pre-parsed columns do not expose.
This section shows both approaches.
from bible_grammar.db import load
df = load()
ot = df[df['source'] == 'TAHOT'].copy()
# Approach 1 (preferred): use pre-decoded columns
# Find all Hiphil Imperfect 3ms forms in the OT
hiphil_impf_3ms = ot[
    (ot['part_of_speech'] == 'Verb') &
    (ot['stem'] == 'Hiphil') &
    (ot['conjugation'] == 'Imperfect') &
    (ot['person'] == '3rd') &
    (ot['gender'] == 'Masculine') &
    (ot['number'] == 'Singular')
]
print(f'Hiphil Imperfect 3ms: {len(hiphil_impf_3ms):,} tokens')
hiphil_impf_3ms[['book_id', 'chapter', 'verse', 'word', 'strongs', 'morph_code']].head(10)
# Verify a raw morph_code from the results
sample_code = hiphil_impf_3ms['morph_code'].iloc[0]
print('Raw morph_code:', sample_code)
print('Decoded: ', decode_hebrew(sample_code))
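When checking many tokens, a small comparison helper can turn this spot check into a list of disagreements between the stored columns and a fresh decode. The sketch below is hypothetical (`find_mismatches` and `MORPH_FIELDS` are not part of the package) and uses hand-written dicts in place of a DataFrame row and a decode_hebrew() result so that it runs on its own.

```python
# Columns to compare; adjust to whichever decoded columns you care about.
MORPH_FIELDS = ('part_of_speech', 'stem', 'conjugation',
                'person', 'gender', 'number', 'state')


def find_mismatches(row: dict, decoded: dict, fields=MORPH_FIELDS) -> dict:
    """Return {field: (stored_value, decoded_value)} for every disagreement.

    Fields absent from the decoded dict are skipped, so a verb row is not
    flagged merely because it has no 'state'.
    """
    return {field: (row.get(field), decoded[field])
            for field in fields
            if field in decoded and row.get(field) != decoded[field]}


# Stand-ins for one DataFrame row and the decoder output for its morph_code.
stored = {'morph_code': 'HVhi3ms', 'part_of_speech': 'Verb', 'stem': 'Hiphil',
          'conjugation': 'Perfect', 'person': '3rd',
          'gender': 'Masculine', 'number': 'Singular'}
decoded = {'part_of_speech': 'Verb', 'stem': 'Hiphil', 'conjugation': 'Imperfect',
           'person': '3rd', 'gender': 'Masculine', 'number': 'Singular'}

print(find_mismatches(stored, decoded))
# {'conjugation': ('Perfect', 'Imperfect')}
```

With real data, the first argument would typically be `row.to_dict()` for a pandas row and the second the result of decoding that row's morph_code.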
# Approach 2: apply decode_hebrew to the raw morph_code column
# Useful when you need a field not stored in the pre-decoded columns
# (e.g., 'prefixes' — whether the word has a prefixed preposition or conjunction)
sample = ot[(ot['part_of_speech'] == 'Verb') & (ot['stem'] == 'Niphal')].head(200).copy()
sample['decoded'] = sample['morph_code'].apply(decode_hebrew)
sample['has_prefix'] = sample['decoded'].apply(lambda d: bool(d.get('prefixes', '')))
print('Niphal verbs with prefix:', sample['has_prefix'].sum())
sample[sample['has_prefix']][['word', 'morph_code', 'decoded']].head(5)
# Find all Piel Perfect 3ms in Genesis
piel_perf_gen = ot[
    (ot['book_id'] == 'Gen') &
    (ot['stem'] == 'Piel') &
    (ot['conjugation'] == 'Perfect') &
    (ot['person'] == '3rd') &
    (ot['gender'] == 'Masculine') &
    (ot['number'] == 'Singular')
][['chapter', 'verse', 'word', 'strongs', 'morph_code']]
print(f'Piel Perfect 3ms in Genesis: {len(piel_perf_gen)} tokens')
piel_perf_gen.head(10)
# NT: find all Aorist Passive Indicative 3rd Singular verbs
nt = df[df['source'] == 'TAGNT'].copy()
aor_pass = nt[
    (nt['part_of_speech'] == 'Verb') &
    (nt['tense'] == 'Aorist') &
    (nt['voice'] == 'Passive') &
    (nt['mood'] == 'Indicative') &
    (nt['person'] == '3rd') &
    (nt['number'] == 'Singular')
]
print(f'Aorist Passive Indicative 3S in NT: {len(aor_pass):,} tokens')
aor_pass[['book_id', 'chapter', 'verse', 'word', 'strongs', 'morph_code']].head(10)
6. Quick Reference
# ── morphology.py ─────────────────────────────────────────────────────────────
from bible_grammar.morphology import decode_hebrew, decode_greek, extract_greek_grammar
decode_hebrew('HVqp3ms') # -> dict with part_of_speech, stem, conjugation, person, gender, number
decode_hebrew('HNcmsa') # -> dict with part_of_speech, noun_type, gender, number, state
decode_hebrew('HC/Td/Ncfsa') # -> dict for the main word; prefixes stored in 'prefixes' key
decode_greek('V-AAI-3S') # -> dict with part_of_speech, tense, voice, mood, person, number
decode_greek('N-NSF') # -> dict with part_of_speech, case_, number, gender
decode_greek('ADV') # -> {'part_of_speech': 'Adverb'}
extract_greek_grammar('G3056=N-NSM') # -> ('G3056', 'N-NSM')
Hebrew Stem Codes
| Code | Stem | Code | Stem |
|---|---|---|---|
| q | Qal | H | Hophal |
| N | Niphal | t | Hithpael |
| p | Piel | o | Polal |
| P | Pual | b | Peal (Aramaic) |
| h | Hiphil | a | Pael (Aramaic) |
| v | Haphel (Aramaic) | A | Aphel (Aramaic) |
Hebrew Conjugation Codes
| Code | Conjugation | Code | Conjugation |
|---|---|---|---|
| p | Perfect | r | Participle |
| i | Imperfect | s | Participle passive |
| w | Consecutive Perfect | c | Infinitive construct |
| q | Consecutive Imperfect | a | Infinitive absolute |
| v | Imperative | h | Cohortative |
| j | Jussive | | |
Greek Tense Codes
| Code | Tense | Code | Tense |
|---|---|---|---|
| P | Present | X | Perfect |
| I | Imperfect | Y | Pluperfect |
| F | Future | 2A | 2nd Aorist |
| A | Aorist | 2X | 2nd Perfect |
Greek Voice and Mood Codes
| Voice Code | Meaning | Mood Code | Meaning |
|---|---|---|---|
| A | Active | I | Indicative |
| M | Middle | S | Subjunctive |
| P | Passive | O | Optative |
| E | Middle or Passive | M | Imperative |
| D | Deponent | N | Infinitive |
| | | P | Participle |