Collocations & Phrase Search — What Words Appear Near Each Other?¶
Two complementary analysis approaches, plus a relaxed variant of the second:
Collocation (statistical) asks: which words appear near a target word more often than chance would predict? Using Pointwise Mutual Information (PMI) and the G² (log-likelihood) statistic, we identify non-random word associations. High PMI means the pairing is distinctive to the target word; high G² means it is also statistically robust (not driven by rare words).
Phrase search (exact matching) asks: where does this specific word sequence occur within a verse? Positions can be Strong's numbers, lemmas, morphology constraints, or wildcards (None / '*'). Results include the surrounding KJV verse for context.
Proximity search relaxes the adjacency requirement: find verses where two words appear within N positions of each other, without requiring them to be consecutive.
import sys
sys.path.insert(0, '../../../src')
import pandas as pd
from bible_grammar.collocation import collocations, print_collocations
from bible_grammar.phrase import phrase_search, proximity_search, print_phrase_results, print_proximity_results
1. Collocations: רוּחַ (H7307, spirit/wind)¶
רוּחַ (ruach) is one of the most semantically rich words in the Hebrew Bible — it means wind, breath, and spirit. Its collocates reveal which contexts define its meaning:
- Co-occurrence with אֱלֹהִים (God) marks the divine creative spirit (Gen 1:2 "the Spirit of God hovered")
- Co-occurrence with רָעָה (evil) marks demonic/destructive spirit contexts
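- Co-occurrence with קֹדֶשׁ (holiness) marks the Spirit of Holiness (esp. Psalms, Isaiah)
PMI > 0 means the word pair occurs together more often than chance would predict. For G², a value above 10.83 corresponds to p < 0.001 on a chi-square distribution with one degree of freedom, so G² > 10 is a reasonable rule of thumb for significance.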
print_collocations('H7307', window=5, corpus='OT')
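To make the two scores concrete, here is a minimal sketch of how PMI and G² can be computed from raw counts. It assumes a plain 2×2 contingency table over `corpus_size` token slots. `association_scores` is a hypothetical helper, not part of `bible_grammar`, and the library may count window positions or handle edge cases differently, so its numbers need not match. The counts in the usage lines are invented for illustration.

```python
import math


def association_scores(co_count, target_count, collocate_count, corpus_size):
    """Return (expected, pmi, g2) for one target/collocate pair.

    Assumption: a simple 2x2 contingency table over `corpus_size` slots.
    """
    n = corpus_size
    # Expected co-occurrences if the two words were independent.
    expected = target_count * collocate_count / n
    # PMI on a log2 scale: how many times more often than chance.
    pmi = math.log2(co_count / expected) if co_count > 0 else float("-inf")

    # Observed cells: both words, target only, collocate only, neither.
    observed = [
        co_count,
        target_count - co_count,
        collocate_count - co_count,
        n - target_count - collocate_count + co_count,
    ]
    # Expected cells from the row and column totals.
    rows = (target_count, n - target_count)
    cols = (collocate_count, n - collocate_count)
    expected_cells = [r * c / n for r in rows for c in cols]

    # G2 = 2 * sum(O * ln(O / E)); empty cells contribute nothing.
    g2 = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected_cells) if o > 0
    )
    return expected, pmi, g2


# Invented counts, for illustration only (not taken from the corpus).
expected, pmi, g2 = association_scores(30, 400, 2600, 300_000)
print(f"expected={expected:.2f}  pmi={pmi:.2f}  g2={g2:.1f}")

# A pair that co-occurs exactly as often as chance predicts scores zero on both.
neutral = association_scores(1, 100, 100, 10_000)
```

PMI looks only at the ratio of observed to expected counts, so a pair seen once can score very high. G² weighs the whole table, which is why it is the safer column to sort by when counts are small.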
2. Collocations: חֶסֶד (H2617, steadfast love)¶
חֶסֶד (hesed) is one of the OT's key theological terms — often translated "steadfast love," "lovingkindness," or "covenant faithfulness." Its collocates are the closest thing we have to a biblical semantic field for this term. Expect co-occurrence with:
- אֱמֶת (truth/faithfulness) — the hendiadys חֶסֶד וֶאֱמֶת is a formulaic pair
- יְהוָה (YHWH) — hesed is preeminently YHWH's characteristic
- עוֹלָם (everlasting) — hesed is described as eternal (Ps 136)
- רַחֲמִים (compassion) — paired in covenant renewal passages
print_collocations('H2617', window=5, corpus='OT')
3. Collocations: λόγος (G3056, word)¶
In the Greek NT, λόγος (logos) ranges from ordinary speech to the divine Word of John's Prologue. Collocates mark the semantic fields in play: θεός (God), Ἰησοῦς (Jesus), εὐαγγέλιον (gospel), and vocabulary of proclamation and reception.
print_collocations('G3056', window=5, corpus='NT')
4. Collocations: πίστις (G4102, faith)¶
πίστις (pistis) is the central Pauline virtue. Its collocates reflect the Pauline theological cluster: χάρις (grace) — faith and grace are linked in the justification argument; ἔργα (works) — the Pauline antithesis; ἀγάπη (love) — faith working through love (Gal 5:6); δικαιοσύνη (righteousness) — righteousness through faith.
print_collocations('G4102', window=5, corpus='NT')
5. Collocations as DataFrame¶
The collocations() function returns a DataFrame for custom filtering, sorting, or visualization. Columns include:
- co_count: observed co-occurrence count
- expected: expected co-occurrence under independence
- pmi: Pointwise Mutual Information (log₂ scale)
- log_likelihood: G² statistic (higher = more significant)
df = collocations('H7307', window=5, corpus='OT')
print(f"Collocates found: {len(df)}")
df.head(20)
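As an example of the custom filtering mentioned above, the sketch below keeps only collocates that are both distinctive (PMI > 0) and statistically robust (G² above 10.83, i.e. p < 0.001 at one degree of freedom). To stay runnable without the corpus, it uses a small made-up frame with the same column names. The `S1`…`S4` identifiers and all values are placeholders, and the `co_count >= 5` floor is an assumed threshold, not a library default.

```python
import pandas as pd

# Toy stand-in for the frame returned by collocations(); values are invented.
toy = pd.DataFrame(
    {
        "strongs": ["S1", "S2", "S3", "S4"],
        "co_count": [40, 2, 15, 120],
        "pmi": [3.2, 6.1, -0.4, 0.9],
        "log_likelihood": [85.0, 7.5, 1.2, 30.4],
    }
)

# Keep pairs that are above chance, significant, and not based on a handful of hits.
mask = (toy["pmi"] > 0) & (toy["log_likelihood"] > 10.83) & (toy["co_count"] >= 5)
strong = toy[mask].sort_values("log_likelihood", ascending=False)

print(strong.to_string(index=False))
```

Note how `S2` has the highest PMI in the toy data but is dropped: with only two co-occurrences its G² is too low to trust. The same mask can be applied directly to the `df` from the cell above.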
6. Phrase Search: דְּבַר יְהוָה (word of the LORD)¶
"The word of the LORD" (דְּבַר יְהוָה / dabar YHWH) is one of the most frequent phrases in the OT prophets. It introduces many prophetic oracles and marks the authoritative divine speech event. The phrase is especially concentrated in Jeremiah and Ezekiel; the book-by-book breakdown below gives the counts.
results = phrase_search(['H1697', 'H3068'])
print(f"Total occurrences of dabar YHWH: {len(results)}")
print_phrase_results(results, max_rows=15)
# Book-by-book breakdown
if not results.empty:
    by_book = results.groupby('book_id').size().sort_values(ascending=False)
    print("Occurrences by book:")
    print(by_book.head(15).to_string())
7. Phrase Search: κύριος Ἰησοῦς (Lord Jesus)¶
"Lord Jesus" (κύριος Ἰησοῦς) is a confessional formula characteristic of Paul's letters, and it is also common in Acts. The combination of the divine title Kyrios with the human name Iesous encapsulates the early church's high Christology: Jesus is identified with the YHWH of the OT. The phrase is rare in the Gospels (where Kyrios is used mostly as respectful address) but frequent in Paul's letters.
results = phrase_search(['G2962', 'G2424'], corpus='NT')
print(f"Total occurrences of kyrios Iesous: {len(results)}")
print_phrase_results(results, max_rows=15)
# Book-by-book breakdown
if not results.empty:
    by_book = results.groupby('book_id').size().sort_values(ascending=False)
    print("Occurrences by book:")
    print(by_book.to_string())
8. Phrase Search with Wildcard¶
A wildcard (None or '*') matches any word. This is useful for finding two anchoring words with intervening material. Here we search for YHWH + [any word] + said (H0559) — a pattern matching speech introductions like "YHWH then said" or "YHWH again said."
Each wildcard occupies exactly one position. Repeat it (for example None, None) to allow two or more intervening words, which enables flexible pattern matching within a verse.
# YHWH + [any word] + said (one intervening word)
results = phrase_search(['H3068', None, 'H0559'])
print(f"YHWH ... said (1 intervening word): {len(results)}")
print_phrase_results(results, max_rows=10)
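The matching logic behind a pattern such as `['H3068', None, 'H0559']` can be sketched as a sliding window over a verse's tokens. This is a toy illustration, not the `bible_grammar` implementation: `position_matches` and `find_phrase` are hypothetical names, tokens are assumed to be plain dicts with a `strongs` key plus optional morphology keys, and the verse below is a placeholder rather than real corpus data.

```python
def position_matches(token, spec):
    """Check one token against one pattern position."""
    if spec is None or spec == "*":
        return True  # wildcard: any single word
    if isinstance(spec, dict):
        # Morphology constraint: every key must match the token's value.
        return all(token.get(key) == value for key, value in spec.items())
    return token.get("strongs") == spec  # plain Strong's number


def find_phrase(tokens, pattern):
    """Return the start index of every place the pattern matches consecutively."""
    width = len(pattern)
    hits = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(position_matches(t, s) for t, s in zip(window, pattern)):
            hits.append(start)
    return hits


# Placeholder verse: three tokens, the middle one with a dummy Strong's number.
toy_verse = [
    {"strongs": "H3068", "pos": "Noun"},
    {"strongs": "H9999", "pos": "Adverb"},
    {"strongs": "H0559", "pos": "Verb", "stem": "Qal"},
]

print(find_phrase(toy_verse, ["H3068", None, "H0559"]))  # one intervening word
print(find_phrase(toy_verse, ["H3068", "H0559"]))        # strictly adjacent
print(find_phrase(toy_verse, ["*", {"pos": "Verb", "stem": "Qal"}]))
```

Because each position consumes exactly one token, the adjacent pattern finds nothing in the toy verse while the one-wildcard pattern does.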
9. Proximity Search¶
Proximity search finds verses where two words appear within N words of each other, without requiring adjacency or a specific intervening structure. This is more flexible than phrase search.
Searching for רוּחַ (H7307, spirit) and אֱלֹהִים (H0430, God) within 5 words of each other — the "Spirit of God" construction in various forms:
results = proximity_search(['H7307', 'H0430'], within=5, corpus='OT')
print(f"ruach within 5 words of Elohim: {len(results)}")
print_proximity_results(results, max_rows=15)
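For comparison with phrase search, here is a minimal sketch of the proximity idea over a single verse. It assumes "within N" means a positional distance of at most N tokens, and that `ordered=True` requires the first word to come before the second. `proximity_hits` is a hypothetical helper; the library's exact distance convention may differ. The sequence below uses dummy filler tokens.

```python
def proximity_hits(strongs_seq, first, second, within, ordered=False):
    """Return (i, j) index pairs where `first` and `second` lie within `within` tokens."""
    first_positions = [i for i, s in enumerate(strongs_seq) if s == first]
    second_positions = [j for j, s in enumerate(strongs_seq) if s == second]
    hits = []
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and gap <= 0:
                continue  # second word must follow the first
            if 0 < abs(gap) <= within:
                hits.append((i, j))
    return hits


# Placeholder verse: two filler tokens between the words of interest.
toy_seq = ["H7307", "FILL1", "FILL2", "H0430"]

print(proximity_hits(toy_seq, "H7307", "H0430", within=5))
print(proximity_hits(toy_seq, "H7307", "H0430", within=2))
print(proximity_hits(toy_seq, "H0430", "H7307", within=5, ordered=True))
```

Unlike the phrase pattern, nothing here constrains what sits between the two words, only how far apart they are.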
10. Quick Reference¶
from bible_grammar.collocation import collocations, print_collocations
from bible_grammar.phrase import phrase_search, proximity_search, print_phrase_results, print_proximity_results
# ── Collocations ──────────────────────────────────────────────────────────────
# Terminal output (formatted)
print_collocations('H7307', window=5, corpus='OT')
print_collocations('G3056', window=5, corpus='NT')
print_collocations('H7307', book='Gen', min_count=2) # restrict to one book
# Return DataFrame
df = collocations('H7307', window=5, corpus='OT')
df = collocations('G4102', window=5, corpus='NT', top_n=30)
# Columns: strongs, lemma, gloss, co_count, target_count, collocate_count,
# corpus_size, expected, pmi, log_likelihood
# ── Phrase Search ─────────────────────────────────────────────────────────────
# Two consecutive Strong's numbers
phrase_search(['H1697', 'H3068']) # dabar YHWH (OT default)
phrase_search(['G2962', 'G2424'], corpus='NT') # kyrios Iesous
# Wildcard: None or '*' matches any word
phrase_search(['H3068', None, 'H0559']) # YHWH + any + said
phrase_search(['H3068', '*', '*', 'H0559']) # YHWH + any + any + said
# Morphology constraints
phrase_search([{'stem': 'Niphal', 'conjugation': 'Perfect'}, {'pos': 'Noun'}])
# Book restriction
phrase_search(['H1697', 'H3068'], book='Jer') # Jeremiah only
phrase_search(['G2962', 'G2424'], book_group='pauline') # Pauline only
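# Print results (pass the DataFrame returned by phrase_search, not the collocations df)
results = phrase_search(['H1697', 'H3068'])
print_phrase_results(results, max_rows=25)
print_phrase_results(results, show_strongs=True)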
# ── Proximity Search ──────────────────────────────────────────────────────────
# Two words within N words (any order)
proximity_search(['H7307', 'H0430'], within=5, corpus='OT') # ruach + Elohim
proximity_search(['G4102', 'G5485'], within=7, corpus='NT') # pistis + charis
# Ordered: first token must precede second
proximity_search(['H6944', 'H2617'], within=10, ordered=True)
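# Print results (pass the DataFrame returned by proximity_search)
results = proximity_search(['H7307', 'H0430'], within=5, corpus='OT')
print_proximity_results(results, max_rows=20)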