Dictionary Genealogy
First empirical findings from the lexicography research stream. Everything here is derived from the canonical sanhw1.txt master headword index (469,844 normalised SLP1 lemmas across 41 dictionaries).
This page is a companion to the Lexicography Roadmap, which lays out the broader research plan (Phases L0-L10, Papers M, L, H).
Trust Block
- Evidence: src/data/lexicographic-structure/dictionary_inventory.csv, sanhw1_inheritance_edges.csv, and sanhw1_distance_matrix.csv.
- Limitations: headword containment and distance show content overlap, not full microstructure inheritance or direct copying by themselves.
- Validation: source artifacts are compact committed research outputs; checked by npm run build and the linked lexicography roadmap.
- Owner repo: csl-atlas.
- Next use: treat the chart as structural evidence, then check companion docs before making a lineage claim.
The 41-dictionary CDSL inventory
The Cologne Digital Sanskrit Lexicon hosts dictionaries grouped into 7 families. Each row's sanhw1_lemmas is the empirical headword count from sanhw1.txt.
Lemma counts by dictionary
Observation: MW (1899) has 194,084 lemmas — by far the largest. Combined with PW (151k) + PWG (106k) the Petersburger family is the foundational corpus. PD (Encyclopedic 1976) is 105k.
Inheritance edges (top temporal-plausible)
These are dictionary pairs where ≥85% of one dict's lemmas appear in another, AND the source dict is older. Each edge is empirical evidence for inheritance:
Confirmed inheritance lines:
- WIL (1832) → SHS (1900): 95.3%, confirms Wilson is a direct ancestor of Shabda-Sagara
- WIL (1832) → YAT (1846): 92.6%, new finding: Yates derived from Wilson
- PWG (1855) → PW (1879): 93.8%, confirms PWG → PWK abridgement
- MW72 (1872) → MW (1899): 89.6%, Monier-Williams self-expansion
- CCS (1887) → CAE (1891): 94.0%, Cappeller German→English
- PWG (1855) → MW (1899): 89.3%, the German→English transmission line
- ARMH (1861) → MW (1899): 92.8%, Halāyudha's Abhidhānaratnamālā absorbed into MW
- ABCH (1896) → MW (1899): 92.5%, Abhidhānacintāmaṇi of Hemacandra absorbed too
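The edge criterion described in this section (at least 85% of the source's lemmas present in the target, and an older source) can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual generator: the function name `inheritance_edges`, the input layout (code mapped to year and lemma set), and the toy dictionaries `OLD`, `NEW`, and `OTHER` are all hypothetical.

```python
from itertools import permutations


def inheritance_edges(dicts: dict[str, tuple[int, set[str]]],
                      threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Return (source, target, containment) for every ordered pair that passes.

    `dicts` maps a dictionary code to (year, lemma set). This layout is an
    assumption made for the sketch, not the project's real data model.
    """
    edges = []
    for (src, (src_year, src_lemmas)), (dst, (dst_year, dst_lemmas)) in permutations(dicts.items(), 2):
        # The source must be strictly older and must have at least one lemma.
        if not src_lemmas or src_year >= dst_year:
            continue
        share = len(src_lemmas & dst_lemmas) / len(src_lemmas)
        if share >= threshold:
            edges.append((src, dst, share))
    # Strongest containment first.
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Toy data, invented purely for illustration (not CDSL counts).
toy = {
    "OLD": (1800, set("abcdefghij")),     # 10 lemmas
    "OTHER": (1850, {"a", "b", "x", "y"}),
    "NEW": (1900, set("abcdefghi") | {"k", "l"}),
}
edges = inheritance_edges(toy)
for src, dst, share in edges:
    print(f"{src} -> {dst}: {share:.1%}")
```

On the toy data only `OLD -> NEW` passes, because 9 of the 10 `OLD` lemmas reappear in `NEW`. As the Trust Block notes, such an edge shows content overlap and is a starting point for a lineage claim, not proof of copying.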
Heatmap: 41 × 41 lemma-distance matrix
Distance = 1 − Jaccard similarity. Darker = more similar. The dense block in the lower-right is the WIL/YAT/SHS + PWG/PW/MW/Cappeller core.
Sense-level structure (R2)
Beyond shared headwords, the archived R2 sense-alignment findings split entries into individual senses and align them across dictionaries by the Sanskrit material they share — SLP1 forms, <ls> citations, indigenous …0 sigla — with no translation. This aligns a German PWG sense to an English Apte sense, and a Western sense to an indigenous Vācaspatya one, through Sanskrit alone (the "anchor on Sanskrit" method). The current branch keeps this as archived evidence until the R2 generator package is restored or rebuilt.
Open the interactive sense-alignment explorer — pick a headword (dharma, rāma, …) and browse its senses across up to 13 dictionaries, with the Sanskrit-anchored cross-tradition alignments highlighted.
H1 — does sense granularity inflate over time? Measured over the full corpus of 11 general dictionaries (1822–1957): no. The year-trend is essentially flat (Pearson r = 0.06). Sense granularity is a lexicographic-family trait — Benfey/Apte enumerate ~2.5 sense-units per entry, Monier-Williams/Petersburg lump to ~1 — not a function of date, so Paper L treats it as a covariate to control for. See the H1 figure · R2_FINDINGS.md.
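For readers who want to reproduce this kind of year-trend check, the sketch below computes a Pearson correlation in plain Python. It assumes one (publication year, mean sense-units per entry) point per dictionary, which is a guess at the setup and not a description of the R2 pipeline. The function name `pearson_r` and the toy numbers are hypothetical placeholders, not the measured values.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder points, invented for illustration only (not the R2 data).
toy_years = [1800, 1850, 1900, 1950]
toy_senses_per_entry = [1.0, 2.0, 2.0, 1.0]
toy_r = pearson_r(toy_years, toy_senses_per_entry)
print(f"r = {toy_r:.2f}")
```

A value of r near zero, as reported for H1, means the date explains almost none of the variation in granularity, which fits the reading that family is the relevant grouping variable.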
What this means for the papers
- Paper M (methodology): the unified inheritance score (this lemma signal + convention fingerprints + forensic typos) recovers known CDSL lineage at >90% confidence on the strongest edges. Validates the framework.
- Paper L (linguistic): MW is the empirical convergence point of multiple independent dictionary traditions (German, English, Indian Skt-Skt). 89-94% of lemmas from each major source dict appear in MW.
- Paper H (historical): WIL → YAT → SHS (English popular tradition) is a 68-year transmission chain (1832 to 1900) visible in lemma data. PWG → PW → MW (German scholarly → English consolidation) is parallel.
Method
sanhw1.txt is the canonical CDSL master headword index, computed and maintained at hwnorm1/sanhw1 by the Cologne team. It applies headword normalisation per Patel 2016 so that variant spellings of the same lemma collapse.
For each pair of dictionaries (A, B):
- Jaccard distance = 1 − |A ∩ B| / |A ∪ B|
- Containment = |A ∩ B| / |A| → fraction of A's lemmas also in B
- Temporal plausibility = year(A) ≤ year(B) → A could be ancestor of B
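The three pairwise measures above translate directly into set operations. The sketch below is a minimal Python rendering that treats each dictionary as a set of normalised lemmas. The function names and the toy lemma sets are hypothetical, and the empty-set conventions are assumptions made for the sketch.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def containment(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A|: the fraction of A's lemmas that also occur in B."""
    if not a:
        return 0.0  # convention for this sketch only
    return len(a & b) / len(a)


def temporally_plausible(year_a: int, year_b: int) -> bool:
    """A can only be an ancestor of B if it is not younger than B."""
    return year_a <= year_b


# Toy lemma sets, invented for illustration.
older = {"agni", "deva", "Darma", "rAma"}
newer = {"agni", "deva", "Darma", "rAma", "yoga", "veda"}

print(jaccard_distance(older, newer))    # symmetric: same value in both directions
print(containment(older, newer))         # every lemma of `older` is in `newer`
print(containment(newer, older))         # only part of `newer` is in `older`
print(temporally_plausible(1832, 1900))
```

Containment is asymmetric while Jaccard distance is not. That is why containment, combined with the year test, gives the inheritance edges their direction, while the Jaccard distance is used for the symmetric heatmap and cladogram.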
The UPGMA cladogram built from the distance matrix is in data/sanhw1_cladogram.newick.
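For orientation, here is a compact pure-Python sketch of how UPGMA turns a symmetric distance matrix into a Newick string. It is an illustration of the standard algorithm under stated assumptions, not the script that produced the committed file: the function name `upgma_newick`, the four-decimal branch lengths, and the three-taxon toy matrix are all hypothetical.

```python
def upgma_newick(names: list[str], matrix: list[list[float]]) -> str:
    """Cluster with UPGMA (size-weighted average linkage) and return Newick text."""
    # cluster id -> (newick label, number of leaves, height of the cluster root)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)

    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        label_i, size_i, height_i = clusters.pop(i)
        label_j, size_j, height_j = clusters.pop(j)
        height = d_ij / 2
        label = (f"({label_i}:{height - height_i:.4f},"
                 f"{label_j}:{height - height_j:.4f})")

        # Distance from the merged cluster to each remaining cluster k is the
        # average of the two old distances, weighted by cluster size.
        merged = {}
        for k in clusters:
            d_ik = dist[tuple(sorted((i, k)))]
            d_jk = dist[tuple(sorted((j, k)))]
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)

        # Drop every entry that involved the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(merged)
        clusters[next_id] = (label, size_i + size_j, height)
        next_id += 1

    (newick, _, _), = clusters.values()
    return newick + ";"


# Toy 3 x 3 matrix, invented for illustration (not CDSL distances).
toy_names = ["A", "B", "C"]
toy_matrix = [
    [0.0, 0.2, 0.6],
    [0.2, 0.0, 0.6],
    [0.6, 0.6, 0.0],
]
toy_tree = upgma_newick(toy_names, toy_matrix)
print(toy_tree)
```

In the toy case `A` and `B` join first at height 0.1 and `C` attaches at height 0.3. UPGMA assumes roughly constant divergence rates, so the resulting tree is best read as a summary of lemma overlap, not as a dated genealogy.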