Dictionary Genealogy

First empirical findings from the lexicography research stream. Everything here is derived from the canonical sanhw1.txt master headword index (469,844 normalised SLP1 lemmas across 41 dictionaries).

This page is a companion to the Lexicography Roadmap, which lays out the broader research plan (Phases L0-L10, Papers M, L, H).

Trust Block

The 41-dictionary CDSL inventory

The Cologne Digital Sanskrit Lexicon hosts 41 dictionaries grouped into 7 families. Each row's sanhw1_lemmas value is the empirical headword count from sanhw1.txt.

Lemma counts by dictionary

Observation: MW (1899) has 194,084 lemmas, by far the largest count of any dictionary. Combined with PW (151k) and PWG (106k), this makes the Petersburger family the foundational corpus. PD (Encyclopedic, 1976) has 105k.

Inheritance edges (top temporal-plausible)

These are dictionary pairs in which at least 85% of one dictionary's lemmas appear in the other and the proposed source dictionary is the older of the two. Each edge is empirical evidence for inheritance:

Confirmed inheritance lines:

Heatmap: 41 × 41 lemma-distance matrix

Distance = 1 − Jaccard. Darker = more similar. The dense block in the lower-right is the WIL/YAT/SHS + PWG/PW/MW/Cappeller core.
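As a minimal sketch of the distance used for the heatmap, the snippet below computes 1 − Jaccard for two lemma sets. The function name and the toy lemma sets are illustrative assumptions, not part of the project's code or data, and treating two empty sets as identical is a convention chosen here.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two lemma sets."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy SLP1-style lemma sets, invented for illustration (not from sanhw1.txt).
dict_a = {"Darma", "rAma", "agni", "deva"}
dict_b = {"Darma", "rAma", "agni", "yoga"}

# Three shared lemmas out of five distinct ones.
distance = jaccard_distance(dict_a, dict_b)
print(f"{distance:.2f}")
```

Applying this to every pair of the 41 lemma sets would yield a symmetric 41 × 41 matrix with zeros on the diagonal.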

Sense-level structure (R2)

Beyond shared headwords, the archived R2 sense-alignment findings split entries into individual senses and align them across dictionaries by the Sanskrit material they share — SLP1 forms, <ls> citations, indigenous …0 sigla — with no translation. This aligns a German PWG sense to an English Apte sense, and a Western sense to an indigenous Vācaspatya one, through Sanskrit alone (the "anchor on Sanskrit" method). The current branch keeps this as archived evidence until the R2 generator package is restored or rebuilt.

Open the interactive sense-alignment explorer — pick a headword (dharma, rāma, …) and browse its senses across up to 13 dictionaries, with the Sanskrit-anchored cross-tradition alignments highlighted.

H1 — does sense granularity inflate over time? Measured over the full corpus of 11 general dictionaries (1822–1957): no. The year-trend is essentially flat (Pearson r = 0.06). Sense granularity is a lexicographic-family trait — Benfey/Apte enumerate ~2.5 sense-units per entry, Monier-Williams/Petersburg lump to ~1 — not a function of date, so Paper L treats it as a covariate to control for. See the H1 figure · R2_FINDINGS.md.
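For readers who want to see what the year-trend statistic measures, here is a hedged sketch of a Pearson correlation between publication year and mean sense-units per entry. The function and the input values are placeholders invented for illustration. They are not the corpus measurements and will not reproduce r = 0.06.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder inputs only: one (year, mean senses per entry) pair per dictionary.
years = [1800, 1850, 1900, 1950]
senses_per_entry = [1.0, 2.0, 1.0, 2.0]

r = pearson_r(years, senses_per_entry)
print(f"r = {r:.2f}")
```

A value of r near zero, as reported for H1, indicates no linear relationship between date and granularity.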

What this means for the papers

Method

sanhw1.txt is the canonical CDSL master headword index, computed and maintained at hwnorm1/sanhw1 by the Cologne team. It applies headword normalisation per Patel 2016 so that variant spellings of the same lemma collapse into a single normalised form.

For each pair of dictionaries (A, B):

A UPGMA cladogram built from the distance matrix is in data/sanhw1_cladogram.newick.
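The inheritance-edge test described on this page (at least 85% lemma containment, with the source dictionary older) can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: the function names, the dictionary codes OLD, MID and NEW, their years and their lemma sets are all hypothetical, and the direction of containment (the younger dictionary's lemmas found in the older one) is this sketch's reading of the criterion.

```python
from itertools import permutations

THRESHOLD = 0.85  # containment threshold stated in the text


def containment(child: set, parent: set) -> float:
    """Fraction of `child` lemmas that also occur in `parent`."""
    if not child:
        return 0.0
    return len(child & parent) / len(child)


def inheritance_edges(lemmas: dict, years: dict, threshold: float = THRESHOLD):
    """Return (source, target, share) triples where the source is strictly older
    and at least `threshold` of the target's lemmas occur in the source."""
    edges = []
    for source, target in permutations(lemmas, 2):
        if years[source] >= years[target]:
            continue  # temporally implausible: the source must be older
        share = containment(lemmas[target], lemmas[source])
        if share >= threshold:
            edges.append((source, target, share))
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Hypothetical dictionaries and lemma sets, invented for illustration.
lemmas = {
    "OLD": set("abcdefghij"),
    "MID": {"a", "x", "y"},
    "NEW": set("abcdefghi") | {"z"},
}
years = {"OLD": 1850, "MID": 1880, "NEW": 1900}

edges = inheritance_edges(lemmas, years)
print(edges)
```

Containment is directional, unlike the Jaccard distance, which is why the sketch iterates over ordered pairs rather than unordered ones.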
