Convention fingerprints & the formatting cladogram

How do the Cologne dictionaries relate by house style — the orthographic and citation conventions in which they render Sanskrit, independent of which words they contain? This page builds a phylogenetic tree of CDSL dictionaries from a convention fingerprint: 25 categorical dimensions, of which the seven canonical ones are Dhaval Patel's 2016 normalization conventions, populated from Patel's own per-dictionary classification (ground truth from the convention author). The remaining dimensions are auto-extracted from each dictionary's source markup.

Companion to Dictionary genealogy (which measures shared content via the sanhw1 headword index) and to the Phase L0 design / results. The headline finding is that convention-lineage and content-lineage are distinct signals — see the discussion below.

Trust Block

Evidence: L0 convention-fingerprint files under src/data/lexicographic-structure/L0/, Patel 2016 assignments, and validation reports.
Limitations: convention lineage measures house style and markup practice; it is separate from content inheritance.
Validation: checked by npm run build; L0 validation details are in docs/L0_RESULTS.md and docs/L0_DESIGN.md.
Owner repo: csl-atlas.
Next use: treat the chart as structural evidence, then check companion docs before making a lineage claim.

At a glance

The canonical convention cladogram

UPGMA on the rare-option-weighted Hamming distance over the 25 convention dimensions, with 1000× dimension-bootstrap consensus. Five readable clades emerge: the Petersburg formatting family (PWG/PW/SCH/CCS/CAE), the Latin/German etymological + MW group, the Anglo-Indian line (WIL/SHS/AP90/AP/MD), the indigenous + verb dicts (SKD/VCP/KRM), and a mixed/index cluster.

Bootstrap support for known lineage edges

Each bar is a documented inheritance edge; the value is the fraction of 1000 bootstrap trees in which the child falls among the parent's three nearest neighbours, with a 95% Wilson interval. Tier A = high-confidence (inventory + sanhw1 containment); tier B = scholarly hypothesis under test. The formatting edges (WIL→SHS, PWG→PW, PWG→SCH, CCS→CAE) score high; the content-only edges where the heir reformatted (PWG→MW, MW72→MW) score low — the core finding of this page.

The dashed line marks the 0.80 "strong edge" threshold (design §6.4).

Three algorithms agree on the strong edges (Phase L0-rigor)

The paper-final tree is checked against all three of the design's algorithms — UPGMA and Neighbour-Joining (500× character bootstrap) and a Bayesian Mk MCMC (2-state symmetric morphological model, Felsenstein pruning, NNI + branch-length Metropolis moves, 80k generations). Support = posterior/bootstrap probability that the pair sits in a shared clade of ≤ 4 leaves.

Two takeaways. The strong formatting edges — PWG→PW, PWG→SCH, WIL→SHS, AP90→AP — clear the bar under every algorithm (Bayesian Mk is the most decisive on the Petersburg core, 1.00). The reformatted edges stay low under every algorithm (MW72→MW ≤ 0.43, WIL→YAT ≈ 0): the convention ≠ content result is robust to the clustering method, not an artifact. Bayesian Mk, sensitive to shared derived characters, additionally surfaces PW→CCS (0.74) and the Bopp→MW hypothesis (0.65) that distance-bootstrap dilutes. Robinson–Foulds between the three point estimates: UPGMA–NJ 0.59, NJ–Bayes 0.45, UPGMA–Bayes 0.70 (data/L0/bayesian_report.json).

Patel 2016 convention taxonomy

The seven canonical conventions and the dictionaries Patel assigns to each option. Conventions are multi-valued — a dictionary may follow several options.

Per-dictionary convention fingerprint (Patel's 7)

Each cell is the +-joined option set the dictionary follows. Dicts sharing a row pattern share a house style — note how PWG / PW / SCH are identical and CCS differs by one.

Convention-distance heatmap

Pairwise rare-option-weighted Hamming distance over the 25 dimensions (0 = identical house style, darker = closer). The bright low-distance blocks are the formatting families.

Convention-lineage is not content-lineage

The pattern of which edges the convention tree recovers — and which it misses — is the result. Edges that score high are formatting inheritances, where an heir adopted its predecessor's house style: WIL→SHS (0.81), PWG→PW (0.79), PWG→SCH (0.70), CCS→CAE (0.64). Edges that score low are content inheritances where the heir reformatted: PWG→MW (0.02), MW72→MW (0.29) — even though, by the sanhw1 content measure, 89–94% of those sources' lemmas recur in MW.

Monier-Williams absorbed the Petersburg lexicon while imposing its own orthographic standard (ṛ-stems as -ṛ not -ar; śatṛ as -at not -ant; -vas not -vaṃs). The gap between near-unity content-containment and near-zero convention-similarity is the quantitative signature of a re-edited, re-typeset descendant. Content and convention are orthogonal axes of descent, and their divergence localises editorial intervention — the full argument is in Paper H §5.

The two axes, plotted (Phase L0.7)

Every point is a dictionary pair: content similarity (sanhw1 lemma Jaccard, x) against convention similarity (1 − L0 distance, y). Points on the diagonal inherit both axes equally; points below-right (high content, low convention) are the reformatting events — shared vocabulary rendered in a different house style.

Ranked reformatting events

Directed inheritance edges (sanhw1 containment ≥ 0.85), scored by residual = content containment − convention similarity. The top rows — everything into MW (Monier-Williams), and WIL→YAT — are the heavy editorial recodings; the bottom rows (SHS↔WIL, PWG→PW, CCS→CAE) inherited form as well as substance.

Method & caveats

Fingerprint: 7 Patel conventions (source = patel2016, from his per-dict classification) + 18 auto-extracted markup dimensions. English-headword dicts (BOR, AE) are excluded by Patel from the Sanskrit-headword conventions; LRV, FRI are not in Patel's 36 and remain partially gated; KNA/KOW/AMAR lack a local source.
Distance: rare-option-weighted Hamming, missing-aware, with cell-level Jaccard for multi-valued conventions. Encoding choice barely moves the tree (Robinson–Foulds ≈ 0.07 between Jaccard and Hamming UPGMA); algorithm choice matters more (UPGMA vs NJ ≈ 0.5).
Canonical tree: 1000× dimension-bootstrap consensus UPGMA; full Bayesian MCMC deferred (design §9). Config B_whamming is pre-registered, not tuned to recovery.
Reproduce: scripts/L0/s2_fingerprint.py → s2b_patel_auto.py → s2d_patel_gold.py → s3_cladogram.py. Taxonomy & numbering in refs/fingerprint_conventions.md + refs/concordance.md.

← back to Dictionary genealogy · overview