Convention fingerprints & the formatting cladogram
How do the Cologne dictionaries relate by house style — the orthographic and citation conventions in which they render Sanskrit, independent of which words they contain? This page builds a phylogenetic tree of CDSL dictionaries from a convention fingerprint: 25 categorical dimensions, of which the seven canonical ones are Dhaval Patel's 2016 normalization conventions, populated from Patel's own per-dictionary classification (ground truth from the convention author). The remaining dimensions are auto-extracted from each dictionary's source markup.
Companion to Dictionary genealogy (which measures shared content via the sanhw1 headword index) and to the Phase L0 design / results. The headline finding is that convention-lineage and content-lineage are distinct signals — see the discussion below.
Trust Block
- Evidence: L0 convention-fingerprint files under
src/data/lexicographic-structure/L0/, Patel 2016 assignments, and validation reports. - Limitations: convention lineage measures house style and markup practice; it is separate from content inheritance.
- Validation: checked by
npm run build; L0 validation details are indocs/L0_RESULTS.mdanddocs/L0_DESIGN.md. - Owner repo:
csl-atlas. - Next use: treat the chart as structural evidence, then check companion docs before making a lineage claim.
At a glance
The canonical convention cladogram
UPGMA on the rare-option-weighted Hamming distance over the 25 convention dimensions, with 1000× dimension-bootstrap consensus. Five readable clades emerge: the Petersburg formatting family (PWG/PW/SCH/CCS/CAE), the Latin/German etymological + MW group, the Anglo-Indian line (WIL/SHS/AP90/AP/MD), the indigenous + verb dicts (SKD/VCP/KRM), and a mixed/index cluster.
Bootstrap support for known lineage edges
Each bar is a documented inheritance edge; the value is the fraction of 1000 bootstrap trees in which the child falls among the parent's three nearest neighbours, with a 95% Wilson interval. Tier A = high-confidence (inventory + sanhw1 containment); tier B = scholarly hypothesis under test. The formatting edges (WIL→SHS, PWG→PW, PWG→SCH, CCS→CAE) score high; the content-only edges where the heir reformatted (PWG→MW, MW72→MW) score low — the core finding of this page.
The dashed line marks the 0.80 "strong edge" threshold (design §6.4).
Three algorithms agree on the strong edges (Phase L0-rigor)
The paper-final tree is checked against all three of the design's algorithms — UPGMA and Neighbour-Joining (500× character bootstrap) and a Bayesian Mk MCMC (2-state symmetric morphological model, Felsenstein pruning, NNI + branch-length Metropolis moves, 80k generations). Support = posterior/bootstrap probability that the pair sits in a shared clade of ≤ 4 leaves.
Two takeaways. The strong formatting edges — PWG→PW, PWG→SCH, WIL→SHS, AP90→AP — clear
the bar under every algorithm (Bayesian Mk is the most decisive on the Petersburg core,
1.00). The reformatted edges stay low under every algorithm (MW72→MW ≤ 0.43, WIL→YAT ≈ 0):
the convention ≠ content result is robust to the clustering method, not an artifact. Bayesian
Mk, sensitive to shared derived characters, additionally surfaces PW→CCS (0.74) and the
Bopp→MW hypothesis (0.65) that distance-bootstrap dilutes. Robinson–Foulds between the three
point estimates: UPGMA–NJ 0.59, NJ–Bayes 0.45, UPGMA–Bayes 0.70 (data/L0/bayesian_report.json).
Patel 2016 convention taxonomy
The seven canonical conventions and the dictionaries Patel assigns to each option. Conventions are multi-valued — a dictionary may follow several options.
Per-dictionary convention fingerprint (Patel's 7)
Each cell is the +-joined option set the dictionary follows. Dicts sharing a row pattern
share a house style — note how PWG / PW / SCH are identical and CCS differs by one.
Convention-distance heatmap
Pairwise rare-option-weighted Hamming distance over the 25 dimensions (0 = identical house style, darker = closer). The bright low-distance blocks are the formatting families.
Convention-lineage is not content-lineage
The pattern of which edges the convention tree recovers — and which it misses — is the result. Edges that score high are formatting inheritances, where an heir adopted its predecessor's house style: WIL→SHS (0.81), PWG→PW (0.79), PWG→SCH (0.70), CCS→CAE (0.64). Edges that score low are content inheritances where the heir reformatted: PWG→MW (0.02), MW72→MW (0.29) — even though, by the sanhw1 content measure, 89–94% of those sources' lemmas recur in MW.
Monier-Williams absorbed the Petersburg lexicon while imposing its own orthographic
standard (ṛ-stems as -ṛ not -ar; śatṛ as -at not -ant; -vas not -vaṃs). The
gap between near-unity content-containment and near-zero convention-similarity is the
quantitative signature of a re-edited, re-typeset descendant. Content and convention are
orthogonal axes of descent, and their divergence localises editorial intervention — the
full argument is in Paper H §5.
The two axes, plotted (Phase L0.7)
Every point is a dictionary pair: content similarity (sanhw1 lemma Jaccard, x) against convention similarity (1 − L0 distance, y). Points on the diagonal inherit both axes equally; points below-right (high content, low convention) are the reformatting events — shared vocabulary rendered in a different house style.
Ranked reformatting events
Directed inheritance edges (sanhw1 containment ≥ 0.85), scored by residual = content containment − convention similarity. The top rows — everything into MW (Monier-Williams), and WIL→YAT — are the heavy editorial recodings; the bottom rows (SHS↔WIL, PWG→PW, CCS→CAE) inherited form as well as substance.
Method & caveats
- Fingerprint: 7 Patel conventions (
source = patel2016, from his per-dict classification) + 18 auto-extracted markup dimensions. English-headword dicts (BOR, AE) are excluded by Patel from the Sanskrit-headword conventions; LRV, FRI are not in Patel's 36 and remain partially gated; KNA/KOW/AMAR lack a local source. - Distance: rare-option-weighted Hamming, missing-aware, with cell-level Jaccard for multi-valued conventions. Encoding choice barely moves the tree (Robinson–Foulds ≈ 0.07 between Jaccard and Hamming UPGMA); algorithm choice matters more (UPGMA vs NJ ≈ 0.5).
- Canonical tree: 1000× dimension-bootstrap consensus UPGMA; full Bayesian MCMC deferred (design §9). Config
B_whammingis pre-registered, not tuned to recovery. - Reproduce:
scripts/L0/s2_fingerprint.py→s2b_patel_auto.py→s2d_patel_gold.py→s3_cladogram.py. Taxonomy & numbering inrefs/fingerprint_conventions.md+refs/concordance.md.