Error typology of digital Sanskrit dictionaries
What kinds of errors are corrected in the Cologne Digital Sanskrit Lexicon, where
in the entry they occur, and how the profile changes over twelve years. Each of
the 50,953 correction events (2014–2026, 43 dictionaries) is normalized to IAST
and attributed to the dictionary microstructure component it repairs. See the
finding in reports/obs_t_typology.md and the design spec.
Summary figures: correction events, dictionaries, correctors, derived (not heuristic).
Two axes: location × edit-type
Each correction is described on two orthogonal axes: where in the entry it lands (the microstructure location) and what kind of change it is (the edit-type). Conflating the two was a pitfall; the typology keeps them separate.
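A minimal sketch of how the two axes can be kept separate in a data structure and only combined at tabulation time. The class name, field names, and label values are illustrative assumptions, not the project's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionEvent:
    dictionary: str  # dictionary code (illustrative)
    location: str    # Axis A: where in the entry the fix lands
    edit_type: str   # Axis B: what kind of change it is


def cross_tabulate(events):
    """Count events per (location, edit_type) cell."""
    return Counter((e.location, e.edit_type) for e in events)


def marginal(table, axis):
    """Sum the cross-tab along one axis: 0 = location, 1 = edit-type."""
    totals = Counter()
    for cell, n in table.items():
        totals[cell[axis]] += n
    return totals


# Hypothetical events, for illustration only.
events = [
    CorrectionEvent("D1", "sense", "substitution"),
    CorrectionEvent("D1", "headword", "substitution"),
    CorrectionEvent("D2", "sense", "insertion"),
]
table = cross_tabulate(events)
by_location = marginal(table, 0)
by_edit_type = marginal(table, 1)
```

Because each event carries both labels independently, either axis can be reported on its own without the other leaking into its categories.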
Axis A — location (derived labels)
Where in the entry the correction lands. In the git layer, location is attributed
positionally from the source XML tags; in the form layer, it is obtained by joining
to csl-orig on the headword. Results are reported on derived labels only: when the
join fails, the location is left unassigned rather than guessed.
Corrections concentrate in the sense (definition) and headword — the meaning-bearing fields.
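Positional attribution can be sketched as follows: scan the tags that open before the changed character offset, then report the innermost open element that maps to a location. The tag names (`entry`, `hw`, `sense`, `ab`) and the mapping table are hypothetical; the real source markup may differ.

```python
import re

# Matches opening, closing, and self-closing tags.
TAG = re.compile(r"<(/?)([A-Za-z]\w*)[^<>]*?(/?)>")

# Hypothetical mapping from source tags to location labels.
LOCATION_BY_TAG = {"hw": "headword", "sense": "sense"}


def open_tags(line, offset):
    """Return the stack of elements still open at character `offset`."""
    stack = []
    for m in TAG.finditer(line):
        if m.start() >= offset:
            break
        closing, name, self_closing = m.groups()
        if self_closing:
            continue
        if closing:
            if name in stack:
                # Pop up to and including the matching open tag.
                while stack and stack.pop() != name:
                    pass
        else:
            stack.append(name)
    return stack


def locate(line, offset):
    """Innermost mapped location at `offset`, or None (never guessed)."""
    for name in reversed(open_tags(line, offset)):
        if name in LOCATION_BY_TAG:
            return LOCATION_BY_TAG[name]
    return None


line = "<entry><hw>agni</hw><sense>fire <ab>m.</ab> the god of fire</sense></entry>"
```

For example, `locate(line, line.index("agni"))` falls inside `hw`, while an offset inside the nested `ab` element still resolves to the enclosing `sense`. An offset outside every mapped element returns `None`, in line with not guessing a location.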
Axis B — edit-type (all events)
What kind of change. Every category is a surface micro-edit; there is no "content rewrite" type — even corrections to definitions are small form fixes.
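One way to assign a surface edit-type to an old/new pair is to inspect the edit operations between the two strings. The sketch below uses Python's `difflib` and `unicodedata`; the category names (`diacritic`, `transposition`, `substitution`, `insertion`, `deletion`, `other`) are assumptions for illustration and need not match the categories charted here.

```python
import difflib
import unicodedata

SINGLE_OP_LABEL = {"insert": "insertion", "delete": "deletion"}


def strip_marks(s):
    """Drop combining marks so that e.g. 'ā' and 'a' compare equal."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def edit_ops(old, new):
    """Non-equal difflib opcodes as (tag, old_slice, new_slice)."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]


def is_adjacent_swap(old, new):
    """True if the strings differ only by swapping two neighbouring characters."""
    if len(old) != len(new):
        return False
    diff = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    return (len(diff) == 2 and diff[1] == diff[0] + 1
            and old[diff[0]] == new[diff[1]]
            and old[diff[1]] == new[diff[0]])


def classify(old, new):
    """Assign one surface edit-type label to an (old, new) pair."""
    old = unicodedata.normalize("NFC", old)
    new = unicodedata.normalize("NFC", new)
    if old == new:
        return "none"
    if strip_marks(old) == strip_marks(new):
        return "diacritic"
    if is_adjacent_swap(old, new):
        return "transposition"
    ops = edit_ops(old, new)
    if len(ops) == 1:
        tag, a, b = ops[0]
        if tag == "replace" and len(a) == len(b) == 1:
            return "substitution"
        if tag in SINGLE_OP_LABEL:
            return SINGLE_OP_LABEL[tag]
    return "other"
```

Every label here is a small form-level change, which matches the observation that there is no "content rewrite" type.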
Twelve-year timelapse — location over time
Monthly correction volume, coloured by the location repaired. The form era and the git era meet in mid-2019 and form one continuous record.
Cross-dictionary error density
Corrections per 1,000 entries (<L> count), restricted to dictionaries with ≥30
events. This is a size-normalized quality signal, so a small, heavily edited
dictionary is not hidden by a large one.
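The density measure amounts to 1000 × events ÷ entries, computed only for dictionaries at or above the event threshold. A minimal sketch, with placeholder dictionary names and made-up counts used purely to show the arithmetic:

```python
def density_per_thousand(events, entries, min_events=30):
    """Corrections per 1,000 entries for dictionaries with >= min_events.

    `events` and `entries` map a dictionary code to its correction-event
    count and its entry count, respectively.
    """
    return {
        code: 1000 * n / entries[code]
        for code, n in events.items()
        if n >= min_events and entries.get(code, 0) > 0
    }


# Hypothetical counts, not taken from the corpus.
events = {"large": 300, "small": 60, "tiny": 12}
entries = {"large": 150_000, "small": 4_000, "tiny": 900}
density = density_per_thousand(events, entries)
```

In this toy input the large dictionary has five times as many events as the small one, yet the small one has the higher density (15.0 against 2.0 per 1,000 entries). The third dictionary is dropped because it falls below the threshold.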
Character confusion — the clean Sanskrit signal
Single-character substitutions in the form layer (IAST), restricted to consonants: the genuine phoneme-confusion signal, led by the classic b ↔ v merger.
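A sketch of how such a confusion table could be counted: keep only old/new form pairs that differ in exactly one position, require both characters to be IAST consonant letters, and tally the unordered pair. Treating each consonant as a single code point is a simplifying assumption, so aspirate digraphs such as "kh" are not handled as units. The example pairs are illustrative.

```python
import unicodedata
from collections import Counter

# IAST consonant letters as single code points (NFC).
IAST_CONSONANTS = set(unicodedata.normalize("NFC", "kgṅcjñṭḍṇtdnpbmyrlvśṣsh"))


def single_substitution(old, new):
    """Return (a, b) if the strings differ in exactly one position, else None."""
    old = unicodedata.normalize("NFC", old)
    new = unicodedata.normalize("NFC", new)
    if len(old) != len(new):
        return None
    diffs = [(a, b) for a, b in zip(old, new) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def confusion_counts(pairs):
    """Count unordered consonant confusions over (old, new) form pairs."""
    counts = Counter()
    for old, new in pairs:
        sub = single_substitution(old, new)
        if sub and sub[0] in IAST_CONSONANTS and sub[1] in IAST_CONSONANTS:
            counts[frozenset(sub)] += 1
    return counts


# Illustrative pairs only.
pairs = [("bala", "vala"), ("vīja", "bīja"), ("śiva", "siva"), ("rāma", "rama")]
counts = confusion_counts(pairs)
```

Using an unordered pair as the key means b → v and v → b fall into the same cell. The vowel-length pair is excluded because neither character is a consonant.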
Crosswalk typologies
The same events under the OCR/digitization and textual-criticism (Katre) frames, derived from the edit-op trace.
Reference baselines
Stdlib-only, deterministic baselines that define the NLP tasks the released corpus
supports, on a temporal split. See reports/obs_t_baselines.md
and the datasheet.
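A temporal split trains on earlier corrections and evaluates on later ones, so the test set never precedes the training data. A minimal sketch, assuming each event is a dict with a `date` field; the field name and the cutoff date are hypothetical and not taken from the released corpus.

```python
from datetime import date


def temporal_split(events, cutoff):
    """Train on events strictly before `cutoff`, test on the rest."""
    ordered = sorted(events, key=lambda e: e["date"])
    train = [e for e in ordered if e["date"] < cutoff]
    test = [e for e in ordered if e["date"] >= cutoff]
    return train, test


# Hypothetical events and cutoff, for illustration only.
events = [
    {"id": 3, "date": date(2023, 5, 1)},
    {"id": 1, "date": date(2015, 2, 10)},
    {"id": 2, "date": date(2019, 7, 4)},
]
train, test = temporal_split(events, cutoff=date(2022, 1, 1))
```

Sorting first keeps the output deterministic regardless of input order.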
Detection (char-LM, minimal pair): pairwise accuracy (chance 0.5)
Correction (noisy-channel): accuracy@1
Type classifier (Naive Bayes): accuracy vs majority class
Detection and correction are deliberately hard for context-free baselines, because a Sanskrit string that differs by one character is usually also plausible; that gap is the headroom a neural model is meant to fill. Error-type classification clearly beats the majority class.
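To make the minimal-pair detection task concrete, here is a stdlib-only sketch: a character bigram model with add-one smoothing scores the erroneous and the corrected string, and a pair counts as correct when the corrected form scores higher. The bigram order, the smoothing, and scoring ties as 0.5 are assumptions of this sketch; the released baseline may differ in all three.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram model with add-one smoothing."""

    def __init__(self, words):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for w in words:
            padded = "^" + w + "$"  # start and end markers
            self.vocab.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[a, b] += 1
                self.contexts[a] += 1

    def logprob(self, word):
        padded = "^" + word + "$"
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        return sum(
            math.log((self.bigrams[a, b] + 1) / (self.contexts[a] + v))
            for a, b in zip(padded, padded[1:])
        )


def pairwise_accuracy(lm, pairs):
    """pairs: (erroneous, corrected). A tie scores 0.5, i.e. chance."""
    total = 0.0
    for bad, good in pairs:
        lb, lg = lm.logprob(bad), lm.logprob(good)
        total += 1.0 if lg > lb else 0.5 if lg == lb else 0.0
    return total / len(pairs)


# Tiny illustrative training list, not corpus data.
lm = CharBigramLM(["deva", "devī", "veda", "vidyā", "bala", "bīja"])
```

A model like this sees only character statistics, so when both members of a pair are plausible strings their scores are close, which is why context-free detection stays near chance.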
Object of analysis: corrections over dictionary source text, in scope per
docs/BOUNDARY_RULES.md. The lexicographic-structure interpretation cross-links
to csl-atlas.