Error typology of digital Sanskrit dictionaries

What kinds of errors are corrected in the Cologne Digital Sanskrit Lexicon, where in the entry they occur, and how the profile changes over twelve years. Each of the 50,953 correction events (2014–2026, 43 dictionaries) is normalized to IAST and attributed to the dictionary microstructure component it repairs. See the finding (reports/obs_t_typology.md) and the design spec.

Summary cards: correction events, dictionaries, correctors, derived (not heuristic).

Two axes: location × edit-type

Each correction is described on two orthogonal axes: where in the entry it lands (the microstructure location) and what kind of change it is (the edit-type). Conflating the two is a pitfall, so the typology reports them separately.
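As a minimal sketch of what keeping the axes apart means in practice, each event can carry one label per axis and be counted in a location × edit-type table. The `Correction` class, the `cross_tab` helper, and the edit-type label strings are hypothetical illustrations, not the project's schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    location: str   # Axis A, e.g. "headword" or "sense"
    edit_type: str  # Axis B; these label strings are made up for illustration


def cross_tab(events):
    """Count events per (location, edit_type) cell without merging the axes."""
    return Counter((e.location, e.edit_type) for e in events)


# Toy events, invented for the example.
events = [
    Correction("sense", "substitution"),
    Correction("sense", "insertion"),
    Correction("headword", "substitution"),
    Correction("sense", "substitution"),
]
table = cross_tab(events)
by_location = Counter(e.location for e in events)
```

Because the two labels are stored independently, either axis can be summarised on its own (as in `by_location`) or jointly (as in `table`).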

Axis A — location (derived labels)

Where in the entry the correction lands. The git layer is attributed positionally from the source XML tags; the form layer is joined to csl-orig by headword. Only derived labels are reported: when the join fails, the location is left unassigned rather than guessed.
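The "no guess when the join fails" rule for the form layer can be sketched as below. This is an assumption-laden illustration: the `attribute_location` function, the dictionary-of-fields entry layout, and the substring test for locating the old text are all hypothetical and are not taken from the project's code.

```python
def attribute_location(event, entries_by_headword):
    """Return a derived location label, or None when it cannot be derived.

    `event` is assumed to hold the headword and the pre-correction text;
    `entries_by_headword` is assumed to map headword -> {location: text}.
    """
    entry = entries_by_headword.get(event["headword"])
    if entry is None:
        return None  # join failed: leave unlabelled rather than guess
    matches = [loc for loc, text in entry.items() if event["old"] in text]
    # Only accept an unambiguous match; anything else stays unlabelled.
    return matches[0] if len(matches) == 1 else None


# Toy entry, invented for the example (the sense text contains a typo).
entries_by_headword = {"deva": {"headword": "deva", "sense": "a god, diety"}}

in_sense = attribute_location({"headword": "deva", "old": "diety"}, entries_by_headword)
no_join = attribute_location({"headword": "xyz", "old": "diety"}, entries_by_headword)
ambiguous = attribute_location({"headword": "deva", "old": "d"}, entries_by_headword)
```

Returning `None` rather than a fallback label is the design choice that keeps reported locations derived rather than heuristic.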

Corrections concentrate in the sense (definition) and headword — the meaning-bearing fields.

Axis B — edit-type (all events)

What kind of change. Every category is a surface micro-edit; there is no "content rewrite" type — even corrections to definitions are small form fixes.

Twelve-year timelapse — location over time

Monthly correction volume, coloured by the location repaired. The form era and the git era join in mid-2019 to form one continuous record.

Cross-dictionary error density

Corrections per 1,000 entries (<L> count), dictionaries with ≥30 events — a size-normalized quality signal, so a small heavily-edited dictionary is not hidden by a large one.
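The normalization is simple arithmetic: density = 1000 × events / entries, computed only for dictionaries that meet the event threshold. A minimal sketch follows; the function name, dictionary names, and all counts are made up for illustration.

```python
def density_per_1000(event_counts, entry_counts, min_events=30):
    """Corrections per 1,000 entries for dictionaries with >= min_events events."""
    out = {}
    for name, n_events in event_counts.items():
        n_entries = entry_counts.get(name, 0)
        if n_events >= min_events and n_entries > 0:
            out[name] = 1000 * n_events / n_entries
    return out


# Invented counts: a small heavily-edited dictionary and a large one.
event_counts = {"small": 60, "large": 300, "rare": 5}
entry_counts = {"small": 2_000, "large": 100_000, "rare": 1_000}

density = density_per_1000(event_counts, entry_counts)
```

In this toy case the small dictionary has a fifth of the large one's events but ten times its density (30.0 versus 3.0), and the dictionary with only 5 events is excluded by the threshold.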

Character confusion — the clean Sanskrit signal

Single-character substitutions in the form layer (IAST), restricted to consonants: the genuine phoneme-confusion signal, led by the classic b ↔ v merger.
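One way to isolate such substitutions is to keep only pairs of equal length that differ in exactly one position, where both characters are consonants. The sketch below assumes NFC-normalized IAST and an explicit consonant inventory; the `CONSONANTS` set (which leaves out anusvāra and visarga), the function names, and the word pairs are illustrative assumptions, not the project's definitions.

```python
import unicodedata
from collections import Counter

# Assumed inventory of single-codepoint IAST consonants (aspirates are digraphs).
CONSONANTS = set(unicodedata.normalize("NFC", "kgṅcjñṭḍṇtdnpbmyrlvśṣsh"))


def single_substitution(old, new):
    """Return (old_char, new_char) for a one-consonant substitution, else None."""
    old = unicodedata.normalize("NFC", old)
    new = unicodedata.normalize("NFC", new)
    if len(old) != len(new):
        return None
    diffs = [(a, b) for a, b in zip(old, new) if a != b]
    if len(diffs) != 1:
        return None
    a, b = diffs[0]
    return (a, b) if a in CONSONANTS and b in CONSONANTS else None


def confusion_counts(pairs):
    """Count undirected consonant confusions over (old, new) pairs."""
    counts = Counter()
    for old, new in pairs:
        sub = single_substitution(old, new)
        if sub is not None:
            counts[frozenset(sub)] += 1
    return counts


# Invented pairs: two b/v swaps, one vowel change, one insertion.
pairs = [("bala", "vala"), ("vana", "bana"), ("deva", "devī"), ("deva", "devaḥ")]
counts = confusion_counts(pairs)
```

Using a `frozenset` key counts b → v and v → b together, which matches the undirected "b ↔ v" reading; a directed table would key on the tuple instead.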

Crosswalk typologies

The same events under the OCR/digitization and textual-criticism (Katre) frames, derived from the edit-op trace.
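An edit-op trace of this kind can be approximated with the standard library's `difflib.SequenceMatcher`. This is a sketch of one possible trace, not the project's implementation; the `edit_ops` helper is hypothetical, and the mapping from ops to the OCR or Katre categories is not shown because the source does not list those categories.

```python
from difflib import SequenceMatcher


def edit_ops(old, new):
    """List the non-equal opcodes that turn `old` into `new`."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


substitution = edit_ops("bala", "vala")
insertion = edit_ops("deva", "devaḥ")
unchanged = edit_ops("deva", "deva")
```

Each tuple records the operation tag (`replace`, `insert`, or `delete`) with the affected old and new substrings, which is enough input for a frame-specific relabelling step.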

Reference baselines

Stdlib-only, deterministic baselines that define the NLP tasks the released corpus supports, on a temporal split. See reports/obs_t_baselines.md and the datasheet.
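A temporal split trains on earlier events and evaluates on later ones, so no future correction leaks into training. The sketch below is illustrative only: the `temporal_split` helper, the event layout, and the cutoff date are assumptions, and the actual split is defined in the report.

```python
from datetime import date


def temporal_split(events, cutoff):
    """Split events into (train, test): before the cutoff vs. on or after it."""
    train = [e for e in events if e["date"] < cutoff]
    test = [e for e in events if e["date"] >= cutoff]
    return train, test


# Invented events and cutoff, for illustration.
events = [
    {"date": date(2015, 3, 1), "old": "bala", "new": "vala"},
    {"date": date(2021, 12, 31), "old": "deva", "new": "devaḥ"},
    {"date": date(2022, 1, 1), "old": "vana", "new": "bana"},
    {"date": date(2024, 6, 9), "old": "devi", "new": "devī"},
]
train, test = temporal_split(events, date(2022, 1, 1))
```

The split depends only on the recorded dates, so it is deterministic and reproducible without a random seed.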

Detection (char-LM, minimal pair): pairwise accuracy (chance 0.5)

Correction (noisy-channel): accuracy@1, reachable at dist-1

Type classifier (Naive Bayes): accuracy vs majority

Detection and correction are deliberately hard for context-free baselines — a one-character-different Sanskrit string is usually also plausible — which is the headroom a neural model is meant to fill. Error-type classification clearly beats the majority class.
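To make the detection metric concrete, the sketch below scores each minimal pair with a character bigram language model and counts how often the corrected form outscores the erroneous one, with ties counting one half so that chance is 0.5. The model (add-one smoothed bigrams), the class and function names, and the toy data are assumptions for illustration; the released baseline may differ in every detail.

```python
import math
from collections import Counter


class CharBigramLM:
    """Add-one smoothed character bigram model (illustrative only)."""

    def __init__(self, texts):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for text in texts:
            padded = "^" + text + "$"  # start and end markers
            self.vocab.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[a, b] += 1
                self.contexts[a] += 1

    def logprob(self, text):
        padded = "^" + text + "$"
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        return sum(
            math.log((self.bigrams[a, b] + 1) / (self.contexts[a] + v))
            for a, b in zip(padded, padded[1:])
        )


def pairwise_accuracy(lm, pairs):
    """Share of (erroneous, corrected) pairs where the corrected form scores higher."""
    total = 0.0
    for wrong, right in pairs:
        lw, lr = lm.logprob(wrong), lm.logprob(right)
        total += 1.0 if lr > lw else 0.5 if lr == lw else 0.0
    return total / len(pairs)


# Toy training forms and one toy minimal pair, invented for the example.
lm = CharBigramLM(["deva", "devī", "veda"])
acc = pairwise_accuracy(lm, [("dexa", "deva")])
```

The toy pair is easy because `x` never occurs in the training forms; real pairs typically swap one attested character for another, which is why a context-free model stays close to chance.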

Object of analysis: corrections over dictionary source text — in scope per docs/BOUNDARY_RULES.md. The lexicographic-structure interpretation cross-links to csl-atlas.