Related work: where the atlas sits in the digitisation pipeline

A short positioning note placing the atlas relative to the current literature on machine dictionary digitisation. It complements the grounded body, the triangulation against three lexicographic frameworks, and the framework appendices.

Trust Block

Positioning

Recent vision-language work treats dictionary digitisation as a two-stage problem: faithful page transcription, then parsing into a lexicographic schema. The MUDIDI benchmark (Setiawan et al.; 30 public-domain dictionaries across diverse scripts, Sanskrit–English among them) reports that the decisive, low-cost intervention for the parsing stage is not a bigger model but per-dictionary prior knowledge. Supplying a dictionary's own introduction (its abbreviation keys and entry conventions) and supplying a formal field schema each lift entry-field-assignment F1 by roughly 3–6 points, and substituting human-validated parse-rules for machine-inferred ones adds about 6 more.

This is independent, quantified corroboration of the atlas's core methodological bet: that a dictionary's house-style conventions are not incidental but are the controlling signal for correct structural interpretation. That is exactly the knowledge the atlas already encodes by hand in three places: the convention fingerprints (25 dimensions: Patel's seven canonical normalisation conventions plus eighteen auto-extracted), the structural register, and the eighteen-block microstructure apparatus.

The atlas therefore stands one step downstream of MUDIDI: it analyses CDSL text that Cologne has already keyed, rather than recovering text from scans. The two meet at the schema boundary. The atlas's per-dictionary convention profiles are precisely the validated parse-rules that MUDIDI shows are worth the largest single F1 gain, and the candidate MDF export adds SIL's Multi-Dictionary Formatter as a third interoperability target beside the csl-standards TEI and OntoLex/FrAC views.

(Note that MUDIDI exercises Sanskrit only in its transcription stage; its ten-dictionary parsing subset does not include Sanskrit. This is a gap the atlas's source-linked structured CDSL data is uniquely placed to fill.)
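To make the parsing-stage claim concrete, the following is a minimal sketch, not the atlas's or MUDIDI's actual code, of how a per-dictionary convention profile could change field assignment on an already-keyed entry, and how a field-level F1 could register the difference. Everything named here is an assumption for illustration: `ConventionProfile`, `parse_entry`, `to_pairs`, and `field_f1` are hypothetical, the entry line is an invented toy string rather than CDSL text, and the F1 is a simple micro-average over (field, value) pairs that may differ from MUDIDI's scoring definition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConventionProfile:
    """Hypothetical per-dictionary parse-rules (not the atlas's real schema)."""
    name: str
    # Abbreviation key, e.g. "m." -> "masculine"; empty means "no prior knowledge".
    pos_abbreviations: dict = field(default_factory=dict)
    sense_separator: str = ";"


def parse_entry(line: str, profile: ConventionProfile) -> dict:
    """Assign fields of one keyed entry line using the profile's conventions."""
    headword, _, rest = line.partition(" ")
    fields = {"headword": headword}
    tokens = rest.split(" ", 1)
    # Only a profile that knows the abbreviation can peel it off as a POS field.
    if tokens[0] in profile.pos_abbreviations:
        fields["pos"] = profile.pos_abbreviations[tokens[0]]
        rest = tokens[1] if len(tokens) > 1 else ""
    fields["senses"] = [
        s.strip() for s in rest.split(profile.sense_separator) if s.strip()
    ]
    return fields


def to_pairs(fields: dict) -> set:
    """Flatten a parsed entry into a set of (field, value) pairs."""
    pairs = {("headword", fields["headword"])}
    if "pos" in fields:
        pairs.add(("pos", fields["pos"]))
    pairs.update(("sense", s) for s in fields["senses"])
    return pairs


def field_f1(predicted: dict, gold: dict) -> float:
    """Toy micro F1 over (field, value) pairs; an assumed metric definition."""
    p, g = to_pairs(predicted), to_pairs(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Invented toy entry and gold parse, for illustration only.
LINE = "agni m. fire; the god of fire"
GOLD = {"headword": "agni", "pos": "masculine",
        "senses": ["fire", "the god of fire"]}

with_profile = ConventionProfile("toy", {"m.": "masculine"})
without_profile = ConventionProfile("toy-blind")

f1_with = field_f1(parse_entry(LINE, with_profile), GOLD)
f1_without = field_f1(parse_entry(LINE, without_profile), GOLD)
print(f"with profile: {f1_with:.3f}, without: {f1_without:.3f}")
```

In this toy case the blind parse leaves the abbreviation glued to the first sense, so both the part-of-speech field and one sense are missed. The sketch only illustrates the mechanism; the size of the gap here says nothing about the 3–6 point gains MUDIDI reports.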

See also


Tour page. The canonical microanalysis paper lives in MWS docs-pass; MUDIDI is external related work, summarised here for positioning.