Paper — related work & positioning | Atlas of the Cologne Digital Sanskrit Lexicons

A short positioning note placing the atlas relative to the current literature on machine dictionary digitisation. It complements the grounded body, the triangulation against three lexicographic frameworks, and the framework appendices.

Trust Block

Evidence: Setiawan et al., MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models (University of Melbourne + LILT; code: DavidSamuell/MUDIDI), read against this repo's convention-fingerprint and microstructure outputs. A local PDF copy is kept with the project sources; it is not redistributed here pending confirmation of its licence.
Limitations: MUDIDI is an upstream digitisation/benchmark paper, not a test of the atlas's block-economy thesis; its support is for the general convention-priors claim, and its exact F1 figures are from a recent preprint and should be re-checked before citation.
Validation: checked by npm run build; claims about MUDIDI should be verified against the source paper.
Owner repo: csl-atlas.
Next use: read alongside the parse-rules framing and the candidate MDF export profile in csl-standards.

Positioning

Recent vision-language work treats dictionary digitisation as a two-stage problem: faithful page transcription, then parsing into a lexicographic schema. The MUDIDI benchmark (Setiawan et al.; 30 public-domain dictionaries across diverse scripts, Sanskrit–English among them) reports that the decisive, low-cost intervention for the parsing stage is not a bigger model but per-dictionary prior knowledge. Supplying a dictionary's own introduction (its abbreviation keys and entry conventions) and supplying a formal field schema each lift entry-field-assignment F1 by roughly 3–6 points, and substituting human-validated parse-rules for machine-inferred ones adds about 6 more.

This is independent, quantified corroboration of the atlas's core methodological bet: that a dictionary's house-style conventions are not incidental but are the controlling signal for correct structural interpretation. That is the knowledge the atlas already encodes by hand as the convention fingerprints (25 dimensions: Patel's seven canonical normalisation conventions plus eighteen auto-extracted ones), the structural register, and the eighteen-block microstructure apparatus.

The atlas therefore stands one step downstream of MUDIDI: it analyses CDSL text that Cologne has already keyed, rather than recovering text from scans. The two meet at the schema boundary. The atlas's per-dictionary convention profiles are the kind of human-validated parse-rules that MUDIDI reports as worth the largest single F1 gain, and the candidate MDF export adds SIL's Multi-Dictionary Formatter as a third interoperability target beside the csl-standards TEI and OntoLex/FrAC views.

Note that MUDIDI exercises Sanskrit only in its transcription stage; its ten-dictionary parsing subset does not include Sanskrit, a gap the atlas's source-linked structured CDSL data is well placed to fill.
