CSL Observatory
A living, fully-open measurement of the Cologne Digital Sanskrit Lexicon — 13 years of volunteer work digitising and correcting the foundational Sanskrit dictionaries, turned into citable, reproducible data. Every figure below is computed live from datasets you can download and reuse; nothing here is hand-typed.
Repos tracked
Total issues+PRs
Total commits
Human contributors
Key findings
Across 13 years and
- Concentration: the core trio carries […]% of all contributions, and […] of repos have a bus factor of 1. Community →
- Activity: thousands of commits, yet the busiest year drew only […] distinct authors: volume-per-person, not a growing base. Activity →
- Process: […]% of issues are fully taxonomy-conformant, after adoption climbed to a 92% peak in 2025. Issue taxonomy →
- Hygiene: […] of repos carry no license, and no contributor has a registered ORCID. Repo health →
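The concentration figures can be illustrated with a minimal sketch. The page does not state how the observatory defines bus factor, so this assumes one common convention: the smallest number of top contributors whose combined commits exceed half of a repo's total. The function names, the 50% threshold, and the sample counts are all hypothetical.

```python
from collections import Counter


def bus_factor(commits_by_author, threshold=0.5):
    """Smallest number of top contributors whose combined commits
    exceed `threshold` of the total (assumed definition, see lead-in)."""
    total = sum(commits_by_author.values())
    if total == 0:
        return 0
    covered = 0
    ranked = Counter(commits_by_author).most_common()
    for rank, (_, count) in enumerate(ranked, start=1):
        covered += count
        if covered > threshold * total:
            return rank
    return len(ranked)


def top_share(commits_by_author, k=3):
    """Percentage of all contributions made by the k most active authors."""
    total = sum(commits_by_author.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in Counter(commits_by_author).most_common(k))
    return 100.0 * top / total


# Hypothetical per-repo commit counts, not real observatory data.
repos = {
    "repo-a": {"alice": 90, "bob": 10},
    "repo-b": {"alice": 40, "bob": 35, "carol": 25},
}
factors = {name: bus_factor(counts) for name, counts in repos.items()}
single_point = [name for name, factor in factors.items() if factor == 1]
print(factors, single_point)
```

Under this definition, a repo where one author made most of the commits has a bus factor of 1, which is the situation the Concentration finding flags.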
Lead figure: How the work changed over 13 years
This chart is the single most expressive summary of the project's history. For most of its existence, the org's issue activity was almost entirely text-correction — catching OCR errors, transcription mistakes, and print-digitisation artefacts in the scanned dictionary text. Around 2019, link-target began to grow as the project built clickable links from dictionary source references to scanned PDF pages. More recently, markup, enhancement, and bug issues have grown, signalling the project's shift from raw content correction towards structured data quality and web-display features.
How to read: Each coloured band is one issue-type label; the total height of the stack in any year is the number of issues opened that year across all repos. Example 1: A year where text-correction fills the bar almost to the top means nearly all issues that year were correction tickets, so the project was in pure digitisation mode. Example 2: Watching the top of the stack change colour over time reveals which new workstreams emerged. For instance, link-target appearing and growing after 2018 is the visual signature of the dictionary-to-book linking phase beginning.
Conclusion: The org's issue history is a direct readout of its digitisation roadmap: first correct the text, then build the links, then improve the structure. The recent growth of enhancement and bug issues reflects a mature project that is now maintaining and improving a stable infrastructure rather than racing to complete a corpus.
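The stacking behind the lead figure can be sketched as a per-year count of issues by label. This is a minimal illustration, assuming each issue reduces to a (year opened, issue-type label) pair; the record layout, function names, and sample rows are hypothetical and not taken from the observatory's datasets.

```python
from collections import Counter, defaultdict

# Hypothetical records: (year opened, issue-type label).
issues = [
    (2016, "text-correction"),
    (2016, "text-correction"),
    (2019, "text-correction"),
    (2019, "link-target"),
    (2024, "enhancement"),
    (2024, "bug"),
    (2024, "text-correction"),
]


def bands_by_year(records):
    """Count issues per label within each year (one band per label)."""
    bands = defaultdict(Counter)
    for year, label in records:
        bands[year][label] += 1
    return dict(bands)


def label_share(bands, year, label):
    """Fraction of that year's stack occupied by one label."""
    total = sum(bands[year].values())
    return bands[year][label] / total if total else 0.0


bands = bands_by_year(issues)
for year in sorted(bands):
    stack_height = sum(bands[year].values())  # total issues opened that year
    print(year, stack_height, dict(bands[year]))
```

A year in which `label_share` for text-correction is close to 1.0 corresponds to a bar filled almost entirely by that band.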
Annual throughput
The annual throughput chart compares the two most fundamental issue metrics: how many issues were opened each year versus how many were closed. The gap between them in any given year reveals whether the backlog was growing or shrinking. Years where openings far exceed closings correspond to campaign launches; years where closings exceed openings correspond to focused resolution sprints or the tail of a completed campaign.
How to read: Two side-by-side bars for each year — blue for issues opened, green for issues closed. A taller blue bar than green bar in a year means the backlog grew; taller green means it shrank. Example 1: A year with a blue bar twice the height of the green bar indicates an intense campaign launch that created far more work than was resolved — issues were opened in bulk as a tracking mechanism. Example 2: A year where the green bar matches or exceeds the blue bar shows the project resolving work as fast as it opens it — a steady-state or catch-up mode.
Conclusion: The opened/closed gap is largest in 2020–2022, the peak of the correction campaigns, and narrows in 2023–2025 as those campaigns wound down. The 2026 data shows closings keeping pace with openings, a healthier balance that is consistent with the backlog-reduction trend visible on the Activity page.
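The opened-versus-closed comparison amounts to simple per-year arithmetic, sketched below. It assumes each issue is reduced to a pair of (year opened, year closed or None if still open); that layout, the function name, and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical issues: (year opened, year closed or None if still open).
issues = [
    (2020, 2020), (2020, 2021), (2020, None),
    (2021, 2021), (2021, 2023),
    (2023, 2023),
]


def annual_throughput(records):
    """Return (year, opened, closed, net change, running backlog) rows."""
    opened = Counter(opened_year for opened_year, _ in records)
    closed = Counter(closed_year for _, closed_year in records
                     if closed_year is not None)
    rows = []
    backlog = 0
    for year in sorted(set(opened) | set(closed)):
        net = opened[year] - closed[year]  # positive: backlog grew that year
        backlog += net
        rows.append((year, opened[year], closed[year], net, backlog))
    return rows


rows = annual_throughput(issues)
for row in rows:
    print(row)
```

A positive net change matches a taller blue bar (backlog growing); zero or negative matches the steady-state or catch-up years.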
Top 10 most active repositories (all-time)
The ten most active repositories by lifetime activity (issues + PRs + commits combined) reveal which parts of the org have absorbed the most collective work. The list is dominated by the two central source-data repositories (csl-orig and whichever dictionary has been most actively corrected), plus the web infrastructure repos that are touched on every deploy cycle. Knowing which repos are most active also predicts where bugs, inconsistencies, and coordination bottlenecks are most likely to arise.
How to read: Each horizontal bar is one repository; its length equals the sum of all issues opened, pull requests opened, and commits ever recorded for that repo. Repos are sorted from most to least active, and only the top 10 are shown. Example 1: If csl-orig leads by a wide margin, the canonical correction source repository has absorbed more combined activity than any other repo, which is expected, since every dictionary correction flows through it. Example 2: A tooling repo appearing in the top 10 (such as csl-pywork or csl-websanlexicon) indicates that the infrastructure layer has needed ongoing, intensive maintenance alongside the dictionary content work.
Conclusion: The top-10 list is heavily skewed towards csl-orig and a handful of high-correction dictionaries, confirming that the project's activity is driven by content work rather than infrastructure churn. A tooling repo appearing in the top 10 is a signal worth investigating — it may indicate fragile tooling that requires frequent fixes, or an actively developed capability.
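The ranking described above (issues + PRs + commits, sorted descending, cut at ten) can be sketched as follows. The `RepoActivity` type, the `top_repos` helper, and the repository names and counts are hypothetical placeholders, not observatory data.

```python
from typing import NamedTuple


class RepoActivity(NamedTuple):
    name: str
    issues: int
    pull_requests: int
    commits: int

    @property
    def total(self):
        """Lifetime activity: issues + PRs + commits combined."""
        return self.issues + self.pull_requests + self.commits


def top_repos(repos, n=10):
    """Sort repos from most to least active and keep the first n."""
    return sorted(repos, key=lambda repo: repo.total, reverse=True)[:n]


# Hypothetical counts for illustration only.
repos = [
    RepoActivity("content-repo", issues=120, pull_requests=15, commits=900),
    RepoActivity("tooling-repo", issues=40, pull_requests=30, commits=400),
    RepoActivity("small-repo", issues=2, pull_requests=0, commits=12),
]
ranked = top_repos(repos, n=2)
for repo in ranked:
    print(repo.name, repo.total)
```

Summing three different kinds of event weights them equally, so a commit-heavy repo can outrank an issue-heavy one; that is a property of the combined metric as the page describes it.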
Navigation
- Ops Command - maintainer operating dashboard across blockers, issue pressure, metadata, and bus factor
- Activity - issue/commit/PR throughput timelines, heatmaps, GitHub-style year grids
- OBS-T Maintenance - light operational checks for the correction typology release
- Issue taxonomy - GitHub issue and PR label patterns by repo
- Taxonomy Triage - label quality, conformance, and open issue triage views
- Community - contributor growth, retention, bus-factor analysis
- Community Continuity - maintainer concentration, retention, and identity readiness
- Repository Health - licensing, default-branch, and hygiene audit
- Repository Risk - deeper license, branch, flag, size, and cleanup-risk charts
- Metadata Readiness - B3 documentation, automation, release, and unknown-field blockers
- Tech Stack - language evolution, dependency graphs, runbook adoption
- Repository Benchmarks - project-level openness and repository evidence
- Data - raw downloads (CSV, JSON, Parquet) for reproducibility
About & how to cite
The observatory is an open-source project of the sanskrit-lexicon organisation, part of the Cologne Digital Sanskrit Dictionaries effort. It measures only the org's own GitHub activity — repositories, issues, commits, contributors — and the public correction record; the dictionary content itself lives in the upstream dictionary repos. Code and data are released under open licences (GPL-3.0 for code, CC-BY-4.0 for the datasets).
To cite the data, see Data downloads → Citation. The error-typology corpus has its own datasheet.
Data snapshot: