CSL Observatory

A living, fully open measurement of the Cologne Digital Sanskrit Lexicon: 13 years of volunteer work digitising and correcting the foundational Sanskrit dictionaries, turned into citable, reproducible data. Every figure below is computed live from datasets you can download and reuse; nothing here is hand-typed.

Headline counters (computed live): repos tracked, total issues+PRs, total commits, human contributors.

Key findings

Across 13 years and [n] repositories, a small, dedicated team has logged tens of thousands of dictionary corrections entirely in the open, and the error-typology study turns [n] of them, across [n] dictionaries, into a published, reusable language resource. Four offline, reproducible analyses describe the organisation behind that work: productive, well-governed, and actively maintained, though carried by a tiny core and still thin on reuse metadata. Full write-up: synthesis report.

Concentration: the core trio carries [n]% of all contributions, and [n] of repos have a bus factor of 1. Community →
Activity: thousands of commits, yet the busiest year drew only [n] distinct authors. The volume comes from output per person, not from a growing base. Activity →
Process: [n]% of issues are fully taxonomy-conformant, after adoption climbed to a 92% peak in 2025. Issue taxonomy →
Hygiene: [n] of repos carry no license, and no contributor has a registered ORCID. Repo health →
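
The concentration figures above can be reproduced from a plain list of commit authors. The sketch below is one way to do it, not the observatory's actual pipeline. It assumes that "bus factor" means the smallest number of contributors who together account for more than half of all commits (a common definition, but an assumption here). The function names `bus_factor` and `top_share` are hypothetical.

```python
from collections import Counter


def bus_factor(commit_authors, threshold=0.5):
    """Smallest number of authors whose commits exceed `threshold` of the total.

    Assumed definition; the observatory may compute bus factor differently.
    """
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk authors from most to least active until the threshold is crossed.
    for rank, (_author, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total > threshold:
            return rank
    return len(counts)


def top_share(commit_authors, k=3):
    """Fraction of all commits made by the k most active authors."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _author, n in counts.most_common(k)) / total
```

With `k=3`, `top_share` corresponds to the "core trio" share. A repository where `bus_factor` returns 1 has a single author responsible for more than half of its commits.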

Lead figure: How the work changed over 13 years

This chart is the single most expressive summary of the project's history. For most of its existence, the org's issue activity was almost entirely text-correction — catching OCR errors, transcription mistakes, and print-digitisation artefacts in the scanned dictionary text. Around 2019, link-target began to grow as the project built clickable links from dictionary source references to scanned PDF pages. More recently, markup, enhancement, and bug issues have grown, signalling the project's shift from raw content correction towards structured data quality and web-display features.

How to read: Each coloured band is one issue-type label; stacking shows the total issues opened that year across all repos. The total height of the stack in any year is the volume of that year's issue activity. Example 1: A year where text-correction fills the entire bar almost to the top means nearly all issues that year were correction tickets — the project was in pure digitisation mode. Example 2: Watching the top of the stack change colour over time reveals which new workstreams emerged: link-target appearing and growing after 2018 is the visual signature of the dictionary-to-book linking phase beginning.
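
The stacked bands described above amount to a count of issues per year and per issue-type label. The following is a minimal sketch under stated assumptions: each issue is a dict with an ISO-8601 `created_at` string and a single `label` field (both field names are hypothetical, and real issues may carry several labels), and the sample records are invented purely for illustration.

```python
from collections import Counter, defaultdict


def issues_by_year_and_label(issues):
    """Return {year: Counter({label: count})}, one Counter per stacked bar."""
    bands = defaultdict(Counter)
    for issue in issues:
        year = int(issue["created_at"][:4])  # "2016-05-02T..." -> 2016
        bands[year][issue["label"]] += 1
    return dict(bands)


def dominant_label(bands, year):
    """Label with the tallest band in `year` and its share of that year's bar."""
    counts = bands.get(year, Counter())
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    label, n = counts.most_common(1)[0]
    return label, n / total


# Invented sample data, not taken from the observatory's datasets.
sample = [
    {"created_at": "2016-05-02T10:00:00Z", "label": "text-correction"},
    {"created_at": "2016-07-19T08:30:00Z", "label": "text-correction"},
    {"created_at": "2016-11-03T14:12:00Z", "label": "text-correction"},
    {"created_at": "2016-12-01T09:00:00Z", "label": "bug"},
    {"created_at": "2019-02-11T16:45:00Z", "label": "link-target"},
    {"created_at": "2019-03-20T11:05:00Z", "label": "text-correction"},
]
bands = issues_by_year_and_label(sample)
```

The total height of a year's bar is `sum(bands[year].values())`. A year in which `dominant_label` reports a share close to 1.0 for text-correction matches the "pure digitisation mode" reading in Example 1.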

Conclusion: The org's issue history is a direct readout of its digitisation roadmap — first correct the text, then build the links, then improve the structure. The recent growth of enhancement and bug issues reflects a mature project that is now maintaining and improving a stable infrastructure rather than racing to complete a corpus.

Annual throughput

The annual throughput chart compares the two most fundamental issue metrics: how many issues were opened each year versus how many were closed. The gap between them in any given year reveals whether the backlog was growing or shrinking. Years where openings far exceed closings correspond to campaign launches; years where closings exceed openings correspond to focused resolution sprints or the tail of a completed campaign.

How to read: Two side-by-side bars for each year — blue for issues opened, green for issues closed. A taller blue bar than green bar in a year means the backlog grew; taller green means it shrank. Example 1: A year with a blue bar twice the height of the green bar indicates an intense campaign launch that created far more work than was resolved — issues were opened in bulk as a tracking mechanism. Example 2: A year where the green bar matches or exceeds the blue bar shows the project resolving work as fast as it opens it — a steady-state or catch-up mode.
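
The opened-versus-closed comparison can be sketched as two yearly counts plus a running difference. This is an illustration, not the observatory's code: it assumes each issue has a `created_at` timestamp and an optional `closed_at` timestamp as ISO-8601 strings, and the function name `annual_throughput` is hypothetical.

```python
from collections import Counter


def annual_throughput(issues):
    """Return rows of (year, opened, closed, net_change, cumulative_backlog)."""
    opened = Counter()
    closed = Counter()
    for issue in issues:
        opened[int(issue["created_at"][:4])] += 1
        if issue.get("closed_at"):  # still-open issues have no closing year
            closed[int(issue["closed_at"][:4])] += 1

    rows = []
    backlog = 0
    for year in sorted(set(opened) | set(closed)):
        net = opened[year] - closed[year]  # positive: backlog grew that year
        backlog += net
        rows.append((year, opened[year], closed[year], net, backlog))
    return rows
```

A positive `net_change` corresponds to a taller blue bar than green bar, and a zero or negative value corresponds to the steady-state or catch-up mode in Example 2.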

Conclusion: The opened/closed gap is largest in 2020–2022, the peak of the correction campaigns, and narrows in 2023–2025 as those campaigns wound down. The 2026 data shows closings keeping pace with openings, a healthier balance that is consistent with the backlog-reduction trend visible on the Activity page.

Top 10 most active repositories (all-time)

The ten most active repositories by lifetime activity (issues + PRs + commits combined) reveal which parts of the org have absorbed the most collective work. The ranking is dominated by the two central source-data repositories (csl-orig and whichever dictionary has been most actively corrected), plus the web infrastructure repos that are touched on every deploy cycle. Knowing which repos are most active also predicts where bugs, inconsistencies, and coordination bottlenecks are most likely to arise.

How to read: Each horizontal bar is one repository; its length equals the sum of all issues opened, pull requests opened, and commits ever recorded for that repo. Repos are sorted from most to least active, showing only the top 10. Example 1: If csl-orig leads by a wide margin, the canonical correction source repository has absorbed more combined activity than any other repo, which is expected, since every dictionary correction flows through it. Example 2: A tooling repo appearing in the top 10 (such as csl-pywork or csl-websanlexicon) indicates that the infrastructure layer has needed ongoing, intensive maintenance alongside the dictionary content work.
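
The ranking rule above (bar length = issues + PRs + commits, sorted descending, top 10 only) can be sketched in a few lines. The input shape, the function name `top_repositories`, and the sample numbers are all hypothetical and chosen only to show the mechanics.

```python
def top_repositories(stats, n=10):
    """Rank repos by issues + PRs + commits and return the top n as (repo, total)."""
    totals = {
        repo: s["issues"] + s["prs"] + s["commits"]
        for repo, s in stats.items()
    }
    # Sort by total descending; break ties alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up figures with placeholder repository names.
sample_stats = {
    "repo-a": {"issues": 120, "prs": 15, "commits": 900},
    "repo-b": {"issues": 40, "prs": 5, "commits": 300},
    "repo-c": {"issues": 300, "prs": 2, "commits": 50},
}
ranking = top_repositories(sample_stats, n=2)
```

Because the three counts are simply added, a commit-heavy repo and an issue-heavy repo can land on similar totals. Keeping the three components alongside the sum makes it easier to tell content work from infrastructure churn.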

Conclusion: The top-10 list is heavily skewed towards csl-orig and a handful of high-correction dictionaries, confirming that the project's activity is driven by content work rather than infrastructure churn. A tooling repo appearing in the top 10 is a signal worth investigating — it may indicate fragile tooling that requires frequent fixes, or an actively developed capability.

Ops Command: maintainer operating dashboard across blockers, issue pressure, metadata, and bus factor
Activity: issue/commit/PR throughput timelines, heatmaps, GitHub-style year grids
OBS-T Maintenance: light operational checks for the correction typology release
Issue taxonomy: GitHub issue and PR label patterns by repo
Taxonomy Triage: label quality, conformance, and open issue triage views
Community: contributor growth, retention, bus-factor analysis
Community Continuity: maintainer concentration, retention, and identity readiness
Repository Health: licensing, default-branch, and hygiene audit
Repository Risk: deeper license, branch, flag, size, and cleanup-risk charts
Metadata Readiness: B3 documentation, automation, release, and unknown-field blockers
Tech Stack: language evolution, dependency graphs, runbook adoption
Repository Benchmarks: project-level openness and repository evidence
Data: raw downloads (CSV, JSON, Parquet) for reproducibility

About & how to cite

The observatory is an open-source project of the sanskrit-lexicon organisation, part of the Cologne Digital Sanskrit Dictionaries effort. It measures only the org's own GitHub activity — repositories, issues, commits, contributors — and the public correction record; the dictionary content itself lives in the upstream dictionary repos. Code and data are released under open licences (GPL-3.0 for code, CC-BY-4.0 for the datasets).

To cite the data, see Data downloads → Citation. The error-typology corpus has its own datasheet.

Data snapshot: [date], refreshed monthly from the GitHub API. See how it's built and Data downloads for the exact figures.