Issue Taxonomy

How digitisation work is represented in GitHub issues and PR labels. This page does not measure dictionary-entry coverage directly.

Issue typology evolution

The stacked area chart shows how the mix of issue types shifted as the project matured. In the early years (2014–2018), text-correction dominated overwhelmingly — the project was in raw digitisation mode, catching OCR errors and transcription mistakes. From 2019 onwards, link-target and markup issues grew as the project moved towards building richer web-display features and structured data. Since 2022, enhancement and bug issues have appeared in meaningful numbers, reflecting a maturing toolset that is now being actively improved rather than just populated with content corrections.

How to read: Each coloured band is one issue-type label; bands are stacked so the total height equals all issues opened that year. Example 1: A thickening text-correction band in 2020–2022 means correction issues dominated that peak period — the bulk csl-orig correction campaigns are visible here. Example 2: A new colour appearing at the top of the stack in a recent year marks the emergence of a label type that was rarely used before — for instance, enhancement growing visibly after 2022 signals the project's shift towards feature work.
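The per-year bands described above could be tallied roughly as follows. This is a minimal sketch, not the page's actual pipeline: the record fields "opened" and "labels" are assumptions, the sample rows are made up, and the type-label set contains only the names mentioned on this page.

```python
from collections import Counter, defaultdict

# Type labels named on this page; the real set may be larger (assumption).
TYPE_LABELS = {"text-correction", "link-target", "markup", "enhancement", "bug"}


def type_counts_by_year(issues, type_labels=TYPE_LABELS):
    """Return {year: Counter({type_label: count})} keyed by the year an issue was opened."""
    bands = defaultdict(Counter)
    for issue in issues:
        year = int(issue["opened"][:4])  # assumes ISO dates such as "2021-03-14"
        for label in issue["labels"]:
            if label in type_labels:  # non-type labels do not form a band
                bands[year][label] += 1
    return dict(bands)


# Made-up sample records, for illustration only.
sample = [
    {"opened": "2016-05-01", "labels": ["text-correction"]},
    {"opened": "2016-07-12", "labels": ["text-correction"]},
    {"opened": "2023-02-03", "labels": ["enhancement"]},
    {"opened": "2023-02-09", "labels": ["markup", "help wanted"]},
]
bands = type_counts_by_year(sample)
for year in sorted(bands):
    print(year, dict(bands[year]))
```

In this sketch an issue carrying two type labels is counted once in each band, so a year's stack height matches its issue count only when every issue has a single type label.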

Conclusion: text-correction dominated the issue base for most of the project's life. Link-target and markup issues grew from 2019, and enhancement and bug issues from 2022, signalling a shift from raw correction work towards structural improvement and web-display feature development. The issue history therefore tracks the project's digitisation roadmap.

Taxonomy adoption & conformance

Conformance means an issue carries exactly one type label, exactly one severity label, and a milestone — the rule in the org CLAUDE.md. Labels were applied retroactively by the runbook, so this is coverage across the historical issue base, bucketed by the year each issue was opened. Full finding: reports/taxonomy_adoption.md (generated by scripts/taxonomy_adoption.py).
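The conformance rule can be expressed as a small predicate. This is a sketch under stated assumptions, not the logic of scripts/taxonomy_adoption.py: the severity label names are hypothetical, and the type-label set lists only the names mentioned on this page.

```python
# Type labels named on this page; the real set may be larger (assumption).
TYPE_LABELS = {"text-correction", "link-target", "markup", "enhancement", "bug"}
# Hypothetical severity label names; the page does not list the real ones.
SEVERITY_LABELS = {"severity:low", "severity:medium", "severity:high"}


def is_conformant(labels, milestone):
    """True when there is exactly one type label, exactly one severity label, and a milestone."""
    label_set = set(labels)
    n_type = len(label_set & TYPE_LABELS)
    n_severity = len(label_set & SEVERITY_LABELS)
    return n_type == 1 and n_severity == 1 and milestone is not None


print(is_conformant(["text-correction", "severity:low"], "2025-Q1"))            # True
print(is_conformant(["text-correction", "markup", "severity:low"], "2025-Q1"))  # False: over-typed
print(is_conformant(["text-correction", "severity:low"], None))                 # False: no milestone
```

The second call shows why an over-typed issue fails the rule even though it carries a type label.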

Summary cards: Issues classified, Carry a type label, Fully conformant, Over-typed (>1 type).

How to read: Four lines track different requirements for the same cohort of issues: typed (carries at least one type label), severity (carries a severity label), milestone (assigned to a milestone), and conformant (exactly one type label, exactly one severity label, and a milestone). The conformant line is always at or below the lowest of the other three. Example 1: A year where all four lines cluster near 100% means nearly every issue opened that year was fully tagged; the taxonomy was working well for that cohort. Example 2: A large gap between the typed line and the conformant line in a given year reveals the bottleneck: type labels are applied, but severity or milestone assignment is lagging, or some issues carry more than one type label.

Each line is the share of that year's issues meeting one requirement. Conformance climbs from the low 20s (percent) before the taxonomy matured (2014–2018) to a 92% peak in 2025; the 2026 dip reflects recently opened issues not yet fully triaged (no milestone assigned).
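The four per-year shares could be computed along these lines. This is an illustrative sketch, not the generated report's code: the record fields "year", "labels" and "milestone" are assumptions, the severity label names are hypothetical, and the sample issues are made up.

```python
from collections import defaultdict

# Type labels named on this page; the real set may be larger (assumption).
TYPE_LABELS = {"text-correction", "link-target", "markup", "enhancement", "bug"}
# Hypothetical severity label names; the page does not list the real ones.
SEVERITY_LABELS = {"severity:low", "severity:medium", "severity:high"}


def yearly_shares(issues):
    """Return {year: {line_name: percent of that year's issues meeting the requirement}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: {"typed": 0, "severity": 0, "milestone": 0, "conformant": 0})
    for issue in issues:
        year = issue["year"]
        labels = set(issue["labels"])
        n_type = len(labels & TYPE_LABELS)
        n_severity = len(labels & SEVERITY_LABELS)
        has_milestone = issue["milestone"] is not None
        totals[year] += 1
        hits[year]["typed"] += n_type >= 1
        hits[year]["severity"] += n_severity >= 1
        hits[year]["milestone"] += has_milestone
        # Conformant means exactly one type, exactly one severity, and a milestone.
        hits[year]["conformant"] += n_type == 1 and n_severity == 1 and has_milestone
    return {
        year: {name: 100 * count / totals[year] for name, count in hits[year].items()}
        for year in totals
    }


# Made-up cohort of four issues opened in the same year.
sample = [
    {"year": 2025, "labels": ["text-correction", "severity:low"], "milestone": "m1"},
    {"year": 2025, "labels": ["text-correction", "markup", "severity:low"], "milestone": "m1"},
    {"year": 2025, "labels": ["bug"], "milestone": None},
    {"year": 2025, "labels": [], "milestone": None},
]
shares = yearly_shares(sample)
print(shares[2025])
```

For this sample cohort the conformant share (25%) sits below the typed (75%), severity (50%) and milestone (50%) shares, which is the ordering the chart always shows.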

Conclusion: The taxonomy rollout was effectively complete by 2024–2025, with conformance near 92% at peak. The 2026 dip is expected — newly opened issues are not yet milestoned — and not a sign of declining standards. The data shows that retroactive label application (via the runbook) was largely successful in bringing the historical issue base into conformance.

Issue type distribution (all-time, top labels)

The all-time label distribution aggregates every issue ever opened across all repositories, showing the total historical footprint of each label type. This differs from the year-by-year typology chart: rather than showing trends, it shows the cumulative weight of each category over the org's entire history. The result is a direct readout of how the project has spent its collective attention.

How to read: Each bar is one issue label; its length equals the total number of issues carrying that label across all repositories and all time. Example 1: If text-correction sits at the top with several thousand issues, it means correcting OCR and transcription errors is and has always been the dominant activity of this project. Example 2: The (unlabeled) bar, if present, shows how many issues never received a type label — the pre-taxonomy historical backlog.
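The bar lengths, including the (unlabeled) bucket, could be derived as in the sketch below. It assumes each issue is represented simply as a list of label names and that only type labels form bars; both are assumptions, and the sample data is made up.

```python
from collections import Counter

# Type labels named on this page; the real set may be larger (assumption).
TYPE_LABELS = {"text-correction", "link-target", "markup", "enhancement", "bug"}


def label_distribution(issues, type_labels=TYPE_LABELS, top_n=10):
    """Count issues per type label; issues with no type label go to "(unlabeled)"."""
    counts = Counter()
    for labels in issues:
        typed = set(labels) & type_labels
        if typed:
            counts.update(typed)
        else:
            counts["(unlabeled)"] += 1
    return counts.most_common(top_n)  # longest bars first


# Made-up sample: each inner list is one issue's labels.
sample = [
    ["text-correction"],
    ["text-correction"],
    ["text-correction", "question"],
    ["link-target"],
    ["link-target"],
    [],
]
bars = label_distribution(sample)
for label, count in bars:
    print(f"{label:16} {count}")
```

An issue with two type labels contributes to two bars here, so the bar lengths can sum to more than the number of issues.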

Conclusion: text-correction and link-target together account for the majority of the project's entire issue history, directly mirroring its two dominant workflow phases: first correcting OCR errors in the scanned dictionary text, then building clickable links from source references to scanned PDF pages. The distribution is the project's work log turned into a bar chart.

Open vs closed by repo (top 20 most active)

This chart compares the total number of issues ever opened against the number closed for the 20 most active repositories. The ratio of open to closed tells two different stories depending on the repo: for high-volume correction repos like csl-orig and MWS, a high closure rate signals that bulk correction campaigns ran to completion; for smaller repos with persistent open segments, it reveals where current backlog is concentrated.

How to read: Each bar is split into closed (green) and open (amber); the total length equals all issues ever opened in that repository. Bars are sorted by total issue count. Example 1: A bar that is mostly green means the repo's historic backlog has been largely resolved — corrections were opened and closed in coordinated campaigns. Example 2: A long amber segment at the right end of a bar, especially for a smaller repo, means significant open work remains — either an active campaign in progress or issues that have been triaged but not yet addressed.
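The split bars could be built with a tally like the one below. This is a sketch assuming each issue is reduced to a (repo, state) pair with state either "open" or "closed"; the counts in the sample are made up and do not reflect the real repositories.

```python
from collections import Counter


def open_closed_by_repo(issues, top_n=20):
    """Return [(repo, closed, open)] for the top_n repos, sorted by total issues, largest first."""
    tally = {}
    for repo, state in issues:
        tally.setdefault(repo, Counter())[state] += 1
    rows = [(repo, c["closed"], c["open"]) for repo, c in tally.items()]
    rows.sort(key=lambda row: row[1] + row[2], reverse=True)
    return rows[:top_n]


# Made-up sample; repo names are taken from this page, counts are not real.
sample = [
    ("csl-orig", "closed"), ("csl-orig", "closed"), ("csl-orig", "closed"), ("csl-orig", "open"),
    ("MWS", "closed"), ("MWS", "open"),
]
rows = open_closed_by_repo(sample)
for repo, closed, still_open in rows:
    rate = closed / (closed + still_open)
    print(f"{repo:10} closed={closed} open={still_open} closure_rate={rate:.0%}")
```

The closure rate printed per repo is the green share of each bar; a low rate corresponds to a long amber segment.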

Conclusion: The correction-heavy repositories (csl-orig, MWS) carry the most issues in absolute terms but also show high closure rates — their campaigns ran to completion. The current open backlog is concentrated in a smaller number of repos where correction campaigns are still in progress or triaging has not kept pace with issue creation. The open-vs-closed split is a practical triage dashboard: amber-heavy bars are the next places to focus review effort.
