07 Industry · Publishing & Historic Research

The handwriting nobody else could read.

Libraries, archives and genealogy publishers turn to us for the corpora that defeat commercial OCR — multilingual hands, faded ink, marginalia. Domain-trained models read what they can; reviewers train the model on what they can’t.

95% Accuracy with HITL
~80% Raw model accuracy
Millions Of records indexed
12 Languages & scripts
I The industry

Where the record is the institution.


Publishers, libraries and research programmes sit on corpora that defeat commercial OCR: parish registers in Old Norse, eighteenth-century legal hands, microfilmed debates across mixed scripts. Honest baseline: a domain-trained model reads about 80% of these pages cleanly. The remaining 20% is where the work is: paleographer-led review, corrections fed back into training, accuracy climbing release after release. We have been doing this since 2005.

The model is the easy part. The reviewer loop is the deliverable.

Trusted in this industry
The British Library · The National Archives (UK) · Princeton · UC Berkeley · Ancestry · Kerala Media Academy
II Key challenges

The briefs nobody else says yes to.


Six structural problems we hear repeatedly from publishers, libraries and research programmes — and the honest answer to each.

/ 01 The corpus

Handwriting no model has seen

Old Norse, mixed Devanagari, eighteenth-century legal hands. Off-the-shelf OCR collapses on lines a reviewer reads in seconds.

HWR · Multi-script
/ 02 Accuracy reality

Raw 80% is the ceiling

Even domain-trained, models top out near 80% on hard hands. The 95% number is earned through reviewer corrections, not assumed.

Active learning · HITL
/ 03 Citation & rights

Every record traceable to source

Transcriptions face scholarly review. Rights chains and embargoes have to hold through the publishing pipeline.

Citation · Rights · GDPR
/ 04 Legacy schemas

MARC, EAD, MODS — without breakage

Migration without breaking inbound citations or losing the catalogue's existing identifiers.

MARC · EAD · MODS
/ 05 Throughput

On a publishing calendar

The catalogue ships on a schedule. SLAs by script, language and quality band — not aspirational.

Throughput · SLOs
/ 06 Reviewer UX

Built for the curator, not the engineer

The reviewer is an archivist or paleographer. Tooling speaks their vocabulary; their corrections retrain the model.

HITL · Curator UX
III Our solutions

What we ship into publishing operations.


Three workstreams that run together. The output is one audited corpus — not a stack of pilots.

01 Custom Software

A console the curator wants to use

A reviewer-grade indexing platform built around the archivist's workflow — not the engineer's.

React · TS · FastAPI · IIIF viewer
  • Reviewer console — side-by-side image / transcription, paleography aids, citation tools.
  • Schema-aware fields — MARC, EAD, MODS, Dublin Core out of the box.
  • Per-keystroke audit with rollback to any prior state of any record.
02 AI + Human-in-the-loop

The model reads. The reviewer trains it.

Domain-trained HWR gets to ~80% on hard hands. The 95% number comes from the active-learning loop: every reviewer correction becomes training data, the model retrains weekly, and an eval gate sits between runs.

HWR · Active learning · HITL
  • Custom HWR models per script, period and hand — paleographer-led labelling.
  • Active-learning loop — corrections become training data within the same week.
  • Eval gates on every retrain — accuracy moves up, never down.
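The eval gate described above can be pictured as a simple promotion rule. The following is a minimal sketch, not the platform's actual implementation: the names `EvalResult`, `passes_eval_gate` and `promote`, and the idea of scoring both models on the same fixed held-out set of reviewed fields, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Field-level score of one model version on a fixed held-out set (hypothetical)."""
    model_version: str
    correct_fields: int
    total_fields: int

    def __post_init__(self) -> None:
        if self.total_fields <= 0:
            raise ValueError("held-out set must contain at least one field")

    @property
    def accuracy(self) -> float:
        return self.correct_fields / self.total_fields


def passes_eval_gate(candidate: EvalResult, production: EvalResult,
                     min_gain: float = 0.0) -> bool:
    """Return True only if the candidate is at least as accurate as production."""
    return candidate.accuracy >= production.accuracy + min_gain


def promote(candidate: EvalResult, production: EvalResult) -> EvalResult:
    """Keep the production model unless the retrained candidate clears the gate."""
    return candidate if passes_eval_gate(candidate, production) else production
```

Under this rule a weekly retrain that scores lower than the model in production is simply not promoted, which is one way to read "accuracy moves up, never down".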
03 Integrations

Published into the catalogue, not an island

Outputs flow into your library system and repository — without breaking the citations external scholars already use.

MARC21 · METS · IIIF · OAI-PMH
  • Catalogue integration — Alma, Aleph, Koha via MARC21 / Z39.50.
  • Repository feeds — DSpace, Fedora, Islandora, IIIF.
  • Persistent IDs — DOI / ARK / handle minting; inbound citations preserved across migration.
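One way to check that inbound citations survive a migration is to confirm that every legacy identifier still resolves to a record afterwards. The sketch below assumes a hypothetical redirect map from legacy identifiers to new ones; the function name and the sample identifiers are placeholders for illustration, not part of any catalogue system's API.

```python
from typing import Iterable, List, Mapping


def find_orphaned_identifiers(legacy_ids: Iterable[str],
                              redirect_map: Mapping[str, str]) -> List[str]:
    """Return legacy identifiers that would stop resolving after migration."""
    return sorted(i for i in legacy_ids if i not in redirect_map)


# Placeholder data: three identifiers cited externally, two mapped so far.
legacy = ["legacy-0001", "legacy-0002", "legacy-0003"]
redirects = {"legacy-0001": "new-a1", "legacy-0003": "new-c3"}

orphans = find_orphaned_identifiers(legacy, redirects)
print(orphans)  # ['legacy-0002']
```

An empty result is the condition a migration would need to meet before the old identifiers are retired.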
IV Product mapping

Actigen Archive — the publishing edition.


The Actigen module for libraries, archives and publishers. Reviewer-led, retraining weekly, shipping audit-grade output.

Actigen 2.0 · Module

Actigen Archive

A complete archives-and-publishing pipeline. Configurable per script, period and schema. Ships with the reviewer console, model registry, audit pack and catalogue integrations already wired in.

Domain HWR registry
Per-script model selection. Versioned, eval-gated, rollback-safe.
Paleography console
Side-by-side image / transcription with paleographic aids.
Schema mapper
MARC21, EAD, MODS, Dublin Core — field-level round-trip.
IIIF-native viewer
Manifests auto-generated. Deep-zoom, annotation, citation links.
Active-learning loop
Weekly retrain cadence. Eval gates enforced before promotion.
Lineage & audit pack
Per-record provenance. METS / PREMIS export on demand.
V Use cases

Real briefs, real corpora.


Six briefs we are running this year — each anchored in a specific corpus and reviewer audience.

01

Vital records

Parish registers, civil registries, church books — converted into searchable, citation-grade records.

Genealogy publishers · State registries
Corpus: 15th–20th c. · Norse, Latin, Gothic
02

Legislative archives

Hansards, debates, bills and committee proceedings — searchable across script and language reforms.

Legislatures · Parliamentary libraries
Reference: Kerala Legislative Assembly · 1888–present
03

Scholarly editions

Variant collation, marginalia, footnote linking — structured for digital editions with TEI export.

University presses · Research programmes
Standards: TEI P5 · METS · IIIF · BIBFRAME
04

Newspaper archives

Article-level segmentation, byline extraction, topic indexing — published into reader and search products.

News publishers · Media archives
Output: Article-level OAI-PMH · IIIF manifests
05

Manuscripts & rare books

Codices, charters, illuminated manuscripts. Conservation-aware capture, paleographer-led labelling.

National libraries · Special collections
Reference: British Library · National Archives (UK)
06

Subscription publisher ops

Continuous indexing for subscription publishers — with throughput SLAs and quality-band reporting.

Genealogy publishers · Subscriptions
Reference: Ancestry · domain HWR partner
VI Business outcomes

What changes after the work ships.


Three outcomes clients consistently report. Per-engagement targets are agreed in writing during Discover.

/ Efficiency

Reviewers become a quality function

Once the model clears its eval gate, reviewer hours shift from transcription to verification — and corrections retrain the model.

  • 3–5× reviewer throughput uplift
  • Time-to-95% halves with each retrain cycle
  • Edge cases isolated; bulk is automated
/ Cost optimisation

Cost per record on a curve that bends

Unit cost drops as the active-learning loop closes. Most publishers see meaningful reduction within the first quarter.

  • 40–60% cost reduction vs. manual indexing
  • Cost per record on a measurable downward curve
  • Reviewer effort concentrated where it matters
/ Scalability

From one corpus to a programme

Additional corpora — new scripts, new periods — onboard in weeks. The infrastructure is reusable; only the domain model is new.

  • New corpus onboarded in 4–8 weeks
  • Reviewer teams scale without retooling
  • One audit-pack standard across every corpus
VII Industry metrics

The numbers publishers ask for.


Numbers that map to how libraries, publishers and research programmes report up — boards, funders, subscribers.

~80% · Raw HWR accuracy on hard hands · Domain-trained baseline
95% · Field accuracy after HITL & retrain · Typical post-loop outcome
3–5× · Reviewer throughput uplift · After active-learning closes
12 · Languages & scripts in active production · Across current engagements
100% · Records with full lineage & audit trail · By default
99.1% · On-time delivery across SBL projects · Across 4,000+ projects
Typical ranges. Per-engagement targets agreed during Discover.
Get a free consultation

Have a corpus that won’t fit a commercial pipeline? Send a sample folio and we will return a measured pilot proposal within the working week.

VIII Case studies

Work the scholar can still cite.


Four engagements where the corpus shipped, the lineage held, and the citations survived scholarly review.

80 million Norwegian records. Fifteenth to nineteenth century. Ninety-nine-point-five per cent.

No commercial model was trained on the handwriting. We engaged a Norwegian genealogist to label, fine-tuned a domain HWR model, and wrapped it in a custom indexing platform with reviewer-grade HITL. The accuracy threshold was cleared, not approached.

Delivered on schedule and above threshold. Every discrepancy remained traceable through the lineage layer.

6.5 million herbarium sheets, turned from closed collection into a global research engine.

A productised heritage operating model for the world's most significant botanical archives. High-resolution capture, transcription of historical scripts, metadata indexing, and global scientific accessibility — end-to-end. Hand-signed Darwin sheets included.

6.5 million sheets digitised. Two centuries of botanical history made searchable for climate and biodiversity research worldwide.
X Compliance & standards

The standards your auditor cares about.


Externally verifiable credentials and the standards every output is engineered to satisfy. The audit pack travels with the system.

/ Data security

Information security & privacy

External audits aligned to ISO 27001 and 27701. Privacy extends to reviewer roles, embargoes and rights-bearing material.

  • ISO 27001 · ISO 27701 (PIMS)
  • GDPR · Article 89 research provisions
  • Reviewer-role RBAC with embargo enforcement
/ Regulatory

Rights & accessibility

Copyright, embargoes and accessibility — handled in the platform, not in policy documents.

  • Per-record copyright clearance workflows
  • Rights metadata · RightsStatements.org
  • WCAG 2.2 AA compliance
/ Governance

Cataloguing & preservation

The standards your catalogue, repository and subscribers expect — implemented natively.

  • MARC21 · EAD · MODS · Dublin Core
  • METS / PREMIS · OAIS · IIIF · TEI P5
  • OAI-PMH · DOI / ARK / handle
SBL credentials: ISO 9001 (Quality) · ISO 27001 (Security) · ISO 27701 (Privacy) · CMMI Level 3 (Maturity) · GDPR (Data protection)
Trusted by 100+ clients · 20+ years in regulated technology · 4,000+ projects delivered · 99.1% on-time
The model alone wouldn't have got us there. What worked was the loop — our reviewers correcting, the system retraining, accuracy climbing release after release. By the third quarter we were publishing on schedule.
Editorial Director, a genealogy publisher · Norwegian HWR programme
Genealogy publisher: HWR · Norse / Latin
~80% raw → 95% with HITL
Active learning · weekly retrain
XI Tell us about your project

Send a sample folio. Receive a measured proposal.


Send a folio. Receive a measured pilot proposal within the working week.

Phone: +44 791 884 7631
Industry: Publishing & Historic Research