OCRopodium
Optical Character Recognition (OCR) is a key component of large-scale digitisation projects that deal with text-based material. Typically such digitisation projects rely on closed, proprietary software or on commercial companies.
This raises a number of issues, including: the cost of proprietary software and/or external consultants; the lack of flexibility and adaptability of closed software; the deskilling of digitisation staff, as OCR expertise is concentrated in commercial companies; and the appropriateness of the software to historical material.
The OCRopodium project will address some of these issues by:
- Trialling an open-source approach to Optical Character Recognition, using OCRopus software.
- Embedding OCR activities within flexible, semi-automated digitisation workflows for text-based material.
Using a collaborative, distributed and semi-automated workflow embedded in institutional practices will help address the whole digitisation process, from scanning, through OCR and mark-up, to ingest into a repository where the content is managed and preserved.
This project will help close a significant skills gap by reducing the reliance on commercial OCR providers in favour of open-source OCR technology, which will allow adaptation and development through community involvement.
Download the project plan (PDF)
Project Staff
Mark Hedges, Deputy Director, Centre for e-Research, King's College London