A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents

This is an open dataset of scanned images and OCR texts from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.

Identifier
PID http://hdl.handle.net/11234/1-4615
Related Identifier https://nlp.fi.muni.cz/projects/ahisto/ocr-dataset
Related Identifier https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
Related Identifier https://starfos.tacr.cz/en/project/TL03000365
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-4615
Provenance
Creator Novotný, Vít; Seidlová, Kristýna; Vrabcová, Tereza; Horák, Aleš
Publisher Masaryk University, Brno
Publication Year 2021
Rights Public Domain Dedication (CC Zero); http://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language German; Czech; Latin; English
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; application/octet-stream; downloadable_files_count: 5
Discipline Linguistics
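
The Metadata Access URL above is an OAI-PMH GetRecord request that returns this record as Dublin Core XML. The following is a minimal sketch of how such a request could be built and its response read, assuming only the Python standard library. The function names are hypothetical and chosen for illustration; the base URL, metadata prefix, and identifier are copied from the record above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Values copied from the Metadata Access field of this record.
OAI_BASE = "http://lindat.mff.cuni.cz/repository/oai/request"
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-4615"

# Standard Dublin Core element namespace used by the oai_dc format.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(base, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep the identifier readable, as in the record above
    )
    return f"{base}?{query}"


def fetch_record_xml(url, timeout=30):
    """Download the raw XML response (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_dc_titles(xml_bytes):
    """Return the text of every dc:title element in the response."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter(f"{{{DC_NS}}}title")]


if __name__ == "__main__":
    url = build_get_record_url(OAI_BASE, OAI_IDENTIFIER)
    print(url)
    # Uncomment to query the repository; availability is an assumption.
    # print(extract_dc_titles(fetch_record_xml(url)))
```

The network call is left commented out so the sketch runs offline; whether the endpoint is reachable and which Dublin Core fields it returns should be checked against the live repository.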