A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents

This is an open dataset of scanned images and OCR texts from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.

Identifier
PID	http://hdl.handle.net/11234/1-4615
Related Identifier	https://nlp.fi.muni.cz/projects/ahisto/ocr-dataset
Related Identifier	https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
Related Identifier	https://starfos.tacr.cz/en/project/TL03000365
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-4615
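
The Metadata Access URL above is an OAI-PMH GetRecord request that asks for the record in the oai_dc (Dublin Core) format. The following is a minimal sketch of how such a request could be built and its response read with the Python standard library. The function names (build_get_record_url, parse_dc_fields, fetch_record) are hypothetical helpers introduced for illustration, and the sketch assumes the endpoint returns standard Dublin Core elements in the usual `http://purl.org/dc/elements/1.1/` namespace; it has not been checked against the live service.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL taken from the Metadata Access field of this record.
OAI_BASE = "http://lindat.mff.cuni.cz/repository/oai/request"
# Standard Dublin Core element namespace (assumed to be used in the response).
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for one record identifier."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # leave ':' and '/' unescaped, as in the URL listed above
    )
    return f"{OAI_BASE}?{query}"


def parse_dc_fields(xml_text):
    """Collect every Dublin Core element in the response as name -> list of values."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


def fetch_record(identifier):
    """Download and parse one record (requires network access)."""
    url = build_get_record_url(identifier)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_dc_fields(response.read().decode("utf-8"))
```

Calling `fetch_record("oai:lindat.mff.cuni.cz:11234/1-4615")` would then be expected to return a dictionary with keys such as `title`, `creator`, and `language`, provided the service responds in the assumed format.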

Provenance
Creator	Novotný, Vít; Seidlová, Kristýna; Vrabcová, Tereza; Horák, Aleš
Publisher	Masaryk University, Brno
Publication Year	2021
Rights	Public Domain Dedication (CC Zero); http://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	German; Czech; Latin; English
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; application/octet-stream; downloadable_files_count: 5
Discipline	Linguistics