Natas - Python 3 library for processing historical English - Dataset

This library is intended to provide methods for processing historical English corpora, especially for studying neologisms. The first features to be released cover normalization of historical spelling and OCR post-correction.

Cite

If you use the library, please cite one of the following publications, depending on whether you used it for normalization or OCR correction.

Normalization

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2019. Revisiting NMT for Normalization of Early English Letters. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature.

OCR correction

Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of Recent Advances in Natural Language Processing.

Identifier
PID	http://hdl.handle.net/11304/0f5c990d-7f7e-4e12-ba41-7bec41b26a03
Metadata Access	https://b2share.eudat.eu/api/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:b2share.eudat.eu:b2rec/2a9a25be10e442e2a75f8c688a1c82c4

Provenance
Creator	Hämäläinen, Mika; Hengchen, Simon; Säily, Tanja; Rueter, Jack; Tiedemann, Jörg; Mäkelä, Eetu
Publisher	CLARIN
Publication Year	2020
Rights	info:eu-repo/semantics/openAccess; Apache-2.0 License
OpenAccess	true

Representation
Discipline	Linguistics