ACL word segmentation correction

The data in this collection consists of two parallel directories: "raw", which contains the raw text of 18850 articles from the ACL 2013/02 collection, and "re-segmented", which contains the word-resegmented versions of the same articles. The re-segmentation was produced with Nematus, a seq2seq neural model used for machine translation. The work was motivated by the observation that spurious spaces appeared to be very common in the text, particularly in older papers whose text was obtained by OCR-ing scanned copies.
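Because the two directories are described as parallel, a reader may want to walk them side by side. The following is a minimal sketch of one way to do that in Python, not part of the dataset itself. It assumes that the archive has been unpacked under a local root directory, that the directory names match the "raw" and "re-segmented" labels above, and that corresponding articles share the same relative file path in both directories. The function name `paired_articles` is hypothetical; none of these layout details is confirmed by this record.

```python
from pathlib import Path
from typing import Iterator, Tuple


def paired_articles(root: Path,
                    raw_dir: str = "raw",
                    reseg_dir: str = "re-segmented") -> Iterator[Tuple[Path, Path]]:
    """Yield (raw_file, resegmented_file) pairs that share a relative path.

    Assumption: both directories use identical relative file paths for the
    same article. Files present in only one directory are skipped.
    """
    raw_root = root / raw_dir
    reseg_root = root / reseg_dir
    for raw_file in sorted(raw_root.rglob("*")):
        if not raw_file.is_file():
            continue
        # Look for the file with the same relative path in the other directory.
        candidate = reseg_root / raw_file.relative_to(raw_root)
        if candidate.is_file():
            yield raw_file, candidate
```

A caller could iterate over `paired_articles(Path("path/to/unpacked/data"))` and read both files of each pair to compare the original and re-segmented text; the path shown is a placeholder.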

Identifier
DOI https://doi.org/10.11588/data/VK99LU
Related Identifier https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/ACL_corrected/lrec2018_correction-ocr-word.pdf
Metadata Access https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/VK99LU
Provenance
Creator Nastase, Vivi; Hitschler, Julian
Publisher heiDATA
Contributor Nastase, Vivi
Publication Year 2019
Rights info:eu-repo/semantics/openAccess
OpenAccess true
Contact Nastase, Vivi (Department of Computational Linguistics, Heidelberg University, Germany)
Representation
Resource Type textual data; Dataset
Format application/gzip; text/plain; charset=US-ASCII
Size 389091782; 782
Version 1.1
Discipline Humanities
Spatial Coverage Heidelberg University