ACL word segmentation correction

Dataset

This collection consists of two parallel directories: "raw", which contains the raw text of 18850 articles from the ACL 2013/02 collection, and "re-segmented", which contains the word-resegmented versions of the same articles. The re-segmentation was produced with Nematus, a seq2seq neural model normally used for machine translation. The work was motivated by the observation that spurious spaces are very common in the text, particularly in older papers whose text was obtained by OCR of scanned pages.
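Because the two directories are parallel, a raw article and its re-segmented counterpart can be compared side by side. The following is a minimal sketch, not part of the dataset: it assumes the archive has already been unpacked, that corresponding files share the same relative path under both directories, and that re-segmentation changes only whitespace. None of these assumptions is confirmed by the record, and the helper names `paired_files` and `differs_only_in_spacing` are hypothetical.

```python
from pathlib import Path


def paired_files(raw_dir, reseg_dir):
    """Yield (raw_path, reseg_path) for files found under both directories.

    Assumption: a file keeps the same relative path in both directories.
    """
    raw_dir, reseg_dir = Path(raw_dir), Path(reseg_dir)
    for raw_path in sorted(p for p in raw_dir.rglob("*") if p.is_file()):
        reseg_path = reseg_dir / raw_path.relative_to(raw_dir)
        if reseg_path.is_file():
            yield raw_path, reseg_path


def differs_only_in_spacing(raw_text, reseg_text):
    """Return True if both strings are identical once all whitespace is removed.

    Assumption: re-segmentation only inserts or deletes whitespace.
    """
    return "".join(raw_text.split()) == "".join(reseg_text.split())
```

A possible use is to loop over `paired_files("raw", "re-segmented")`, read both files of each pair, and flag any pair for which `differs_only_in_spacing` returns False, since those would indicate that the model changed characters other than spaces.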

Identifier
DOI	https://doi.org/10.11588/data/VK99LU
Related Identifier	https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/ACL_corrected/lrec2018_correction-ocr-word.pdf
Metadata Access	https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/VK99LU

Provenance
Creator	Nastase, Vivi; Hitschler, Julian
Publisher	heiDATA
Contributor	Nastase, Vivi
Publication Year	2019
Rights	info:eu-repo/semantics/openAccess
OpenAccess	true
Contact	Nastase, Vivi (Department of Computational Linguistics, Heidelberg University, Germany)

Representation
Resource Type	textual data; Dataset
Format	application/gzip; text/plain; charset=US-ASCII
Size	389091782 bytes; 782 bytes
Version	1.1
Discipline	Humanities
Spatial Coverage	Heidelberg University