Cleaned Polish Oscar corpus (64M lines)

Cleaned Polish Oscar corpus (part: 64M lines, 3.45 GB). The data was prepared with a few cleaning heuristics:
- remove sentences shorter than
- remove non-Polish sentences
- remove ungrammatical sentences
- perform sentence tokenization and save each sentence on a new line; after each document, an additional newline was added
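The layout described above (one sentence per line, with an empty line closing each document) can be read back into per-document sentence lists. The following is a minimal sketch under that assumption only; the function name iter_documents and the sample sentences are hypothetical and not part of the corpus tooling.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines.

    Assumes the layout from the description: one sentence per line,
    and an empty line after each document.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            # Non-empty line: one sentence of the current document.
            sentences.append(line)
        elif sentences:
            # Empty line: the current document is complete.
            yield sentences
            sentences = []
    if sentences:
        # Last document, in case the input lacks a trailing empty line.
        yield sentences


# Hypothetical sample in the described layout (not taken from the corpus).
sample = [
    "To jest pierwsze zdanie.\n",
    "To jest drugie zdanie.\n",
    "\n",
    "Nowy dokument zaczyna się tutaj.\n",
    "\n",
]
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a multi-gigabyte file can be streamed without loading it into memory.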

Identifier
PID http://hdl.handle.net/11321/843
Related Identifier https://github.com/Ermlab/PoLitBert/
Metadata Access https://clarin-pl.eu/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:clarin-pl.eu:11321/843
Provenance
Creator Sopyła, Krzysztof
Publisher Ermlab
Publication Year 2021
OpenAccess true
Contact clarin-pl(at)pwr.edu.pl
Representation
Language Polish
Resource Type corpus
Format downloadable_files_count: 0
Discipline Linguistics