Cleaned Polish Oscar corpus (64M lines)

Dataset

Cleaned Polish Oscar corpus (part: 64M lines, 3.45 GB). The data was prepared with a few cleaning heuristics:
- remove sentences shorter than
- remove non-Polish sentences
- remove ungrammatical sentences
- perform sentence tokenization and save each sentence on a new line; an additional new line was added after each document
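
The layout described above (one sentence per line, an extra new line after each document) can be read back into documents with a short loop. The following is a minimal sketch, not code from the corpus authors: it assumes the extra new line shows up as an empty line between documents and that the file is UTF-8 text. The function name `iter_documents` is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents.

    Assumption: each non-empty line is one sentence and an empty line
    marks the end of a document.
    """
    sentences: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            sentences.append(sentence)
        elif sentences:
            # Empty line reached: the current document is complete.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by an empty line.
        yield sentences


# Small in-memory sample that mimics the assumed file layout.
sample_lines = [
    "Ala ma kota.\n",
    "Kot ma Alę.\n",
    "\n",
    "To jest drugi dokument.\n",
    "\n",
]
sample_documents = list(iter_documents(sample_lines))
```

Because the function accepts any iterable of lines, an open file handle (for example `open(path, encoding="utf-8")`, where `path` points to a downloaded corpus file) can be passed in directly, so the 3.45 GB file is streamed rather than loaded into memory at once.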

Identifier
PID	http://hdl.handle.net/11321/843
Related Identifier	https://github.com/Ermlab/PoLitBert/
Metadata Access	https://clarin-pl.eu/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:clarin-pl.eu:11321/843

Provenance
Creator	Sopyła, Krzysztof
Publisher	Ermlab
Publication Year	2021
OpenAccess	true
Contact	clarin-pl(at)pwr.edu.pl

Representation
Language	Polish
Resource Type	corpus
Format	downloadable_files_count: 0
Discipline	Linguistics