Finnish web corpus fiWaC 1.0

Dataset

The Finnish web corpus fiWaC was built by crawling the .fi top-level domain in 2015 for both Finnish and English documents. The corpus was tokenised naively (by splitting on spaces), near-deduplicated at the paragraph level, and shuffled by paragraph. Each paragraph carries metadata giving its source URL and its language identification result. The Finnish (~1.7B tokens) and English (~2B tokens) parts of the corpus are stored in separate files.
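
As an illustration of what naive space-based tokenisation implies for the token counts above, the following minimal Python sketch counts whitespace-separated tokens in a gzip-compressed UTF-8 text stream. The function names are hypothetical, and the internal layout of the corpus files is not described in this record, so the sketch assumes plain text with one unit per line; any metadata or markup lines in the actual files would need to be filtered out before counting.

```python
import gzip
import io


def count_whitespace_tokens(lines):
    """Count tokens by splitting each line on whitespace (naive tokenisation)."""
    return sum(len(line.split()) for line in lines)


def count_tokens_in_gzip(path):
    """Count whitespace tokens in a gzip-compressed UTF-8 text file.

    `path` is a hypothetical local file name; this record does not list
    the names of the downloadable files.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return count_whitespace_tokens(handle)


# Self-contained demonstration on an in-memory gzip stream (made-up sample text,
# not taken from the corpus).
sample = "Hyvää päivää !\nThis is a test .\n"
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(sample.encode("utf-8"))
buffer.seek(0)

with gzip.open(buffer, mode="rt", encoding="utf-8") as handle:
    demo_count = count_whitespace_tokens(handle)

print(demo_count)
```

Because the text is read line by line, memory use stays small even for files with hundreds of millions of tokens.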

Identifier
PID	http://hdl.handle.net/11356/1074
Related Identifier	https://cordis.europa.eu/project/id/324414
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1074

Provenance
Creator	Ljubešić, Nikola; Pirinen, Tommi; Toral, Antonio
Publisher	Jožef Stefan Institute
Publication Year	2016
Funding Reference	info:eu-repo/grantAgreement/EC/FP7/324414
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Finnish; English
Resource Type	corpus
Format	application/gzip; text/plain; charset=utf-8; downloadable_files_count: 38
Discipline	Linguistics