C4Corpus (publicdomain part) - Dataset - B2FIND

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.

Identifier
PID	http://hdl.handle.net/11372/LRT-2209
Related Identifier	http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf
Related Identifier	https://dkpro.github.io/dkpro-c4corpus/
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11372/LRT-2209
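
The Metadata Access URL above is an OAI-PMH GetRecord request that asks for the record in Dublin Core (oai_dc). The following is a minimal sketch of how such a URL can be assembled and how the Dublin Core fields in a response might be read. The function names build_get_record_url and dc_fields are hypothetical, and SAMPLE_RESPONSE is a hand-written stand-in for illustration, not an actual reply from the LINDAT server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL and record identifier, both taken from the Metadata Access field.
OAI_BASE = "http://lindat.mff.cuni.cz/repository/oai/request"
RECORD_ID = "oai:lindat.mff.cuni.cz:11372/LRT-2209"

# Standard Dublin Core element namespace used by the oai_dc format.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(base, identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the result matches the catalog's URL.
    return f"{base}?{urlencode(params, safe=':/')}"


def dc_fields(xml_text):
    """Collect Dublin Core elements into {name: [values]} (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    prefix = "{" + DC_NS + "}"
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


# Hand-written, abbreviated stand-in for a response; values come from this record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>C4Corpus (publicdomain part)</dc:title>
      <dc:publisher>Technische Universität Darmstadt</dc:publisher>
      <dc:type>corpus</dc:type>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(build_get_record_url(OAI_BASE, RECORD_ID))
    print(dc_fields(SAMPLE_RESPONSE))
```

To use it against the live endpoint, the URL returned by build_get_record_url could be fetched with any HTTP client and the response body passed to dc_fields. The actual response may contain more fields than the stand-in shown here.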

Provenance
Creator	Gurevych, Iryna; Habernal, Ivan; Zayed, Omnia
Publisher	Technische Universität Darmstadt
Publication Year	2016
Rights	Public Domain Mark (PD); http://creativecommons.org/publicdomain/mark/1.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Afrikaans; Arabic; Bulgarian; Czech; Danish; German; Greek, Modern (1453-); Greek; English; Estonian; Persian; Farsi; Finnish; French; Croatian; Hungarian; Indonesian; Italian; Japanese; Korean; Latvian; Lithuanian; Dutch; Flemish; Norwegian; Polish; Portuguese; Russian; Slovenian; Slovene; Somali; Spanish; Castilian; Swahili; Swedish; Tagalog; Thai; Turkish; Ukrainian; Undetermined; Vietnamese
Resource Type	corpus
Format	application/x-gzip; text/plain; downloadable_files_count: 36
Discipline	Linguistics