Japanese web corpus with difficulty levels jpWaC-L 1.0


The corpus contains over 300 million words, with words and sentences annotated for difficulty level. Words are assigned difficulty levels according to the Japanese Language Proficiency Test Content Specifications (2004). The difficulty level of each sentence is computed using various heuristics based on the difficulty levels of its words, sentence length, and similar features. The corpus was collected from the Web using WaCkY tools, then part-of-speech tagged and lemmatised with Chasen. The Japanese Chasen tags have also been converted to English-based tags.

The corpora are made available in vertical format. The structural attributes mark texts and sentences. Each text gives its @url and @domain. Sentences have the @level attribute, which describes their difficulty level. The positional attributes are:
1. token, as it appears in the text
2. lemma of the word
3. Chasen tag, translated to English
4. original Chasen tag in Japanese
5. difficulty level of the word
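
The layout described above could be read with a short Python sketch like the one below. It is only an illustration under stated assumptions: columns are taken to be tab-separated (the usual convention for vertical files), the sentence element is assumed to be written as `<s level="...">` and closed with `</s>`, and the names `Token` and `read_sentences` are hypothetical helpers, not part of the corpus distribution. The sample tokens, tags and level values are invented for the example.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str    # 1. token, as it appears in the text
    lemma: str   # 2. lemma of the word
    tag_en: str  # 3. Chasen tag, translated to English
    tag_ja: str  # 4. original Chasen tag in Japanese
    level: str   # 5. difficulty level of the word


# Assumption: sentences open with <s ... level="..."> and close with </s>.
S_OPEN = re.compile(r'<s\b[^>]*\blevel="([^"]*)"')


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[str, List[Token]]]:
    """Yield (sentence_level, tokens) for each sentence in a vertical file."""
    level, tokens, in_sentence = "", [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = S_OPEN.match(line)
        if match:
            level, tokens, in_sentence = match.group(1), [], True
        elif line.startswith("</s>"):
            if in_sentence:
                yield level, tokens
            in_sentence = False
        elif in_sentence and line and not line.startswith("<"):
            # Assumption: exactly five tab-separated columns per token line.
            cols = line.split("\t")
            if len(cols) == 5:
                tokens.append(Token(*cols))
        # Any other markup line (e.g. text-level elements) is skipped.


# Invented sample; real tag and level values may differ.
sample = [
    '<s level="4">\n',
    "猫\t猫\tN\t名詞\t4\n",
    "です\tです\tAux\t助動詞\t4\n",
    "</s>\n",
]

for sentence_level, sentence_tokens in read_sentences(sample):
    print(sentence_level, [t.form for t in sentence_tokens])
```

For the real files, the same function could be fed from `gzip.open(path, "rt", encoding="utf-8")`, since the record lists gzip-compressed UTF-8 text. Token lines that themselves begin with `<` would be skipped by this sketch.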

The complete corpus is also split into sub-corpora of sentences with the same difficulty level.
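
A comparable split can be sketched in a few lines of Python, for instance to regroup sentences from the complete corpus. This is not the procedure used to build the distributed sub-corpora. It assumes, hypothetically, that each sentence opens with a line of the form `<s level="...">` and closes with `</s>`, and `split_by_level` is an illustrative name. The sample lines are invented.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: the sentence element carries its difficulty as level="...".
LEVEL = re.compile(r'<s\b[^>]*\blevel="([^"]*)"')


def split_by_level(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the raw lines of each sentence under its difficulty level."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    current = None  # level of the sentence being read, if any
    for line in lines:
        match = LEVEL.match(line)
        if match:
            current = match.group(1)
        if current is not None:
            buckets[current].append(line)
        if line.startswith("</s>"):
            current = None
    return dict(buckets)


# Invented sample with two sentences of different levels.
sample = [
    '<s level="4">\n', "猫\t猫\tN\t名詞\t4\n", "</s>\n",
    '<s level="2">\n', "概念\t概念\tN\t名詞\t2\n", "</s>\n",
]

for level, chunk in sorted(split_by_level(sample).items()):
    print(level, len(chunk), "lines")
```

Lines outside any sentence, such as text-level markup, are dropped here. A variant that keeps them would need to know the name of the text element.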

PID http://hdl.handle.net/11356/1047
Related Identifier http://nl.ijs.si/jaslo/index-en.html#jpwac
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1047
Creator Erjavec, Tomaž; Hmeljak Sangawa, Kristina; Kawamura, Yoshiko
Publisher Jožef Stefan Institute
Publication Year 2008
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Language Japanese
Resource Type corpus
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 6
Discipline Linguistics