The corpus contains over 300 million words, with words and sentences annotated for difficulty level. Words are assigned difficulty levels according to the Japanese Language Proficiency Test Content Specifications (2004). Sentence difficulty is computed with various heuristics based on word difficulty, sentence length, etc. We distinguish five difficulty levels, from L0 (very difficult) to L4 (very easy).
The corpus was collected from the Web using WaCkY tools, then part-of-speech tagged and lemmatised with Chasen. The Japanese Chasen tags have also been converted to English-language tags.
The corpora are made available in vertical format. Structural attributes mark texts and sentences. Each text gives its @url and @domain. Sentences have the @level attribute, which describes their difficulty level. The positional attributes are:
1. token, as it appears in the text
2. lemma of the word
3. Chasen tag, translated to English
4. original Chasen tag in Japanese
5. difficulty level of the word.
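To make the layout concrete, here is a minimal Python sketch of reading such a file. It assumes the common vertical-format conventions: one token per line, positional attributes separated by tabs, and structures written as tags on their own lines. The tag names `<text>` and `<s>`, the function and field names, and the sample data (including the tag and level values) are illustrative assumptions, not taken from the corpus itself.

```python
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """The five positional attributes, in the order listed above."""
    form: str    # 1. token as it appears in the text
    lemma: str   # 2. lemma
    tag_en: str  # 3. Chasen tag translated to English
    tag_ja: str  # 4. original Chasen tag in Japanese
    level: str   # 5. difficulty level of the word


ATTR_RE = re.compile(r'(\w+)="([^"]*)"')  # name="value" pairs inside a tag
SENT_OPEN_RE = re.compile(r"<s[\s>]")     # assumed sentence tag: <s ...>


def read_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, str], Dict[str, str], List[Token]]]:
    """Yield (text attributes, sentence attributes, tokens) per sentence."""
    text_attrs: Dict[str, str] = {}
    sent_attrs: Dict[str, str] = {}
    tokens: Optional[List[Token]] = None  # None while outside a sentence
    for raw in lines:
        line = raw.rstrip("\r\n")
        if "\t" in line:
            # Token line: keep it only if it has exactly five columns.
            fields = line.split("\t")
            if tokens is not None and len(fields) == 5:
                tokens.append(Token(*fields))
        elif line.startswith("<text"):
            text_attrs = dict(ATTR_RE.findall(line))
        elif SENT_OPEN_RE.match(line):
            sent_attrs = dict(ATTR_RE.findall(line))
            tokens = []
        elif line == "</s>" and tokens is not None:
            yield text_attrs, sent_attrs, tokens
            tokens = None


# Invented sample; the tag and level values are placeholders.
SAMPLE = "\n".join([
    '<text url="http://example.com/a" domain="example.com">',
    '<s level="L4">',
    "猫\t猫\tNoun\t名詞-一般\tL4",
    "が\tが\tParticle\t助詞-格助詞-一般\tL4",
    "いる\tいる\tVerb\t動詞-自立\tL4",
    "</s>",
    "</text>",
])

for text_attrs, sent_attrs, tokens in read_sentences(SAMPLE.splitlines()):
    print(text_attrs.get("domain"), sent_attrs.get("level"),
          [t.lemma for t in tokens])
```

For a real file, an open file handle can be passed in place of `SAMPLE.splitlines()`. The tag names and the exact encoding of the level values should be checked against the distributed files first.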
The complete corpus is also split into sub-corpora of sentences with the same difficulty level.
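A comparable split can be sketched from the complete corpus by grouping sentences on their level attribute. This is only an illustration, not the procedure used to build the distributed sub-corpora. It assumes sentences are delimited by `<s level="...">` and `</s>` lines; the function name `split_by_level` and the demo data are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Assumed opening tag of a sentence, e.g. <s level="L4">
LEVEL_RE = re.compile(r'<s\b[^>]*\blevel="([^"]*)"')


def split_by_level(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Group sentence blocks (tags plus token lines) by sentence level."""
    buckets: Dict[str, List[List[str]]] = defaultdict(list)
    current: Optional[List[str]] = None  # lines of the sentence being read
    level = ""
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = None if "\t" in line else LEVEL_RE.match(line)
        if match:
            level, current = match.group(1), [line]
        elif current is not None:
            current.append(line)
            if line == "</s>":
                buckets[level].append(current)
                current = None
    return dict(buckets)


# Invented demo data; tag and level values are placeholders.
DEMO_LINES = [
    '<text url="http://example.com/a" domain="example.com">',
    '<s level="L4">', "猫\t猫\tNoun\t名詞-一般\tL4", "</s>",
    '<s level="L0">', "蓋然性\t蓋然性\tNoun\t名詞-一般\tL0", "</s>",
    '<s level="L4">', "犬\t犬\tNoun\t名詞-一般\tL4", "</s>",
    "</text>",
]

by_level = split_by_level(DEMO_LINES)
for lvl in sorted(by_level):
    print(lvl, len(by_level[lvl]), "sentence(s)")
```

Each bucket keeps the original sentence lines, so writing one bucket to a file gives a vertical-format fragment for that level. The enclosing text tags are dropped in this sketch.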