CLARIN - Repositories

ACTER (Annotated Corpora for Term Extraction Research) v1.4

The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised...

ACTER (Annotated Corpora for Term Extraction Research) v1.3

The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised...

Cleaned Polish Oscar corpus (64M lines)

Cleaned Polish Oscar corpus (part: 64M lines, 3.45 GB). Data was prepared with a few cleaning heuristics: - remove sentences shorter than - remove non-Polish sentences...

Korpus - Wikinews

Corpus with texts on various topics from the world and technology.

ChunkRel WS

ChunkRel-WS is a prototype service for recognition of three syntactic relations between chunks. The service may be run against plain text (input format: text), then the...

Lilia

Sample of historical texts

Big Data language model - subword - BPE - ARPA

Big data language model over subword units obtained with byte pair encoding (BPE), provided in ARPA format

XLM-RoBERTa-LARGE events relation recognition

A set of basic language tools for the Polish language. Z4.2a Improving the quality of recognition of relations between events using Transformer-type deep networks.

Blogi_zip 02

Blogs (zip archive)

Wiki test - 34 categories

Wikipedia, 34 categories - a dataset for testing a classifier

PELCRA PARL corpus

The corpus comprises 50 sampled recordings (12 hours) and manual transcriptions (ca. 101 00 word tokens) of parliamentary data.

Assamese Root Words

This list comprises Assamese root words. The Assamese Root Word List contains 15,750 words. These Assamese NLP resources, including the Tools and Applications, are...

Enriched corpus of [Polish] frequency dictionary

Enriched corpus of the frequency dictionary, cf. http://clip.ipipan.waw.pl/PL196x?action=AttachFile&do=view&target=wksf.pdf

Cyfry

A small spoken digits corpus in Polish. Contains 488 recordings of 25 speakers reading 20 digits (0-9) each. Amounts to around 76 minutes of recordings. Split into train (~72%),...

Polimorf

PoliMorf is a morphological dictionary for Polish resulting from the standardization and merger of Morfeusz SGJP and Morfologik. The present version includes extended...

KPWr annotation guidelines - phrase lemmatization

Annotation guidelines for manual phrase lemmatisation in KPWr (Polish Corpus of Wrocław University of Technology).

Wizerunek Andreja Babiša i Mateusza Morawieckiego w kontekście sytuacji kryzy...

A collection of articles from the Czech press about Mateusz Morawiecki (iDnes) and from the Polish press about Andrej Babiš (Rzeczpospolita)

Świgra — a parser of Polish

Świgra is a parser of Polish generating constituency trees using a DCG style grammar stemming from Marek Świdziński’s grammar “Gramatyka formalna języka polskiego” (1992). The...

MWELexicon

Lexicon of 55k multi-word lexical units linked to plWordNet, together with descriptions of their syntactic behaviour expressed in a constraint language (WCCL).

Inforex

Inforex is a web-based system designed for managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense...

4,412 datasets found