-
ACTER (Annotated Corpora for Term Extraction Research) v1.4
The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised... -
ACTER (Annotated Corpora for Term Extraction Research) v1.3
The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised... -
Cleaned Polish Oscar corpus (64M lines)
Cleaned Polish Oscar corpus (part: 64M lines, 3.45 GB). Data was prepared with a few cleaning heuristics: - remove sentences shorter than - remove non-Polish sentences... -
Corpus - Wikinews
Corpus with texts on various topics from the world and technology. -
ChunkRel WS
ChunkRel-WS is a prototype service for recognition of three syntactic relations between chunks. The service may be run against plain text (input format: text), then the... -
Lilia
Sample of historical texts. -
Big Data language model - subword - BPE - ARPA
Big data language model built on subword units obtained with byte pair encoding (BPE), in ARPA format -
XLM-RoBERTa-LARGE events relation recognition
A set of basic language tools for the Polish language. Z4.2a Improving the quality of recognition of relations between events using Transformer-type deep networks. -
Blogi_zip 02
Blogs (zip archive) -
Wiki test - 34 categories
Wikipedia, 34 categories: a test set for the classifier -
PELCRA PARL corpus
The corpus comprises 50 sampled recordings (12 hours) and manual transcriptions (ca. 101 00 word tokens) of parliamentary data. -
Assamese Root Words
This list comprises Assamese root words. The size of the Assamese Root Word List is 15,750 words. These Assamese NLP resources including the Tools and Applications are... -
Enriched corpus of [Polish] frequency dictionary
Enriched corpus of the frequency dictionary, cf. http://clip.ipipan.waw.pl/PL196x?action=AttachFile&do=view&target=wksf.pdf -
Cyfry
A small spoken digits corpus in Polish. Contains 488 recordings of 25 speakers reading 20 digits (0-9) each. Amounts to around 76 minutes of recordings. Split into train (~72%),... -
Polimorf
PoliMorf is a morphological dictionary for Polish resulting from the standardization and merger of Morfeusz SGJP and Morfologik. The present version includes extended... -
KPWr annotation guidelines - phrase lemmatization
Annotation guidelines for manual phrase lemmatisation in KPWr (Polish Corpus of Wrocław University of Technology). -
Wizerunek Andreja Babiša i Mateusza Morawieckiego w kontekście sytuacji kryzy...
A collection of articles from the Czech press about Mateusz Morawiecki (iDnes) and from the Polish press about Andrej Babiš (Rzeczpospolita) -
Świgra — a parser of Polish
Świgra is a parser of Polish generating constituency trees using a DCG style grammar stemming from Marek Świdziński’s grammar “Gramatyka formalna języka polskiego” (1992). The... -
MWELexicon
Lexicon of 55k multi-word lexical units linked to plWordNet, together with a description of their syntactic behaviour expressed in a constraint language (WCCL). -
Inforex
Inforex is a web-based system designed for managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense...