Dataset - B2FIND

The Model latinpipe-evalatin24-240520 for LatinPipe 2024

The latinpipe-evalatin24-240520 is a PhilBerta-based model for LatinPipe 2024 (https://github.com/ufal/evalatin2024-latinpipe), performing tagging, lemmatization, and dependency...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C for short) is a consolidated...

CALEM (Comprehensive Arabic LEMmas)

Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical...

UDPipe 2

UDPipe 2 is a POS tagger, lemmatizer, and dependency parser. Compared to UDPipe 1, UDPipe 2 is Python-only and tested only on Linux; UDPipe 2 is meant as a research tool,...

Czech Morphological Analyzer v1

One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.

Prague Dependency Treebank 3.5

The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied...

EvaLatin 2020 models for UDPipe 2 (2020-08-31)

POS Tagger and Lemmatizer models for EvaLatin2020 data (https://github.com/CIRCSE/LT4HALA). The model documentation including performance can be found at...

KPWr annotation guidelines - phrase lemmatization

Annotation guidelines for manual phrase lemmatisation in KPWr (Polish Corpus of Wrocław University of Technology).

PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...

The task consists in developing a tool for the lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines...

KPWr annotation guidelines - named entity and phrase lemmatization 2.0

Guidelines for named entity and multi-word phrase lemmatization used in KPWr (Polish Corpus of Wrocław University of Technology).

ENIAMtoolkit (2017-03-06)

ENIAMtoolkit is a collection of libraries that: perform tokenization, lemmatization, and part-of-speech tagging; detect MWEs and abbreviations; split text into sentences; LCG...

KPWr dump r240

Dump of the Polish Corpus of Wrocław University of Technology (KPWr) containing a set of documents annotated with named entities and keywords.

PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...

The task consists in developing a tool for lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines...

ENIAMtoolkit

ENIAMtoolkit is a collection of libraries that perform tokenization, lemmatization, and part-of-speech tagging; detect MWEs and abbreviations; and split text into sentences.

Beserman multimedia corpus

This deposit contains transcriptions of monologues and conversations in spoken Beserman (formerly classified as a dialect of Udmurt, ISO 639-2 code...

35 datasets found