SemFi: Finnish Semantics with Syntactic Relations

Context This dataset is covered in detail in the following publication:

Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15)

Content This SQLite database (available in …) contains Finnish lemmas and their frequencies in a large corpus. The lemmas are linked to each other by the syntactic relations in which they occurred together in the corpus, and the frequency of each syntactic relation between two words is recorded as well. This makes it possible to see, for example, how frequently koira (dog) has appeared in a subject relation with haukkua (bark).

Feel free to dive into the database, but remember that an easy programmatic interface to the data is provided in UralicNLP.
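The table layout of the SQLite file is not described on this page, so a reasonable first step when exploring it directly is to list its tables and columns. The sketch below does only that, using Python's standard sqlite3 module. The function name describe_schema and the demo table are hypothetical and exist only to make the example runnable; they are not part of SemFi.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [column names]} for every table in the database."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Demo on an in-memory database with a made-up table (NOT the real SemFi schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_words (lemma TEXT, frequency INTEGER)")
print(describe_schema(demo))
demo.close()
```

To inspect the actual data, open the downloaded database file with sqlite3.connect and the path to that file instead of ":memory:", then pass the connection to the same function.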

The data used to build the SQLite database is also available in JSON format. The files are divided according to part-of-speech. For example, A.json has a dictionary of adjectives. Under each adjective, there is a dictionary of syntactic relations. In each syntactic relation you will find the part-of-speech tags of the words that have been linked to the top-level word with a given syntactic relation. Under each part-of-speech, you will find the related words and the frequency of the relation between the two words.
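The nesting described above (word, then syntactic relation, then part-of-speech, then related word with its frequency) can be illustrated with a small sketch. The sample data is made up: the relation labels ("amod", "advmod"), the part-of-speech tags, the words, and the counts are assumptions chosen for illustration, and top_related is a hypothetical helper. Only the four-level structure comes from the description.

```python
import json

# Tiny invented sample that mimics the described layout of a file such as A.json:
# adjective -> syntactic relation -> part-of-speech -> related word -> frequency.
SAMPLE = json.loads("""
{
  "punainen": {
    "amod": {
      "N": {"talo": 12, "auto": 7, "ruusu": 3}
    },
    "advmod": {
      "Adv": {"hyvin": 2}
    }
  }
}
""")


def top_related(data, lemma, relation, pos, n=5):
    """Return up to n (word, frequency) pairs, most frequent first.

    Missing lemmas, relations, or part-of-speech tags give an empty list.
    """
    related = data.get(lemma, {}).get(relation, {}).get(pos, {})
    return sorted(related.items(), key=lambda item: item[1], reverse=True)[:n]


print(top_related(SAMPLE, "punainen", "amod", "N", n=2))
```

With a real file, the sample would be replaced by json.load on the opened file, assuming the frequencies are stored as numbers.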

Notice that compounds may have a | to indicate a word boundary.
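When matching these entries against ordinary text, the boundary marker usually needs to be handled explicitly. A minimal sketch, assuming the | appears only between the parts of a compound; the helper names and the example word are illustrative and not taken from the data:

```python
def split_compound(lemma):
    """Split a lemma into its parts at each | word-boundary marker."""
    return lemma.split("|")


def plain_form(lemma):
    """Return the lemma with the | boundary markers removed."""
    return lemma.replace("|", "")


example = "koiran|koppi"  # illustrative compound, not guaranteed to be in the data
print(split_compound(example))
print(plain_form(example))
```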

Inspiration This data has been used to generate poems in Finnish [1] and visualize word usage during a dictionary editing process [2].

[1] Hämäläinen, M. (2018). Harnessing NLG to Create Finnish Poetry Automatically. In F. Pachet, A. Jordanous, & C. León (Eds.), Proceedings of the Ninth International Conference on Computational Creativity (pp. 9-15). Salamanca: Association for Computational Creativity (ACC).

[2] Hämäläinen, M., & Rueter, J. (2019). An Open Online Dictionary for Endangered Uralic Languages. In I. Kosem, T. Zingano Kuhn, M. Correia, J. P. Ferreira, M. Jansen, I. Pereira, J. Kallas, M. Jakubíček, S. Krek, … C. Tiberius (Eds.), Electronic lexicography in the 21st century: Proceedings of the eLex 2019 conference (pp. 819-830). (Electronic lexicography in the 21st century). Brno: Lexical Computing CZ s.r.o.

Metadata Access
Creator Hämäläinen, Mika
Publisher CLARIN
Publication Year 2020
Rights info:eu-repo/semantics/openAccess; CC BY 4.0
OpenAccess true
Discipline Linguistics