Replication Data for: "The category of throw verbs as productive source of the Spanish inchoative construction."


The dataset contains the quantitative data used to create the tables and graphics in the article "The category of throw verbs as productive source of the Spanish inchoative construction."

The 21st-century data originate from the Spanish Web Corpus (esTenTen18), accessed via Sketch Engine. Only the subcorpus for European Spanish data was selected. After downloading, the samples were manually cleaned. In the dataset, at most 500 tokens were retained per auxiliary. For the earlier centuries, the data were extracted from the Corpus Diacrónico del Español (CORDE). See Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific corpus queries that were used.

The data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), as well as for other criteria that are not taken into account in this study. Specifically, the variables 'Century', 'INF' (infinitive) and 'Class' were used as input for the analysis (see data-specific sections below for more information about the variables).
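As an illustration of how the three analysis variables could be tabulated, the following is a minimal sketch using only the Python standard library. The column names 'Century', 'INF' and 'Class' come from the description above; the sample rows, the class labels 'A' and 'B', and the function name are invented for the example and do not reflect the actual file contents or the ADESSE labels.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for the real CSV file (assumption:
# the file has a header row with the columns Century, INF and Class).
SAMPLE = """Century,INF,Class
21,llorar,A
21,correr,B
19,llorar,A
21,reír,A
"""


def class_counts_by_century(handle):
    """Count tokens per (Century, Class) pair from a CSV file handle."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["Century"], row["Class"])] += 1
    return counts


counts = class_counts_by_century(io.StringIO(SAMPLE))
for (century, sem_class), n in sorted(counts.items()):
    print(century, sem_class, n)
```

To run this against a real file, the `io.StringIO(SAMPLE)` handle would be replaced by an opened file; the delimiter and encoding of the published files may differ and should be checked first.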

The empirical analysis is based on data downloaded from the Spanish Web corpus (esTenTen18) (Kilgarriff & Renau 2013). The Spanish Web corpus contains 20.3 billion words, of which 3.5 billion belong to the European Spanish domain. This corpus contains internet data, with observations originating from forums, blogs, Wikipedia, etc. Only the subcorpus with European Spanish data was consulted. The following search syntax was used to detect the inchoative construction: “[lemma="echar"] [tag="R."]{0,3}"a"[tag="V."] within ” (consult Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for all corpus queries). After downloading, all observations were manually cleaned. After the removal of false positives, the dataset contains 5514 tokens in total, with a maximum of 500 tokens per auxiliary. False positives included tagging errors in which nouns such as Superman, Pokémon and Irán, among others, were wrongly coded as infinitives, as well as observations in which the auxiliary combined with the infinitive did not express the inchoative value but its original meaning, such as "saltar a nadar", which means “to jump to swim” and not “to start to swim”. For auxiliaries with fewer than 500 relevant tokens in the esTenTen corpus, all tokens were retained in the dataset; for auxiliaries with more than 500 tokens, only the first 500 were selected.
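The selection rule described above (keep every token for auxiliaries with fewer than 500 relevant tokens, otherwise keep only the first 500) can be sketched as follows. This is an illustrative reconstruction, not the authors' actual procedure: the function name, the `(auxiliary, text)` tuple layout and the toy input are assumptions made for the example.

```python
from collections import defaultdict


def cap_per_auxiliary(tokens, limit=500):
    """Keep at most `limit` tokens per auxiliary, preserving input order.

    `tokens` is an iterable of (auxiliary, text) pairs that have already
    been cleaned of false positives (hypothetical layout).
    """
    kept = []
    seen = defaultdict(int)
    for aux, text in tokens:
        if seen[aux] < limit:
            seen[aux] += 1
            kept.append((aux, text))
    return kept


# Toy input: 600 tokens for one auxiliary, 3 for another.
tokens = [("echar", f"token {i}") for i in range(600)]
tokens += [("tirar", f"token {i}") for i in range(3)]

kept = cap_per_auxiliary(tokens)
print(len(kept))
```

Because the loop preserves input order, "the first 500" here means the first 500 in the order the cleaned concordance lines are supplied.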

For this specific study on the throw verbs, only the following auxiliaries were retained: arrojar, disparar, echar, lanzar and tirar. For the diachronic data, the Corpus Diacrónico del Español (CORDE) was consulted. See Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific queries that were used to retrieve the data in CORDE.
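Restricting a larger set of annotated tokens to the five throw verbs could look like the sketch below. The list of auxiliaries is taken from the paragraph above; the field name "Aux", the function name and the sample rows are hypothetical, since the actual column holding the auxiliary is not named in this description.

```python
# The five auxiliaries retained for the study, as listed in the description.
THROW_VERBS = frozenset({"arrojar", "disparar", "echar", "lanzar", "tirar"})


def keep_throw_verbs(rows, aux_field="Aux"):
    """Return only the rows whose auxiliary lemma is one of the throw verbs.

    `aux_field` is a hypothetical column name; adjust it to the real header.
    """
    return [row for row in rows if row[aux_field].lower() in THROW_VERBS]


# Invented rows for illustration only.
rows = [
    {"Aux": "echar", "INF": "llorar"},
    {"Aux": "saltar", "INF": "nadar"},
    {"Aux": "Lanzar", "INF": "correr"},
]

selected = keep_throw_verbs(rows)
print([row["Aux"] for row in selected])
```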

Identifier
DOI https://doi.org/10.18710/TR2PWJ
Related Identifier https://doi.org/10.61430/FMWR9351
Metadata Access https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/TR2PWJ
Provenance
Creator Van Hulle, Sven; Enghels, Renata
Publisher DataverseNO
Contributor Van Hulle, Sven; Ghent University; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year 2024
Rights info:eu-repo/semantics/openAccess
OpenAccess true
Contact Van Hulle, Sven (Ghent University)
Representation
Resource Type Corpus data; Dataset
Format text/plain; text/comma-separated-values
Size 6578; 84948; 1445
Version 1.0
Discipline Humanities
Spatial Coverage Ghent University