Replication Data for: A Study on the Conceptual Structure of the Use of Prepositions in the Complement of Goal-oriented Motion Verbs in Brazilian Portuguese

This dataset contains replication data for the study “A Study on the Conceptual Structure of the Use of Prepositions in the Complement of Goal-oriented Motion Verbs in Brazilian Portuguese”. In that study, based on a usage-feature analysis and using corpus-based and multivariate statistical methods, we analyze the use of the prepositions a ‘at’, para ‘to’, and em ‘in’ to introduce the complement of ir ‘to go’, vir ‘to come’ and chegar ‘to arrive’ in Brazilian Portuguese (BP). The results show a tendency for a ‘at’ to be used in the most formal and monitored register. The factors ‘profiling’ and ‘verb’ are the most important language-internal predictors. Action and neutral profiled events are more associated with the use of para ‘to’, while locative profiled events are more associated with the use of em ‘in’. The verb chegar ‘to arrive’ is more associated with the use of em ‘in’. We highlight that (i) the variation investigated has a cognitive basis, in addition to the linguistic and extralinguistic factors pointed out by previous studies, and (ii) the variation of prepositions conveys alternative construals; thus, the very high frequency of em ‘in’ with goal-oriented motion verbs indicates nuances of meaning motivated by the superimposition of image schemas and the cognitive operation of profiling. The dataset consists of 459 occurrences of goal-oriented motion verbs in Brazilian Portuguese (ir ‘to go’, vir ‘to come’ or chegar ‘to arrive’), manually annotated according to a set of linguistic, social and cognitive factors.
Data were extracted from four BP corpora: (i) C-Oral-Brasil (263,000 words), which includes transcripts of spontaneous speech; (ii) Blogs_Foruns (263,772 words), which includes informal written language from BP forums; (iii) TecEM (234,717 words), which includes texts written by teenage students during their BP classes in high school; and (iv) Corpus Brasileiro (CB), from which only the texts classified as journalistic (250,700,829 words), drawn from Brazilian newspapers, were used. The archive contains the data in a Unicode-encoded text file (Dataset_Motion_Verb_Prep.csv), the statistical analysis script in a text file (R_script_Motion_Verbs_Prep.txt), and a readme in a text file (00_readme.txt).

Methodological information: The four subcorpora were selected to obtain a sample of occurrences with different levels of monitoring, representing a continuum from formal written texts (newspaper/CB) to spontaneous speech (C-Oral), passing through school texts (TecEM) and informal written texts (Blogs_Foruns). The study considered only constructions with the structure “verb (ir ‘to go’, vir ‘to come’ or chegar ‘to arrive’) + {0 up to 3 words} + preposition (a ‘at’, para ‘to’ or em ‘in’) + complement”. First, a random sample of 300 occurrences from each subcorpus was generated through a concordance search; then, the tokens that did not meet the inclusion criteria were excluded (e.g., sentences in which a was an article meaning ‘the’ rather than a preposition meaning ‘at’, and sentences in which the motion verb was an auxiliary rather than the main verb). The final dataset consists of 459 tokens.
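
The search structure described above can be illustrated with a short Python sketch. It is not the query used in the study: the record does not give the concordance tool or the exact query, and the lists of verb and preposition forms below (including contracted forms such as no, na, ao, pra) are hypothetical and deliberately incomplete. The function name find_candidates is likewise an assumption made for illustration. The sketch only shows how a pattern of the form "verb + 0 to 3 words + preposition + complement" might be expressed as a regular expression.

```python
import re

# Hypothetical, incomplete lists of surface forms (assumptions for
# illustration only; the study's actual query is not documented here).
VERB_FORMS = ["ir", "vou", "vai", "vamos", "vão", "foi", "fui",
              "vir", "vem", "vim", "veio",
              "chegar", "chego", "chega", "chegou", "chegaram"]
PREP_FORMS = ["a", "ao", "à", "aos", "às",
              "para", "pra", "pro",
              "em", "no", "na", "nos", "nas"]

verbs = "|".join(re.escape(form) for form in VERB_FORMS)
preps = "|".join(re.escape(form) for form in PREP_FORMS)

# verb + {0 up to 3 words} + preposition + first word of the complement.
# The gap is matched lazily so the earliest preposition after the verb wins.
PATTERN = re.compile(
    rf"\b({verbs})\b((?:\s+\w+){{0,3}}?)\s+({preps})\b\s+(\w+)",
    re.IGNORECASE,
)


def find_candidates(text):
    """Return (verb, gap_words, preposition, complement_head) tuples."""
    hits = []
    for match in PATTERN.finditer(text):
        verb, gap, prep, head = match.groups()
        hits.append((verb.lower(), gap.split(), prep.lower(), head))
    return hits


if __name__ == "__main__":
    for sentence in ["Ele chegou ontem na escola.", "Vou para casa."]:
        print(find_candidates(sentence))
```

A pattern like this over-generates: it cannot tell the preposition a from the article a, nor a main verb from an auxiliary, which is why candidate tokens still need the manual filtering described above.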

C-Oral-Brasil (263,000 words) includes transcripts of spontaneous Brazilian Portuguese speech. Blogs_Foruns (263,772 words) includes informal written language from Brazilian Portuguese forums. TecEM (234,717 words) includes texts written by teenage students during their Brazilian Portuguese classes in high school. Corpus Brasileiro is a collection of approximately one billion words of Brazilian Portuguese. C-Oral-Brasil, Blogs_Foruns and TecEM are from the 2010s; CB, which is presented as a contemporary BP corpus, started to be compiled in the 2000s.

All sources are open access.

Identifier
DOI https://doi.org/10.18710/D1YHEA
Related Identifier https://doi.org/10.1163/23526416-bja10041
Metadata Access https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/D1YHEA
Provenance
Creator Gil, Maitê; Silva, Augusto Soares da
Publisher DataverseNO
Contributor Gil, Maitê; Universidade Católica Portuguesa; The Tromsø Repository of Language and Linguistics
Publication Year 2023
Rights CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Contact Gil, Maitê (Universidade do Minho)
Representation
Resource Type annotated data; Dataset
Format text/plain; text/csv
Size 8446; 123614; 3184
Version 1.0
Discipline Humanities