Combining text and vision in compound semantics: Towards a cognitively plausible multimodal model

In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli et al., 2017), the semantic vectors of the individual constituents are combined and enriched by position-specific information for each constituent in its role as either modifier or head. Most recently, there have been attempts to include vision-based embeddings in these models (Günther et al., 2020b), using the linear architecture implemented in the CAOSS model. In the present paper, we extend this line of research and demonstrate that moving to non-linear models improves the results for vision, while linear models remain a good choice for text. Simply concatenating text and vision vectors does not yet improve the prediction of human behavioral data over models using text- and vision-based measures separately.
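To make the composition scheme concrete, the sketch below illustrates the linear CAOSS combination, a simple non-linear variant, and the concatenation of text- and vision-based vectors described in the abstract. All dimensions, the random stand-in matrices M and H, and the tanh non-linearity are assumptions for illustration; in the actual CAOSS model, M and H are estimated by regression so that the composed vector approximates the corpus-derived vector of attested compounds.

import numpy as np

# Assumed embedding dimensionality; the real value depends on the corpus model.
d = 300
rng = np.random.default_rng(0)

u = rng.normal(size=d)  # modifier vector, e.g. "butter"
v = rng.normal(size=d)  # head vector, e.g. "fly"

# Position-specific weight matrices (random stand-ins here; in CAOSS they are
# learned so that M @ u + H @ v approximates the observed compound vector).
M = rng.normal(size=(d, d)) / np.sqrt(d)
H = rng.normal(size=(d, d)) / np.sqrt(d)

# Linear CAOSS composition: the compound vector is the sum of the two
# role-transformed constituent vectors.
c_linear = M @ u + H @ v

# A minimal non-linear variant (assumed architecture): an element-wise
# non-linearity applied to the same role-specific combination.
c_nonlinear = np.tanh(M @ u + H @ v)

# Naive multimodal combination discussed in the abstract: concatenating a
# text-based and a vision-based compound vector (v_vision is a placeholder).
v_text = c_linear
v_vision = rng.normal(size=d)
c_multimodal = np.concatenate([v_text, v_vision])

This is only a schematic of the composition step; the paper's finding is that the non-linear variant helps for vision-based embeddings, while the linear form remains adequate for text.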

Date Submitted: 2021-08-15

Identifier
DOI https://doi.org/10.17026/dans-xdp-3qhj
Metadata Access https://phys-techsciences.datastations.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.17026/dans-xdp-3qhj
Provenance
Creator Abhijeet Gupta
Publisher DANS Data Station Phys-Tech Sciences
Contributor Abhijeet Gupta
Publication Year 2021
Rights CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Contact Abhijeet Gupta
Representation
Resource Type Dataset
Format text/plain; application/zip
Size 7138; 13401; 138001; 8066
Version 1.0
Discipline Other