Combining text and vision in compound semantics: Towards a cognitively plausible multimodal model

In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli et al., 2017), the semantic vectors of the individual constituents are combined and enriched by position-specific information for each constituent in its role as either modifier or head. Most recently there have been attempts to include vision-based embeddings in these models (Günther et al., 2020b), using the linear architecture implemented in the CAOSS model. In the present paper, we extend this line of research and demonstrate that moving to non-linear models improves the results for vision, while linear models are a good choice for text. Simply concatenating text and vision vectors does not (yet) improve the prediction of human behavioral data over models using text- and vision-based measures separately.
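The linear composition and the concatenation baseline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensionality, the random matrices M and H (stand-ins for CAOSS's trained position-specific matrices), and the vision vector are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 300  # embedding dimensionality (illustrative)

# Hypothetical position-specific matrices: M applies to the constituent
# in the modifier slot, H to the constituent in the head slot. In CAOSS
# these are learned from corpus compounds; here they are random stand-ins.
M = rng.standard_normal((d, d)) / np.sqrt(d)
H = rng.standard_normal((d, d)) / np.sqrt(d)

def caoss_compose(u, v):
    """Linear CAOSS-style composition: c = M @ u + H @ v,
    where u is the modifier vector and v is the head vector."""
    return M @ u + H @ v

# Toy constituent vectors, e.g. "home" (modifier) and "phone" (head).
u = rng.standard_normal(d)
v = rng.standard_normal(d)

c_text = caoss_compose(u, v)

# Concatenating a text-based and a vision-based compound vector, as in
# the multimodal condition mentioned above (vision vector hypothetical).
c_vision = rng.standard_normal(d)
c_multimodal = np.concatenate([c_text, c_vision])
```

The concatenated vector simply doubles the dimensionality; as the abstract notes, this straightforward fusion does not yet outperform the unimodal measures on behavioral data.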

Identifier
DOI https://doi.org/10.17026/dans-xdp-3qhj
PID https://nbn-resolving.org/urn:nbn:nl:ui:13-gb-vwz6
Metadata Access https://easy.dans.knaw.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:easy.dans.knaw.nl:easy-dataset:219229
Provenance
Creator Gupta, Abhijeet
Publisher Data Archiving and Networked Services (DANS)
Contributor Gupta, Abhijeet; Dr Abhijeet Gupta (Heinrich-Heine-Universität Düsseldorf)
Publication Year 2021
Rights info:eu-repo/semantics/openAccess; License: http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Representation
Language English
Resource Type Dataset
Format text/plain
Discipline Other