In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli et al., 2017), the semantic vectors of the individual constituents are combined and enriched by position-specific information for each constituent in its role as either modifier or head. Most recently, there have been attempts to include vision-based embeddings in these models (Günther et al., 2020b), using the linear architecture implemented in the CAOSS model. In the present paper, we extend this line of research and demonstrate that moving to non-linear models improves the results for vision, while linear models remain a good choice for text. Simply concatenating text and vision vectors does not yet improve the prediction of human behavioral data over models using text- and vision-based measures separately.
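The CAOSS composition scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a compound vector is predicted as c = Mu + Hv, where u and v are the modifier and head embeddings and M and H are position-specific weight matrices estimated from observed compound vectors. The toy dimensions, random data, and least-squares fit are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                   # toy embedding size and number of training compounds

U = rng.standard_normal((n, d))  # modifier constituent vectors (toy data)
V = rng.standard_normal((n, d))  # head constituent vectors (toy data)
C = rng.standard_normal((n, d))  # observed compound vectors (toy targets)

# Stack [u; v] per compound and solve C ≈ [U V] W for W = [M; H]
# by ordinary least squares (an illustrative stand-in for the
# regularized training used in practice).
X = np.hstack([U, V])                          # shape (n, 2d)
W, *_ = np.linalg.lstsq(X, C, rcond=None)      # shape (2d, d)
M, H = W[:d], W[d:]                            # position-specific matrices, each (d, d)

# Compose a new compound from a modifier u and a head v:
u, v = rng.standard_normal(d), rng.standard_normal(d)
c_hat = u @ M + v @ H                          # predicted compound vector, shape (d,)
```

Because M and H are distinct, the same word contributes differently as modifier versus head (e.g. *house boat* vs. *boat house*), which is the position-specific enrichment the abstract refers to.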
Date Submitted: 2021-08-15