Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data - Dataset

With this collection of code and configuration files (contained in "LMIF", the Learned Metric Index Framework), outputs ("output-files") and datasets ("datasets"), we set out to explore whether a learned approach to building a metric index is a viable alternative to the traditional way of constructing metric indexes. Specifically, we build the index as a series of interconnected machine learning models. This collection serves as the basis for the reproducibility paper accompanying our parent paper, "Learned metric index – proposition of learned indexing for unstructured data" [1].

1. In "datasets" we make publicly available:
- a collection of 3 individual dataset descriptors: CoPhIR (1 million objects, 282 columns), Profimedia (1 million objects, 4096 columns), and MoCap (~350k objects, 4096 columns),
- "labels" obtained from a template index (M-tree or M-index),
- "queries" used to perform an experimental search, and
- "ground-truths" used to evaluate the approximate k-NN performance of the index.

Within "test" we include dummy data to ease the integration of any custom dataset that a reader may want to use with our solution (examples in "LMIF/*.ipynb"). In CoPhIR [2], each vector is obtained by concatenating five MPEG-7 global visual descriptors extracted from an image downloaded from Flickr. The Profimedia image dataset [3] contains Caffe visual descriptors extracted from photo-stock images by a convolutional neural network. MoCap (motion capture data) [4] descriptors contain sequences of 3D skeleton poses extracted from more than 3 hours of recordings capturing actors performing more than 70 different motion scenarios. The dataset's size is 43 GB upon decompression.

2. "LMIF" contains a user-friendly environment to reproduce the experiments in [1]. LMIF consists of three components:
- an implementation of the Learned Metric Index (distributed under the MIT license),
- a collection of scripts and configuration setups necessary for re-running the experiments in [1], and
- instructions for creating the reproducibility environment (Docker).

For a thorough description of "LMIF", please refer to our reproducibility paper, "Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data".

3. "output-files" contains the reproduced outputs for each experiment, with generated figures and a concise ".html" report (as presented in [1]).

[1] Antol, Matej, et al. "Learned metric index – proposition of learned indexing for unstructured data." Information Systems 100 (2021): 101774.
[2] Batko, Michal, et al. "Building a web-scale image similarity search system." Multimedia Tools and Applications 47.3 (2010): 599-629.
[3] Budikova, Petra, et al. "Evaluation platform for content-based image retrieval systems." International Conference on Theory and Practice of Digital Libraries. Springer, Berlin, Heidelberg, 2011.
[4] Müller, Meinard, et al. "Documentation mocap database hdm05." (2007).

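The description states that the "ground-truths" are used to evaluate the approximate k-NN performance of the index. As a minimal sketch of what such an evaluation can look like, the snippet below computes recall, one common measure for approximate k-NN search: the share of true nearest neighbours that the index actually returned. The function names, the dictionary layout and the toy object IDs are all hypothetical; the file format of "ground-truths" and the exact measures reported in [1] are not described here and are not assumed.

```python
from typing import Dict, Sequence


def knn_recall(retrieved: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Share of the true nearest neighbours that appear in the retrieved answer."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground truth must contain at least one object ID")
    return len(truth & set(retrieved)) / len(truth)


def mean_recall(results: Dict[str, Sequence[int]],
                ground_truths: Dict[str, Sequence[int]]) -> float:
    """Average recall over all queries that have a ground-truth answer."""
    # A query with no retrieved answer counts as recall 0.
    scores = [knn_recall(results.get(query_id, []), truth)
              for query_id, truth in ground_truths.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: query ID -> IDs of its k = 4 nearest neighbours.
toy_ground_truths = {"q1": [1, 2, 3, 4], "q2": [5, 6, 7, 8]}
toy_results = {"q1": [1, 2, 9, 4], "q2": [8, 7, 6, 5]}

print(knn_recall(toy_results["q1"], toy_ground_truths["q1"]))  # 0.75
print(mean_recall(toy_results, toy_ground_truths))             # 0.875
```

Recall is computed on sets, so the order in which the index returns objects does not affect the score; a rank-sensitive measure would need a different function.
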
This dataset is archived at DANS/EASY but is not accessible here. To view the list of files and access the files in this dataset, follow the DOI link listed under Identifier.

Identifier
DOI	https://doi.org/10.17632/8wp73zxr47.6
PID	https://nbn-resolving.org/urn:nbn:nl:ui:13-do-nj6k
Metadata Access	https://easy.dans.knaw.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:easy.dans.knaw.nl:easy-dataset:256222
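
The Metadata Access link above is an OAI-PMH "GetRecord" request. The sketch below only takes that URL apart into its query parameters and rebuilds it, using the Python standard library; it makes no network call, and whether the endpoint is still reachable is not checked here. The variable names are illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# URL copied verbatim from the "Metadata Access" field.
METADATA_URL = (
    "https://easy.dans.knaw.nl/oai?verb=GetRecord"
    "&metadataPrefix=oai_datacite"
    "&identifier=oai:easy.dans.knaw.nl:easy-dataset:256222"
)

parts = urlsplit(METADATA_URL)

# parse_qs returns a list per key; each key occurs once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

for key, value in params.items():
    print(f"{key} = {value}")

# Rebuild the request from its parameters; safe=":" keeps the colons in the
# identifier unescaped so the result matches the original string.
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(params, safe=':')}"
print(rebuilt == METADATA_URL)
```

Here "verb" selects the OAI-PMH operation, "metadataPrefix" names the requested metadata format, and "identifier" is the record's OAI identifier.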

Provenance
Creator	Slanináková, T
Publisher	Data Archiving and Networked Services (DANS)
Contributor	Terézia Slanináková
Publication Year	2022
Rights	info:eu-repo/semantics/openAccess; License: http://opensource.org/licenses/MIT
OpenAccess	true

Representation
Resource Type	Dataset
Discipline	Other