ORCAS-I

Dataset

ORCAS-I is a version of the ORCAS dataset (Craswell et al., 2020) annotated with user intents using weak supervision. It allows you to train your algorithm on various types of user intents. The intents are based on Broder's (2002) classification: informational, navigational and transactional. We refined this classification by adding two subcategories inside the informational category: factual and instrumental. If a query did not get any label inside the informational category, it was classified as abstain.

ORCAS-I consists of the following files:

a) ORCAS-I-18M.tsv
The complete ORCAS dataset, which contains 18 million unique query-URL pairs.

dataset size: 18,823,602
unique queries: 10,405,339
unique URLs: 1,422,029
unique domains: 241,199

b) ORCAS-I-2M.tsv
A 2M subset of ORCAS-I-18M.tsv that we used for our experiments with different machine learning algorithms.

dataset size: 2,000,000
unique queries: 1,796,652
unique URLs: 618,679
unique domains: 126,001

Both ORCAS-I-18M and ORCAS-I-2M contain the following columns:

qid: the ID of the query
query: the text of the query
url: the URL that the user clicked
did: the document from the TREC Deep Learning track that the URL leads to
level_1: first level of annotation, which has three top-level categories: informational, navigational and transactional
level_2: second level of annotation (only classifies according to the factual and instrumental categories, so all other intents in this column are classified as abstain)
label: final intent label. Provides the annotation for the informational, navigational and transactional categories and also for the factual, instrumental and abstain subcategories
data_split: either 'train' or 'validation', corresponding to the split used during the original experiments

You can train your classifier either on the 3 top-level categories (column 'level_1') or on the full taxonomy (column 'label').

c) ORCAS-I-gold.tsv
A test file that contains 1,000 randomly selected queries from the full dataset (they are excluded from the 2M sample). These queries were manually annotated by two IR specialists.

dataset size: 1,000
unique queries: 1,000
unique URLs: 995
unique domains: 700

ORCAS-I-gold contains the following columns:

qid: the ID of the query
query: the text of the query
url: the URL that the user clicked
did: the document from the TREC Deep Learning track that the URL leads to
label_manual: the manually annotated intent
data_split: always equal to 'test'
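
As a minimal sketch of how the ORCAS-I-18M and ORCAS-I-2M columns described above might be used, the following Python snippet reads tab-separated rows and groups (query, intent) pairs by data_split, with either 'level_1' or 'label' as the training target. It assumes the files start with a header row carrying the column names listed above, which this record does not confirm. The sample rows, the helper names read_rows and split_examples, and the exact label strings are invented for illustration.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for ORCAS-I-2M.tsv. These rows are invented for
# illustration; the header row and the exact label strings are assumptions.
SAMPLE_TSV = (
    "qid\tquery\turl\tdid\tlevel_1\tlevel_2\tlabel\tdata_split\n"
    "1\thow to tie a tie\thttps://example.com/a\tD1\tinformational\tinstrumental\tinstrumental\ttrain\n"
    "2\tcapital of france\thttps://example.com/b\tD2\tinformational\tfactual\tfactual\ttrain\n"
    "3\tfacebook login\thttps://example.com/c\tD3\tnavigational\tabstain\tnavigational\tvalidation\n"
    "4\tbuy running shoes\thttps://example.com/d\tD4\ttransactional\tabstain\ttransactional\ttrain\n"
)


def read_rows(handle):
    """Read tab-separated rows into dictionaries keyed by column name."""
    # QUOTE_NONE keeps quote characters inside queries as literal text.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def split_examples(rows, target="label"):
    """Group (query, intent) pairs by the value of the data_split column.

    target is 'level_1' for the 3 top-level categories or 'label' for the
    full taxonomy.
    """
    splits = {}
    for row in rows:
        splits.setdefault(row["data_split"], []).append((row["query"], row[target]))
    return splits


rows = read_rows(io.StringIO(SAMPLE_TSV))
coarse = split_examples(rows, target="level_1")  # top-level categories
fine = split_examples(rows, target="label")      # full taxonomy
label_counts = Counter(intent for _, intent in fine["train"])
```

For the real files, the in-memory sample would be replaced by an open file handle, for example open("ORCAS-I-2M.tsv", newline="", encoding="utf-8"), where the encoding is also an assumption. Rows from ORCAS-I-gold.tsv could be read the same way, taking 'label_manual' as the reference intent for evaluation.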

Identifier
DOI	https://doi.org/10.48436/pp7xz-n9a06
Related Identifier	IsSupplementTo https://doi.org/10.1145/3477495.3531737
Related Identifier	Continues https://doi.org/10.1145/792550.792552
Related Identifier	IsVersionOf https://doi.org/10.48436/16vvs-8ew70
Metadata Access	https://researchdata.tuwien.ac.at/oai2d?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:researchdata.tuwien.ac.at:pp7xz-n9a06

Provenance
Creator	Kusa, Wojciech ; Alexander, Daria ; de Vries, Arjen P.
Publisher	TU Wien
Publication Year	2022
Rights	Creative Commons Attribution 4.0 International; https://creativecommons.org/licenses/by/4.0/legalcode
OpenAccess	true
Contact	tudata(at)tuwien.ac.at

Representation
Language	English
Resource Type	Dataset
Version	1.0.0
Discipline	Other