OK, Computer, what are these books about? - data files

The core of this experiment is the use of the entity-fishing algorithm, as created and deployed by DARIAH. In the simplest terms: it scans texts for terms that can be linked to Wikipedia pages. Based on the algorithm's output, new keywords are added to the book descriptions, plus a list of relevant Wikipedia pages.

For this experiment, the full text of 4125 books and chapters, all available in the OAPEN Library, was scanned, resulting in a data file of over 25 million entries. In other words, on average the algorithm found roughly 6,100 'hits' for each publication. When only the most common terms per publication are selected, does this result in a useful description of its content?

The data file OK_Computer_results contains a list of open access book and chapter descriptions found in the OAPEN Library, combined with Wikipedia entries found using the entity-fishing algorithm, plus several actions to filter out only the terms which describe the publication best. Each book or chapter is available in the OAPEN Library (www.oapen.org); see the column HANDLE.

The data file nerd_oapen_response_database contains the complete data set. The other text files contain R code to manipulate the file nerd_oapen_response_database.

Description of nerd_oapen_response_database:

The data is divided into the following columns:

Data                    Description
OAPEN_ID                Unique ID of the publication in the OAPEN Library
rawName                 The entity as it appears in the text
nerd_score              Disambiguation confidence score
nerd_selection_score    Selection confidence score; indicates how certain it is that the disambiguated entity is actually valid for the text mention
wikipediaExternalRef    ID of the Wikipedia page
wiki_URL                URL of the Wikipedia page
type                    NER class of the entity
domains                 Description of subject domain

Each book may contain more than one occurrence of the same entity, and the nerd_score and the nerd_selection_score may vary between occurrences. This allows researchers to count the number of occurrences and use this as an additional method to assess the contents of the book. The OAPEN_ID refers to the identifier of the title in the OAPEN Library.

For more information about the entity-fishing query processing service see https://nerd.readthedocs.io/en/latest/restAPI.html#response.
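
The counting idea described above can be illustrated with a small sketch. It is not the R code shipped with the data set; it is a Python illustration under stated assumptions: the file is assumed to be comma-separated with a header row carrying the column names listed above, the sample rows are invented, and the function name top_entities, the 0.5 score cut-off, and the top_n limit are hypothetical choices made for the example.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows, for illustration only. The column names follow the
# description above; the comma delimiter and header row are assumptions
# about the real nerd_oapen_response_database file.
SAMPLE = """\
OAPEN_ID,rawName,nerd_score,nerd_selection_score,wikipediaExternalRef,wiki_URL,type,domains
1001,Amsterdam,0.9,0.8,844,https://example.org/wiki/844,LOCATION,
1001,Amsterdam,0.8,0.7,844,https://example.org/wiki/844,LOCATION,
1001,the city of Amsterdam,0.7,0.6,844,https://example.org/wiki/844,LOCATION,
1001,printing,0.6,0.55,2001,https://example.org/wiki/2001,,
1001,printing,0.6,0.9,2001,https://example.org/wiki/2001,,
1001,Paris,0.4,0.2,3001,https://example.org/wiki/3001,LOCATION,
1002,printing,0.7,0.8,2001,https://example.org/wiki/2001,,
"""


def top_entities(rows, min_selection_score=0.5, top_n=2):
    """Count entity occurrences per publication and keep the most common.

    Rows below the (hypothetical) selection-score cut-off are skipped.
    Occurrences are grouped by wikipediaExternalRef, so different surface
    forms (rawName) of the same Wikipedia page are counted together.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if float(row["nerd_selection_score"]) < min_selection_score:
            continue
        counts[row["OAPEN_ID"]][row["wikipediaExternalRef"]] += 1
    return {book: counter.most_common(top_n) for book, counter in counts.items()}


result = top_entities(csv.DictReader(io.StringIO(SAMPLE)))
for book, entities in result.items():
    print(book, entities)
```

To try this on the real file, the io.StringIO(SAMPLE) argument could be replaced with an opened file handle, after checking the actual delimiter and header. Given the file size, reading it row by row as done here avoids loading all entries into memory at once.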

Date: 2020-06-03

Identifier
DOI https://doi.org/10.17026/dans-2z4-mrgm
Metadata Access https://ssh.datastations.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.17026/dans-2z4-mrgm
Provenance
Creator R Snijder
Publisher DANS Data Station Social Sciences and Humanities
Contributor R. Snijder
Publication Year 2020
Rights CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess true
Contact R. Snijder (OAPEN Foundation)
Representation
Resource Type Dataset
Format text/plain; application/zip; text/csv
Size 1677; 2227; 2798; 18922; 2586236965; 10416908
Version 2.1
Discipline Humanities