OK, Computer, what are these books about? - data files - Dataset

Dataset

OK, Computer, what are these books about? - data files

DOI

The core of this experiment is the use of the entity-fishing algorithm, as created and deployed by DARIAH. In the most simple terms: it scans texts for terms that can be linked to Wikipedia pages. Based on the algorithm, new keywords are added to the book descriptions, plus a list of relevant Wikipedia pages.For this experiment, the full text of 4125 books and chapters – available in the OAPEN Library – is scanned, resulting in a data file of over 25 million entries. In other words, on average the algorithm found roughly 6,100 ‘hits’ for each publication. When only the most common terms per publication are selected, does this result in a useful description of its content?The data file OK_Computer_results contains a list of open access books and chapters descriptions found in the OAPEN Library, combined with Wikipedia entries found using the entity-fishing algorithm, plus several actions to filter out only the terms which describe the publication best. Each book or chapter is available in the OAPEN Library (www.oapen.org), see the column HANDLE/The data file nerd_oapen_response_database contains the complete data set. The other text files contain R code to manipulate the file nerd_oapen_response_database.Description of nerd_oapen_response_database:The data is divided into the following columns:Data DescriptionOAPEN_ID Unique ID of the publication in the OAPEN LibraryrawName The entity as it appears in the textnerd_score Disambiguation confidence scorenerd_selection_score Selection confidence score, indicates how certain the disambiguated entity is actually valid for the text mentionwikipediaExternalRef ID of the Wikipedia pagewiki_URL URL of the Wikipedia pagetype NER class of the entitydomains Description of subject domainEach book may contain more than one occurrence of the same entity. The nerd_score and the nerd_selection_score may vary. This allows researchers to count the number of occurrences and use this as an additional method to assess the contents of the book. The OAPEN_ID refers to the identifier of the title in the OAPEN Library.For more information about the entity-fishing query processing service see https://nerd.readthedocs.io/en/latest/restAPI.html#response.

Date: 2020-06-03

Identifier
DOI	https://doi.org/10.17026/dans-2z4-mrgm
Metadata Access	https://ssh.datastations.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.17026/dans-2z4-mrgm

Provenance
Creator	R Snijder
Publisher	DANS Data Station Social Sciences and Humanities
Contributor	R. Snijder
Publication Year	2020
Rights	CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	R. Snijder (OAPEN Foundation)

Representation
Resource Type	Dataset
Format	text/plain; application/zip; text/csv
Size	1677; 2227; 2798; 18922; 2586236965; 10416908
Version	2.1
Discipline	Humanities