Words Algorithm Collection : finding closely related open access books using text mining techniques

Dataset

DOI

Open access platforms and retail websites have one thing in common: both try to present the most relevant offerings possible to their patrons. Retail websites, such as Amazon.com, deploy recommender systems based on data collected about their customers. These systems improve with the amount of data available: the more is known about the customers, the better the system can predict what other merchandise will appeal.

Recommender systems are successful, but using open access platforms to track people is not acceptable. Therefore, a different solution is needed. Compared to retail websites, open access platforms have a unique advantage: they are able to use the complete contents of the publications they host. So the question arises whether it is possible to create a recommender system based on the contents of freely available documents instead of personal data.

The solution described in this paper is based on standard open source software. It is built using a combination of DSpace 6 and the R programming language. The open access platform, based on DSpace 6, is the OAPEN Library; the data set used consists of nearly 11,000 open access books and chapters. The OAPEN Library enables data extraction through an API (application programming interface). A text mining algorithm written in the R programming language uses the full text of the publications and filters out the most common combinations of three words (trigrams). The next step is finding the publications that have one or more trigrams in common. The more trigrams two books or chapters share, the more closely they are 'connected'. This allows us not just to find related titles for each publication, but also to quantify how closely they are connected.
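The trigram matching described in the abstract can be illustrated with a minimal sketch. It is written in Python for illustration only; the original algorithm is written in R and is not reproduced here. The function names, the tokenization rule (lowercase, letters only), the cutoff `n`, and the three sample texts are all assumptions, and "filters out" is read here as "selects the most common trigrams".

```python
import re
from collections import Counter
from itertools import combinations


def trigrams(text):
    """Return all word trigrams in text (assumed tokenization: lowercase, letters only)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def top_trigrams(text, n=20):
    """Return the set of the n most frequent trigrams (n is a hypothetical cutoff)."""
    counts = Counter(trigrams(text))
    return {tri for tri, _ in counts.most_common(n)}


def shared_trigram_counts(docs, n=20):
    """Map each pair of document ids to the number of top trigrams they share."""
    tops = {doc_id: top_trigrams(text, n) for doc_id, text in docs.items()}
    links = {}
    for a, b in combinations(sorted(tops), 2):
        shared = len(tops[a] & tops[b])
        if shared:  # keep only pairs with at least one trigram in common
            links[(a, b)] = shared
    return links


# Made-up sample texts, not taken from the OAPEN Library.
docs = {
    "book_a": "open access books are free. open access books are shared.",
    "book_b": "readers like open access books because open access books are free.",
    "book_c": "text mining finds patterns in large collections of documents.",
}
links = shared_trigram_counts(docs)
print(links)
```

In this toy example only `book_a` and `book_b` are linked, and the count stored for the pair is the measure of how closely they are connected. Comparing every pair is the simplest approach; for a collection of nearly 11,000 titles, an index from trigram to publications would likely be more practical.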

Date: 2021-02-24

Date Submitted: 2021-02-24

Identifier
DOI	https://doi.org/10.17026/dans-xbm-qr5e
Metadata Access	https://phys-techsciences.datastations.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.17026/dans-xbm-qr5e

Provenance
Creator	R. Snijder
Publisher	DANS Data Station Phys-Tech Sciences
Contributor	R. Snijder
Publication Year	2021
Rights	CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	R. Snijder (OAPEN Foundation)

Representation
Resource Type	Dataset
Format	text/csv; text/plain; application/vnd.oasis.opendocument.spreadsheet; application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; application/zip
Size	244483; 18862; 28269947; 16859325; 30154; 17377
Version	2.0
Discipline	Other