Words Algorithm Collection : finding closely related open access books using text mining techniques

Open access platforms and retail websites have one thing in common: both try to present the most relevant offerings possible to their patrons. Retail websites, such as Amazon.com, deploy recommender systems based on data collected about their customers. These systems improve with the amount of data available: the more that is known about the customers, the better the system can predict what other merchandise will appeal.

Recommender systems are successful, but using open access platforms to track people is not acceptable. Therefore, a different solution is needed. Compared to retail websites, open access platforms have a unique advantage: they are able to use the complete contents of the publications they host. So the question arises whether it is possible to create a recommender system based on the contents of freely available documents, instead of personal data.

The solution described in this paper is based on standard open source software. It is built using a combination of DSpace 6 and the R programming language. The open access platform, based on DSpace 6, is the OAPEN Library; the data set used consists of nearly 11,000 open access books and chapters. The OAPEN Library enables data extraction through an API (application programming interface). A text mining algorithm written in the R programming language uses the full text of the publications and extracts the most common combinations of three words (trigrams). The next step is finding the publications that have one or more trigrams in common. The more trigrams two books or chapters share, the more closely they are 'connected'. This allows us not just to find related titles for each publication, but also to quantify how closely they are connected.
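
The trigram matching described above can be illustrated with a minimal sketch. It is written in Python for illustration only, not in the R used for the actual work, and it is not the author's implementation. The function names, the tokenization rule, the cut-off `k` and the sample texts are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations


def top_trigrams(text, k=10):
    """Return the set of the k most frequent three-word sequences in text.

    Assumption: words are lowercase runs of letters; the source does not
    specify tokenization, stop word handling, or the value of k.
    """
    words = re.findall(r"[a-z]+", text.lower())
    trigrams = zip(words, words[1:], words[2:])
    counts = Counter(trigrams)
    return {trigram for trigram, _ in counts.most_common(k)}


def shared_trigram_counts(docs, k=10):
    """Map each pair of document ids to the number of top trigrams they share.

    Pairs with no trigram in common are left out, so every entry in the
    result is a 'connection' and its value measures how close it is.
    """
    tops = {doc_id: top_trigrams(text, k) for doc_id, text in docs.items()}
    links = {}
    for a, b in combinations(sorted(tops), 2):
        shared = len(tops[a] & tops[b])
        if shared > 0:
            links[(a, b)] = shared
    return links


# Hypothetical toy texts standing in for the full text of books or chapters.
docs = {
    "book_a": "open access books open access books are free to read",
    "book_b": "we study open access books and open access books only",
    "book_c": "the cat sat on the mat the cat sat on the mat",
}
links = shared_trigram_counts(docs, k=3)
```

In this toy data only `book_a` and `book_b` end up connected, because they share the trigram "open access books". On a real collection the pairwise comparison grows quadratically with the number of publications, so an inverted index from trigram to publication would probably be a better fit.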

Date: 2021-02-24

Date Submitted: 2021-02-24

Identifier
DOI https://doi.org/10.17026/dans-xbm-qr5e
Metadata Access https://phys-techsciences.datastations.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.17026/dans-xbm-qr5e
Provenance
Creator R. Snijder
Publisher DANS Data Station Phys-Tech Sciences
Contributor R. Snijder
Publication Year 2021
Rights CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess true
Contact R. Snijder (OAPEN Foundation)
Representation
Resource Type Dataset
Format text/csv; text/plain; application/vnd.oasis.opendocument.spreadsheet; application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; application/zip
Size 244483; 18862; 28269947; 16859325; 30154; 17377
Version 2.0
Discipline Other