Phishing Websites Dataset

The dataset consists of a collection of legitimate as well as phishing website instances. Each instance contains the URL and the relevant HTML page. The index.sql file is the root file, and it can be used to map the URLs with the relevant HTML pages. The dataset can serve as an input for the machine learning process.Highlights: - Total number of instances: 80,000 (83,275 instances in the dataset due to the existence of some removed SQL records in preprocessing stage) - Number of legitimate website instances (labelled as 0 in the SQL file): 50,000 - Number of phishing website instances (labelled as 1 in the SQL file): 30,000Structure:The index.sql file is the root file. It consisted of five fields. 1). rec_id - record number 2). url - URL of the webpage 3). website - Filename of the webpage (i.e. 1635698138155948.html) 4). result - Indicates whether a given URL is phishing or not (0 for legitimate and 1 for phishing). 5). created_date - Webpage downloaded dateSources: - Legitimate Data [50,000] - These data were collected from two sources. 1). Google search - Simple keyword search on the google search engine was used, and the top 5 URLs of each search were collected. Domain restrictions were used and limited a maximum of 10 collections from a domain to have a diverse collection at the end. 2). Ebbu2017 Phishing Dataset [1] - Nearly 25,874 active URLs were collected from this repository - Phishing Data [30,000] - Three sources were used. 1). PhishTank - From 01 December 2020 to 31 October 2021 2). OpenPhish - From 29 September 2021 to 31 October 2021 3). PhishRepo [2] - From 29 September 2021 to 31 October 2021Data Collection Process: - Legitimate Data: - The URLs were collected from the above sources and fetched the relevant webpages separately. - The URLs are in different lengths to minimize the URL lengths issue mentioned by Verma et al. [3]. - Phishing Data: - The URLs were collected from the above sources, and at the same time, the relevant web pages were fetched. - An automated script continuously monitored PhishTank and OpenPhish to collect the latest phishing URLs. - The collected URLs were fetched simultaneously to minimize the resource unavailable issue since the phishing pages do not exist for a longer period on the web. - PhishRepo provides all the resources relevant to a phishing webpage; therefore, simply use their download function to download PhishRepo data.References:[1]. Ebbu2017 Phishing Dataset. Accessed 31 October 2021. Available: https://github.com/ebubekirbbr/pdd/tree/master/input.[2]. PhishRepo. Accessed 31 October 2021. Available: https://moraphishdet.projects.uom.lk/phishrepo/.[3]. Verma, Rakesh M., Victor Zeng, and Houtan Faridi. "Data quality for security challenges: Case studies of phishing, malware and intrusion detection datasets.", 2019.

THIS DATASET IS ARCHIVED AT DANS/EASY, BUT NOT ACCESSIBLE HERE. TO VIEW A LIST OF FILES AND ACCESS THE FILES IN THIS DATASET CLICK ON THE DOI-LINK ABOVE

Identifier
DOI https://doi.org/10.17632/n96ncsr5g4.1
PID https://nbn-resolving.org/urn:nbn:nl:ui:13-c9-tide
Metadata Access https://easy.dans.knaw.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:easy.dans.knaw.nl:easy-dataset:227794
Provenance
Creator Ariyadasa, S
Publisher Data Archiving and Networked Services (DANS)
Contributor Subhash Ariyadasa
Publication Year 2021
Rights info:eu-repo/semantics/openAccess; License: http://creativecommons.org/licenses/by/4.0; http://creativecommons.org/licenses/by/4.0
OpenAccess true
Representation
Resource Type Dataset
Discipline Other