SDOstreamclust Evaluation Tests
conducted for the paper: Stream Clustering Robust to Concept Drift
Context and methodology
SDOstreamclust is a stream clustering algorithm able to process data incrementally or in batches. It combines the earlier SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and built upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.
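To illustrate the difference between incremental and per-batch processing, the following is a minimal sketch of how a stream can be cut into chunks before being handed to a stream clustering model. The helper `iter_batches` and the toy `points` list are hypothetical and not part of this repository; no SDOstreamclust call is shown, since its API is not described here.

```python
from itertools import islice


def iter_batches(stream, batch_size):
    """Yield consecutive lists of at most `batch_size` items from an iterable.

    Hypothetical helper: a batch size of 1 corresponds to incremental
    (point-by-point) processing, larger values to per-batch processing.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy 2D stream (made-up values, for illustration only).
points = [(0.1, 0.2), (0.3, 0.1), (5.0, 5.1), (5.2, 4.9), (0.2, 0.3)]

incremental = list(iter_batches(points, 1))  # one point per step
batched = list(iter_batches(points, 2))      # two points per step, last batch shorter
```

Each yielded batch would then be passed to the model in arrival order, which preserves the temporal structure that concept drift evaluation relies on.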
In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans.
This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.
Docker
A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust
Technical details
Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:
[algorithms] contains a script with functions related to algorithm configurations.
[data] contains datasets in ARFF format.
[results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
"dependencies.sh" lists and installs Python dependencies.
"pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
"README.md" shows details and instructions to use this repository.
"run.sh" runs the complete experiments.
"run_comp.py" runs the experiments specified by arguments.
"TSindex.py" implements functions for the Temporal Silhouette index.
Note: if the SDOstreamclust code is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust reinstalled with pip afterwards.
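Since the datasets in [data] are stored in ARFF format, the following minimal sketch shows how a simple, dense, all-numeric ARFF file can be read using only the Python standard library. The function `read_numeric_arff`, the toy file contents, and the assumption that the last attribute holds the cluster label are illustrative and not taken from this repository; attribute names containing spaces or quotes and sparse ARFF rows are not handled. For real use, an established reader such as `scipy.io.arff.loadarff` is likely the better choice.

```python
import io


def read_numeric_arff(handle):
    """Parse a simple dense ARFF file and return (attribute_names, rows).

    Assumption: every attribute is numeric, so each data row is a
    comma-separated list of values that can be converted to float.
    """
    names, rows, in_data = [], [], False
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lowered = line.lower()
        if not in_data:
            if lowered.startswith("@attribute"):
                names.append(line.split()[1])
            elif lowered.startswith("@data"):
                in_data = True
            continue
        rows.append([float(value) for value in line.split(",")])
    return names, rows


# Hypothetical toy file; the real datasets live in the [data] folder.
toy_arff = io.StringIO(
    "% toy example\n"
    "@relation toy\n"
    "@attribute x numeric\n"
    "@attribute y numeric\n"
    "@attribute class numeric\n"
    "@data\n"
    "0.1,0.2,0\n"
    "5.0,5.1,1\n"
)

names, rows = read_numeric_arff(toy_arff)

# Assumption: the last column is the ground-truth label.
features = [row[:-1] for row in rows]
labels = [int(row[-1]) for row in rows]
```

With an actual dataset, `toy_arff` would be replaced by an open file handle, for example `open(path)` for a path inside [data].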
License
The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.