Temporal Silhouette for Stream Clustering Validation - Evaluation Tests
conducted for the paper: Temporal Silhouette: Validation of Stream Clustering Robust to Concept Drift
Context and methodology
The Temporal Silhouette (TS) is an index for the internal validation of stream clustering that is robust and consistent in the event of concept drift and different types of outliers. TS is based on the well-known Silhouette index (Rousseeuw, 1987).
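For readers unfamiliar with the classic Silhouette index that TS builds on, the following is a minimal pure-Python sketch of the standard definition from Rousseeuw (1987): for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and s = (b - a) / max(a, b). This is not the code in "TSindex.py" and does not include the temporal extension; the function names and toy data are illustrative assumptions.

```python
from math import dist


def silhouette_samples(points, labels):
    """Classic (non-temporal) Silhouette value s(i) for every point."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette index needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention from Rousseeuw (1987): singleton clusters score 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        # Guard against coincident points, where a and b are both 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(points, labels):
    """Mean Silhouette value over all points (higher is better)."""
    values = silhouette_samples(points, labels)
    return sum(values) / len(values)


# Hypothetical toy data: two compact, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

The good labelling scores close to 1, while the labelling that cuts across the groups scores below 0. Because this definition compares all points regardless of when they arrived, it takes no account of clusters moving over time, which is the limitation TS is designed to address.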
In this repository, TS is compared with 3 popular CVIs (Silhouette, Davies-Bouldin, Calinski-Harabasz) and 3 iCVIs (incremental Xie-Beni index, incremental Partition Separation index, incremental representative Cross Information Potential) when evaluating the performance of 4 stream clustering algorithms (CluStream, DenStream, BIRCH and StreamKMeans++). Different data scenarios are used: 2 real-life cases, 4 popular stationary datasets for clustering evaluation subjected to 32 different forms and levels of degradation, and 200 synthetic scenarios that implement the different types of concept drift identified in the literature, as well as spatial and temporal outliers.
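To make the baseline CVIs more concrete, here is a minimal pure-Python sketch of the textbook definitions of the Davies-Bouldin index (lower is better) and the Calinski-Harabasz index (higher is better). It is an illustration only, not the implementation used by the test scripts. The helper names and toy data are assumptions, and the sketch assumes at least two clusters with distinct centroids and non-zero within-cluster dispersion.

```python
from math import dist


def _centroid(rows):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def _group(points, labels):
    """Map each cluster label to the list of its points."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return groups


def davies_bouldin(points, labels):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    groups = list(_group(points, labels).values())
    centroids = [_centroid(g) for g in groups]
    # Scatter: mean distance of the members to their own centroid.
    scatter = [sum(dist(p, c) for p in g) / len(g) for g, c in zip(groups, centroids)]
    k = len(groups)
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    groups = list(_group(points, labels).values())
    n, k = len(points), len(groups)
    overall = _centroid(points)
    # Between-cluster dispersion, weighted by cluster size.
    between = sum(len(g) * dist(_centroid(g), overall) ** 2 for g in groups)
    # Within-cluster dispersion: squared distances to the own centroid.
    within = sum(dist(p, _centroid(g)) ** 2 for g in groups for p in g)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two compact, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
print(davies_bouldin(toy_points, toy_labels))
print(calinski_harabasz(toy_points, toy_labels))
```

Both indices are computed from a single static snapshot of points and labels, so, like the classic Silhouette, they take no account of the order in which points arrive in a stream.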
This repository falls within the following research domains: algorithm evaluation, streaming data analysis, stream clustering, unsupervised learning, machine learning, data mining, data analysis. The datasets and algorithms can be used for experiment replication and for further clustering evaluation and comparison.
References
Rousseeuw PJ (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53–65
Technical details
Experiments are conducted in Python 3. The file and folder structure is as follows:
[dataR] contains 4 datasets obtained from real data.
[dataS] contains 80 synthetic datasets for concept drift tests.
[dataT] contains 4 datasets for stationary tests.
[plots] contains plots generated by the test scripts.
[results] contains tables with results generated by the test scripts.
[utils] contains utilities for transforming data and plotting results.
"dependencies.py" installs the required Python packages.
"LICENSE" file.
"README.md" provides further details, links to sources, and instructions for reproducibility.
"run_analysis_real.py" runs experiments with stream clustering and real data.
"run_analysis_synthetic.py" runs experiments with stream clustering and synthetic data subjected to concept drift.
"run_stationary.py" runs experiments with stationary data subjected to different perturbations.
"run_TS_stability.py" runs a sensitivity analysis on the TS parameters w and k.
"toy_tests.py" shows simple examples of TS in the main cases of concept drift.
"TSindex.py" implements and provides the TS functions.
License
The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GNU GPL license.
Note
This version replaces and makes obsolete:
Iglesias Vázquez, Felix (2023). py-temporal-silhouette-main.zip. figshare. Conference contribution. https://doi.org/10.6084/m9.figshare.22149854.v1