SDOclust Evaluation Tests
conducted for the paper: SDOclust: Clustering with Sparse Data Observers
Context and methodology
SDOclust is a clustering extension of the Sparse Data Observers (SDO) algorithm. SDOclust uses data observers as graph nodes and clusters them by means of connected components and local thresholding. The observers' labels are subsequently propagated to the data points.
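The idea can be illustrated with a minimal sketch. It is not the implementation in "sdo.py": the observers are assumed to be given already, the local threshold is assumed to be the distance to the k-th nearest observer, and each data point is assumed to take the label of its single nearest observer. The function names `cluster_observers` and `propagate_labels` are hypothetical.

```python
import math

def cluster_observers(observers, k=2):
    """Label observers by connected components of a locally thresholded graph."""
    n = len(observers)
    dist = [[math.dist(a, b) for b in observers] for a in observers]
    # Assumed local threshold: distance to the k-th nearest other observer
    # (index 0 of each sorted row is the observer itself, at distance 0).
    thresholds = [sorted(row)[k] for row in dist]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first traversal of one connected component.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # Two observers are linked only if their distance is within
                # the local thresholds of both of them.
                if labels[j] == -1 and dist[i][j] <= min(thresholds[i], thresholds[j]):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def propagate_labels(points, observers, observer_labels):
    """Give each data point the label of its nearest observer (assumed rule)."""
    result = []
    for p in points:
        nearest = min(range(len(observers)), key=lambda i: math.dist(p, observers[i]))
        result.append(observer_labels[nearest])
    return result

# Toy example: two well-separated groups of three observers each.
observers = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
observer_labels = cluster_observers(observers, k=2)
point_labels = propagate_labels([(0.2, 0.1), (10.2, 10.1)], observers, observer_labels)
```

In this toy case the two groups of observers end up in separate components, and each point inherits the label of the group it lies in. How observers are sampled and how the thresholds are actually defined is left to the paper and to "sdo.py".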
In this repository, SDOclust is evaluated with 15 two-dimensional synthetic datasets, 138 multi-dimensional synthetic datasets, and 2 real-application datasets, and compared with the HDBSCAN and k-means-- algorithms.
This repository is framed within the research on the following domains: algorithm evaluation, clustering, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further clustering evaluation and comparison.
Technical details
Experiments are conducted in Python 3. The file and folder structure is as follows:
[data2d] contains 15 two-dimensional datasets as CSV files (last column is the label).
[dataMd] contains 138 multi-dimensional datasets as CSV files (last column is the label).
[dataReal] contains 2 real-application datasets as CSV files (last column is the label).
[plots] contains plots (.png, .pdf) with results generated by test scripts.
[tables] contains tables (.csv, .tex) with results generated by test scripts.
[cddiag] contains scripts for generating critical difference diagrams with Wilcoxon signed-rank tests and plots from conducted tests.
"dependencies.py" installs required Python packages.
"tests_2d.py" runs 2d experiments.
"tests_Md.py" runs multi-dimensional experiments.
"test_mawi.py" runs experiments with real network traffic data from MAWI captures.
"test_sirena.py" runs experiments with real electricity consumption data from the Sirena project.
"sdo.py" implements SDOclust functions.
"pamse2d.py" runs sensitivity analysis on SDOclust parameters.
"update_test.py" shows an example of SDOclust working in update mode.
"gbc.py" contains functions for the graph-based clustering implementation (based on https://github.com/dayyass/graph-based-clustering).
"kmeansmm.py" is the k-means-- implementation (based on https://github.com/Strizzo/kmeans--).
"LICENSE" file.
"README.md" for further details, link to sources and instructions for reproducibility.
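Since all dataset folders store the label in the last CSV column, a file can be split into features and labels as in the following sketch. It assumes files without a header row, numeric feature values, and integer-valued labels, none of which is stated above; the function name `load_labeled_csv` and the inline sample are hypothetical and stand in for a real file from [data2d], [dataMd] or [dataReal].

```python
import csv
import io

def load_labeled_csv(handle):
    """Split CSV rows into feature vectors and labels (label = last column)."""
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        *values, label = row
        features.append([float(v) for v in values])
        # Assumption: labels are integer-valued, possibly written as "1.0".
        labels.append(int(float(label)))
    return features, labels

# Hypothetical two-row sample standing in for a dataset file.
sample = io.StringIO("0.1,0.2,0\n0.9,0.8,1\n")
X, y = load_labeled_csv(sample)
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, newline="")`.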
License
The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GNU GPL license.