These are the experimental data for the paper
Bach, Jakob. "Finding Optimal Diverse Feature Sets with Alternative Feature Selection"
published on arXiv in 2023.
You can find the paper here and the code here.
See the README for details.
The datasets used in our study (which we also provide here) originate from PMLB.
The corresponding GitHub repository is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).
Please see the file `LICENSE` in the folder `datasets/` for the license text.
Experimental Data for the Paper "Finding Optimal Diverse Feature Sets with Alternative Feature Selection"
These are the experimental data for the paper
Bach, Jakob. "Finding Optimal Diverse Feature Sets with Alternative Feature Selection"
published on arXiv in 2023.
If we create multiple versions of this paper in the future, these experimental data will cover at least the first version.
Check our GitHub repository for the code and instructions to reproduce the experiments.
The data were obtained on a server with an `AMD EPYC 7551` CPU (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The operating system was `Ubuntu 20.04.6 LTS`.
The Python version was `3.8`.
With this configuration, running the experimental pipeline (`run_experiments.py`) took about 255 h.
The commit hash for the last run of the experimental pipeline (`run_experiments.py`) is `2212360a32`.
The commit hash for the last run of the evaluation pipeline (`run_evaluation.py`) is `487b8ba04d`.
We also tagged both commits (`run-2023-06-23` and `evaluation-2023-07-04`).
The experimental data are stored in two folders, `datasets/` and `results/`.
Further, the console output of `run_evaluation.py` is stored in `Evaluation_console_output.txt` (manually copied from the console to a file).
In the following, we describe the structure and content of each data file.
datasets/
These are the input data for the experimental pipeline `run_experiments.py`, i.e., prediction datasets.
The folder contains one overview file, one license file, and two files for each of the 30 datasets.
The original datasets were downloaded from PMLB with the script `prepare_datasets.py`.
Note that we do not own the copyright for these datasets.
However, the GitHub repository of PMLB, which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).
Thus, we include the file `LICENSE` from that repository.
After downloading from `PMLB`, we split each dataset into the feature part (`_X.csv`) and the target part (`_y.csv`), which we save separately.
Both files are CSVs only containing numeric values (categorical features are ordinally encoded in `PMLB`) except for the column names.
There are no missing values.
Each row corresponds to a data object (= instance, sample), and each column corresponds to a feature.
The first line in each CSV contains the names of the features as strings; for `_y.csv` files, there is only one column, always named `target`.
`_dataset_overview.csv` contains meta-data for the datasets, like the number of instances and features.
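As a minimal sketch of how the two files of one dataset could be read back in: the helper `load_dataset()` is hypothetical (it is not part of our code), and it assumes that the file names consist of the dataset name followed by `_X.csv` or `_y.csv`. The tiny dataset `demo` is made up so that the example runs without the real files.

```python
import pathlib
import tempfile

import pandas as pd


def load_dataset(dataset_name, directory):
    """Hypothetical helper: load the feature part and the target part of one dataset."""
    directory = pathlib.Path(directory)
    # Assumption: files are named "<dataset_name>_X.csv" and "<dataset_name>_y.csv".
    X = pd.read_csv(directory / f'{dataset_name}_X.csv')
    # The target file has exactly one column, named "target".
    y = pd.read_csv(directory / f'{dataset_name}_y.csv')['target']
    assert len(X) == len(y), 'Feature part and target part need the same number of rows.'
    return X, y


# Demo with a tiny made-up dataset, written to a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = pathlib.Path(demo_dir)
    pd.DataFrame({'f1': [0, 1, 2], 'f2': [1.5, 2.5, 3.5]}).to_csv(
        demo_path / 'demo_X.csv', index=False)
    pd.DataFrame({'target': [0, 1, 0]}).to_csv(demo_path / 'demo_y.csv', index=False)
    X, y = load_dataset('demo', directory=demo_path)

print(X.shape, y.name)
```

For the real data, you would pass `datasets/` as the directory and one of the 30 dataset names.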
results/
These are the output data of the experimental pipeline in the form of CSVs, produced by the script `run_experiments.py`.
`_results.csv` contains all results merged into one file and acts as input for the script `run_evaluation.py`.
The remaining files are subsets of the results, as the experimental pipeline parallelizes over 30 datasets, 5 cross-validation folds, and 5 feature-selection methods.
Thus, there are `30 * 5 * 5 = 750` files containing subsets of the results.
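If you only have the partial files at hand, a merge along the following lines might work. This is a sketch under assumptions, not the merging code of the experimental pipeline: the helper `merge_partial_results()` and the file names `part_a.csv` and `part_b.csv` are made up, and the row order of the output may differ from `_results.csv`.

```python
import pathlib
import tempfile

import pandas as pd


def merge_partial_results(directory, merged_name='_results.csv'):
    """Hypothetical helper: concatenate all partial result files in a folder."""
    directory = pathlib.Path(directory)
    # Skip the already merged file so that no result is counted twice.
    partial_files = sorted(f for f in directory.glob('*.csv') if f.name != merged_name)
    # Columns missing in some files (like "wrapper_iters") are filled with NaN.
    return pd.concat([pd.read_csv(f) for f in partial_files], ignore_index=True)


# Demo with made-up partial files; the real file names may look different.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = pathlib.Path(demo_dir)
    pd.DataFrame({'fs_name': ['MISelector', 'MISelector'], 'k': [5, 10]}).to_csv(
        demo_path / 'part_a.csv', index=False)
    pd.DataFrame({'fs_name': ['GreedyWrapperSelector'], 'k': [5], 'wrapper_iters': [42]}).to_csv(
        demo_path / 'part_b.csv', index=False)
    pd.DataFrame({'fs_name': ['ignored'], 'k': [0]}).to_csv(
        demo_path / '_results.csv', index=False)
    merged = merge_partial_results(demo_path)

print(merged)
```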
Each row in a result file corresponds to one feature set.
One can identify individual search runs for alternatives with a combination of multiple columns, i.e.:
- dataset `dataset_name`
- cross-validation fold `split_idx`
- feature-selection method `fs_name`
- search method `search_name`
- objective aggregation `objective_agg`
- feature-set size `k`
- number of alternatives `num_alternatives`
- dissimilarity threshold `tau_abs`
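The column combination listed above can serve as a grouping key in `pandas`. The following sketch uses made-up rows rather than real results; the constant `SEARCH_RUN_COLUMNS` is a name chosen for illustration.

```python
import pandas as pd

# Columns that jointly identify one search run for alternatives.
SEARCH_RUN_COLUMNS = ['dataset_name', 'split_idx', 'fs_name', 'search_name',
                      'objective_agg', 'k', 'num_alternatives', 'tau_abs']

# Made-up rows: two simultaneous-search runs that only differ in "tau_abs".
# Each run yields two feature sets (the original one and one alternative).
results = pd.DataFrame({
    'dataset_name': ['demo'] * 4,
    'split_idx': [0] * 4,
    'fs_name': ['MISelector'] * 4,
    'search_name': ['search_simultaneously'] * 4,
    'objective_agg': ['min'] * 4,
    'k': [5] * 4,
    'num_alternatives': [1] * 4,
    'tau_abs': [2, 2, 3, 3],
    'train_objective': [0.8, 0.7, 0.9, 0.5],
})

# Number of feature sets and mean training objective per search run.
run_sizes = results.groupby(SEARCH_RUN_COLUMNS).size()
run_quality = results.groupby(SEARCH_RUN_COLUMNS)['train_objective'].mean()
print(run_sizes)
print(run_quality)
```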
The remaining columns mostly represent evaluation metrics.
In detail, all result files contain the following columns:
- `selected_idxs` (list of ints, e.g., `[0, 4, 5, 6, 8]`): The indices (starting from 0) of the selected features (i.e., columns in the dataset).
  Might also be an empty list, i.e., `[]`, if no valid solution was found.
  In that case, the two `_objective` columns and the four `_mcc` columns contain a missing value (empty string).
- `train_objective` (float in `[-1,1]`): The training-set objective value of the feature set.
  Three methods (FCBF, MI, Model Importance) have the range `[0,1]`, while two methods (mRMR, Greedy Wrapper) have the range `[-1,1]`.
- `test_objective` (float in `[-1,1]`): The test-set objective value of the feature set.
- `optimization_time` (non-negative float): Time for alternative feature selection in seconds.
  For white-box feature-selection methods, this measures one solver call.
  In contrast, wrapper feature selection can call the solver multiple times, and we record the total runtime of the "Greedy Wrapper" algorithm.
- `optimization_status` (int in `{0, 1, 2, 6}`): The status of the solver; for wrapper feature selection, this is only the status of the last solver call.
  - 0 = (proven as) optimal
  - 1 = feasible (valid solution, but might be suboptimal)
  - 2 = (proven as) infeasible
  - 6 = not solved (no valid solution found, but one might exist)
- `decision_tree_train_mcc` (float in `[-1,1]`): Training-set prediction performance (in terms of Matthews Correlation Coefficient) of a decision tree trained with the selected features.
- `decision_tree_test_mcc` (float in `[-1,1]`): Test-set prediction performance (in terms of Matthews Correlation Coefficient) of a decision tree trained with the selected features.
- `random_forest_train_mcc` (float in `[-1,1]`): Training-set prediction performance (in terms of Matthews Correlation Coefficient) of a random forest trained with the selected features.
- `random_forest_test_mcc` (float in `[-1,1]`): Test-set prediction performance (in terms of Matthews Correlation Coefficient) of a random forest trained with the selected features.
- `k` (int in `{5, 10}`): The number of features to be selected.
- `tau_abs` (int in `{1, ..., 10}`): The dissimilarity threshold for alternatives, corresponding to the absolute number of features (`k * tau`) that have to differ between feature sets.
- `num_alternatives` (int in `{1, 2, 3, 4, 5, 10}`): The number of desired alternative feature sets, not counting the original (zeroth) feature set.
  A number from `{1, 2, 3, 4, 5}` for simultaneous search and always `10` for sequential search.
- `objective_agg` (string, 2 different values): The name of the quality-aggregation function for alternatives (`min` or `sum`).
  Min-aggregation or sum-aggregation for simultaneous search but always sum-aggregation for sequential search.
  (In fact, sequential search only optimizes individual feature sets anyway, so the aggregation does not matter for the search.)
- `search_name` (string, 2 different values): The name of the search method for alternatives (`search_sequentially` or `search_simultaneously`).
- `fs_name` (string, 5 different values): The name of the feature-selection method (`FCBFSelector`, `MISelector`, `ModelImportanceSelector` (= Model Gain), `MRMRSelector`, or `GreedyWrapperSelector`).
- `dataset_name` (string, 30 different values): The name of the `PMLB` dataset.
- `n` (positive int): The number of features of the `PMLB` dataset.
- `split_idx` (int in `[0,4]`): The index of the cross-validation fold.
- `wrapper_iters` (int in `[1,1000]`): The number of iterations if wrapper feature selection was used; a missing value (empty string) otherwise.
  This column does not exist in result files not containing wrapper results.
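A few derived columns can make the results easier to read. The sketch below uses made-up rows; the names `STATUS_LABELS`, `status_label`, `tau`, and `has_solution` are chosen for illustration and do not occur in the result files. As the status only describes the last solver call for wrapper feature selection, the sketch detects missing solutions via the objective column rather than the status.

```python
import pandas as pd

# Labels for the solver-status codes described above.
STATUS_LABELS = {0: 'optimal', 1: 'feasible', 2: 'infeasible', 6: 'not solved'}

# Made-up rows; missing objective values mimic runs without a valid solution.
results = pd.DataFrame({
    'optimization_status': [0, 1, 2, 6],
    'k': [5, 5, 10, 10],
    'tau_abs': [1, 5, 2, 10],
    'train_objective': [0.9, 0.8, None, None],
})

# Human-readable solver status.
results['status_label'] = results['optimization_status'].map(STATUS_LABELS)
# Relative dissimilarity threshold, obtained by inverting tau_abs = k * tau.
results['tau'] = results['tau_abs'] / results['k']
# A missing objective value indicates that no valid feature set was found.
results['has_solution'] = results['train_objective'].notna()
print(results)
```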
You can easily read in any of the result files with `pandas`:
```python
import pandas as pd
results = pd.read_csv('results/_results.csv')
```
All result files are comma-separated and contain plain numbers and unquoted strings, apart from the column `selected_idxs` (which is quoted and represents lists of integers).
The first line in each result file contains the column names.
You can use the following code to make sure that lists of feature indices are treated as such (rather than strings):
```python
import ast
results['selected_idxs'] = results['selected_idxs'].apply(ast.literal_eval)
```
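Alternatively, the conversion can happen while reading the file, via the `converters` parameter of `pd.read_csv()`. The following sketch parses a made-up excerpt of a result file and then counts how many features of one feature set do not occur in another one. The helper `num_differing_features()` is hypothetical, and reading "differ" as a set difference is our assumption for this illustration.

```python
import ast
import io

import pandas as pd

# Made-up excerpt of a result file (only three of the columns).
csv_text = '''selected_idxs,k,tau_abs
"[0, 4, 5, 6, 8]",5,2
"[0, 1, 2, 4, 5]",5,2
"[]",5,2
'''

# Convert the feature-index lists while parsing the CSV.
results = pd.read_csv(io.StringIO(csv_text), converters={'selected_idxs': ast.literal_eval})


def num_differing_features(idxs_1, idxs_2):
    """Hypothetical helper: count the features of the first set missing in the second set."""
    return len(set(idxs_1) - set(idxs_2))


first = results['selected_idxs'].iloc[0]
second = results['selected_idxs'].iloc[1]
num_different = num_differing_features(first, second)
print(num_different, num_different >= results['tau_abs'].iloc[0])
```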