Datasets
This website provides a set of public benchmark datasets for evaluating learning algorithms in nonstationary environments. In particular, it provides datasets with incremental and gradual concept drift. These datasets are well suited to evaluating stream algorithms that do not require actual labels during the online classification phase, a condition known as extreme verification latency.
We hope that this benchmark will encourage other researchers to share their data, code, and detailed results, improving reproducibility in the area.
For a better understanding of the properties of each dataset, see an animated visualization of each dataset.
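As a rough illustration of the evaluation setting described above (extreme verification latency), the sketch below scores a classifier that sees true labels only for an initial portion of the stream; afterwards the labels are used solely to measure accuracy. Everything in it is an assumption made for illustration: the synthetic one-dimensional stream, the function names `make_drifting_stream` and `evaluate_extreme_latency`, and the parameters `n_initial` and `rate` are hypothetical and do not come from the benchmark files. The classifier is a simple nearest-centroid model that updates itself with its own predictions; it is not SCARGC.

```python
import random


def make_drifting_stream(n_steps=2000, seed=0):
    """Yield (x, label) pairs from a synthetic two-class 1-D stream.

    Hypothetical data: both class means move slowly to the right,
    which imitates an incremental concept drift.
    """
    rng = random.Random(seed)
    for t in range(n_steps):
        shift = 4.0 * t / n_steps              # slow drift of both classes
        label = rng.randint(0, 1)
        mean = (0.0 if label == 0 else 3.0) + shift
        yield rng.gauss(mean, 0.5), label


def evaluate_extreme_latency(stream, n_initial=100, rate=0.05):
    """Return accuracy when true labels are available only for n_initial points.

    After the initial phase, the model is updated with its own predicted
    labels (self-training); true labels are used only for scoring.
    With rate=0.0 the centroids never move, giving a static baseline.
    Assumes both classes appear among the first n_initial points.
    """
    stream = iter(stream)
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n_initial):                 # the only labeled data the model sees
        x, y = next(stream)
        sums[y] += x
        counts[y] += 1
    centroids = {c: sums[c] / counts[c] for c in sums}

    correct = total = 0
    for x, y in stream:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        # Move the predicted class centroid toward x; y is not used here.
        centroids[pred] += rate * (x - centroids[pred])
        correct += int(pred == y)              # y is used for evaluation only
        total += 1
    return correct / total


adaptive = evaluate_extreme_latency(make_drifting_stream(), rate=0.05)
static = evaluate_extreme_latency(make_drifting_stream(), rate=0.0)
print(f"self-updating: {adaptive:.3f}  static: {static:.3f}")
```

Comparing the two runs shows why this setting is demanding: a model trained once on the initial labels degrades as the classes drift, while a model that adapts without labels can keep tracking them, at least on this easy synthetic stream.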
Download (all datasets ~15MB)
Stream Classification Algorithm Guided by Clustering - SCARGC
Download (Source code)
How to cite this benchmark?
Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A. : Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. SIAM International Conference on Data Mining (SDM), pp. 873-881, 2015.
DOI: http://dx.doi.org/10.1137/1.9781611974010.98
Dataset donors:
[1] - These datasets were kindly provided by the authors of the following paper: Dyer, K.B., Capo, R., Polikar, R. : COMPOSE: A Semisupervised Learning Framework for Initially Labeled Nonstationary Streaming Data. IEEE Transactions on Neural Networks and Learning Systems, Vol. 25, No. 1, pp. 12-26, 2014.
[2] - Ditzler, G., Polikar, R. : Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 10, pp. 2283-2301, 2013.
[3] - Dataset based on the CMU dataset first presented by the authors of the following paper: Killourhy, K., Maxion, R. : Why did my detector do that?! In Recent Advances in Intelligent Data Analysis X, pp. 222-233, 2011.