See the author's git repository here: https://github.com/Rohit102497/HaphazardInputsReview/tree/main
Please consider citing the paper below if you use the code provided in this repository.
@article{agarwal2024online,
title={Online Learning under Haphazard Input Conditions: A Comprehensive Review and Analysis},
author={Agarwal, Rohit and Das, Arijit and Horsch, Alexander and Agarwal, Krishna and Prasad, Dilip K},
journal={arXiv preprint arXiv:2404.04903},
year={2024}
}
This repository contains the datasets and the implementations of the different models evaluated in the paper "Online Learning under Haphazard Input Conditions: A Comprehensive Review and Analysis".
HaphazardInputsReview/
┣ Code/
┃ ┣ AnalyseResults/
┃ ┣ Config/
┃ ┣ DataCode/
┃ ┣ Models/
┃ ┣ main.py
┃ ┣ read_results.py
┃ ┗ requirements.txt
┣ Data/
┣ Results/
┣ .gitignore
┗ README.md
We use 20 different datasets in this project. Links to all the datasets are given below, and some of the datasets are also provided in their respective folders inside the Data/ directory. For the rest, please download the dataset files from the links below and place them inside their respective directories (see the instructions for each dataset below).
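The UCI-hosted datasets in the list below can be fetched and unpacked with a few lines of Python. This is a minimal sketch: the zip URL pattern and target directory are assumptions, so verify them against the download link on each dataset page and the Directory entries below.

```python
# Minimal sketch for fetching a UCI-hosted dataset archive.
# The zip URL pattern is an assumption -- confirm it against the
# "Download" link on the dataset page before relying on it.
import io
import urllib.request
import zipfile

def fetch_zip(url: str, dest_dir: str) -> None:
    """Download a zip archive and extract its contents into dest_dir."""
    with urllib.request.urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    archive.extractall(dest_dir)

# Example: the ionosphere dataset into its expected directory.
fetch_zip("https://archive.ics.uci.edu/static/public/52/ionosphere.zip",
          "DataStorage/ionosphere/")
```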
Small Datasets
- Data link: https://archive.ics.uci.edu/dataset/16/breast+cancer+wisconsin+prognostic
  Directory: DataStorage/wpbc/ (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/52/ionosphere
  Directory: DataStorage/ionosphere/ (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
  Directory: DataStorage/wdbc/ (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/143/statlog+australian+credit+approval
  Directory: DataStorage/australian/ (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original
  Directory: DataStorage/wbc/ (provided in repository/not provided in repository)
- Data link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
  Directory: DataStorage/diabetes_f/ (provided in repository/not provided in repository)
  Instructions: After downloading the file, rename it from diabetes.csv to diabetes_f.csv (a rename sketch follows this list).
- Data link: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
  Directory: DataStorage/german (provided in repository/not provided in repository)
- Data link: https://www.timeseriesclassification.com/description.php?Dataset=ItalyPowerDemand
  Directory: DataStorage/ipd (provided in repository/not provided in repository)
  Instructions: Download the dataset from the link and place the files ItalyPowerDemand_TEST.txt and ItalyPowerDemand_TRAIN.txt inside the directory.
- Data link: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#svmguide3
  Directory: DataStorage/svmguide3 (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/22/chess+king+rook+vs+king+pawn
  Directory: DataStorage/krvskp (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/94/spambase
  Directory: DataStorage/spambase (provided in repository/not provided in repository)
- Data link: https://spamassassin.apache.org/old/publiccorpus/
  Directory: DataStorage/spamassasin (provided in repository/not provided in repository)
Note: Two more small datasets used for analysis, namely crowdsense(c3) and crowdsense(c5), are not provided because they are not publicly available.
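For the Pima Indians diabetes entry above, the required rename can be scripted; this sketch assumes the downloaded file already sits in its target directory:

```python
# Rename the Kaggle download to the filename the code expects.
import os

os.rename("DataStorage/diabetes_f/diabetes.csv",
          "DataStorage/diabetes_f/diabetes_f.csv")
```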
Medium Datasets
- Data link: https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope
  Directory: DataStorage/magic04 (provided in repository/not provided in repository)
- Data link: https://ai.stanford.edu/~amaas/data/sentiment/
  Directory: DataStorage/imdb (provided in repository/not provided in repository)
- Data link: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a8a
  Directory: DataStorage/a8a (provided in repository/not provided in repository)
Large Datasets
- Data link: Supplementary Material at https://www.hindawi.com/journals/bmri/2014/781670/#supplementary-materials
  Directory: DataStorage/diabetes_us (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/279/susy
  Directory: DataStorage/susy (provided in repository/not provided in repository)
- Data link: https://archive.ics.uci.edu/dataset/280/higgs
  Directory: DataStorage/higgs (provided in repository/not provided in repository)
Some of the datasets need to be cleaned and processed before they can be used by the models. Details on how to process these datasets are given below.
- Spamassasin
  - Download the files from the link provided and unzip them.
  - Use the script Code/DataStorage/DataPreparation/data_spamassasin_conversion.py to clean the data (invocation shown after this list).
  - Modify the 'path' variable at line 13 to the path of the directory where the unzipped files are located.
  - The cleaned data will automatically be saved in the appropriate directory.
- IMDB
  - Download the files from the link provided and unzip them.
  - Use the script Code/DataStorage/DataPreparation/data_imdb_conversion.py to clean the data (invocation shown after this list).
  - Modify the 'data_path' variable at line 10 to the path of the directory where the unzipped files are located.
  - The cleaned data will automatically be saved in the appropriate directory.
- Diabetes_us
  - After downloading the dataset from the provided link, follow the instructions at https://www.hindawi.com/journals/bmri/2014/781670/#supplementary-materials to prepare it for analysis.
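Assuming the path variables above have been edited and the commands are run from the repository root, the two conversion scripts can then be invoked directly:
python Code/DataStorage/DataPreparation/data_spamassasin_conversion.py
python Code/DataStorage/DataPreparation/data_imdb_conversion.py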
For synthetic datasets, we vary the availability of each auxiliary input feature independently, making each feature available with a uniform probability p (set via the probavailable argument described below).
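As an illustration of this scheme (a minimal sketch, not the repository's actual data-generation code under Code/DataCode/), each feature of each instance can be dropped independently with probability 1 - p:

```python
# Minimal sketch of haphazard-input simulation: every auxiliary feature
# of every instance is independently observed with probability p.
import numpy as np

rng = np.random.default_rng(2023)  # same seed as the repo's default

def haphazard_mask(X: np.ndarray, p: float = 0.5):
    """Return the masked feature matrix and the 0/1 availability mask."""
    mask = rng.binomial(1, p, size=X.shape)  # 1 = feature observed
    return X * mask, mask

X = rng.normal(size=(5, 4))              # toy stream: 5 instances, 4 features
X_obs, mask = haphazard_mask(X, p=0.75)  # mirrors --probavailable 0.75
```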
To run the models, see Code/main.py; all the comparison models can be run from it. After running a model on a dataset, run Code/read_results.py to display the evaluation and save it in CSV format.
For the main.py file, the following arguments are available:
- seed: Seed value. default = 2023
- type: The type of the experiment. default = "noassumption", type = str, choices = ["noassumption", "basefeatures", "bufferstorage"]
Data Variables
- dataname: The name of the dataset. default = "wpbc", choices = ["all", "synthetic", "crowdsense_c5", "crowdsense_c3", "spamassasin", "imdb", "diabetes_us", "higgs", "susy", "a8a", "magic04", "spambase", "krvskp", "svmguide3", "ipd", "german", "diabetes_f", "wbc", "australian", "wdbc", "ionosphere", "wpbc"]
- syndatatype: The type of synthetic dataset to create. default = "variable_p"
- probavailable: The probability of each feature being available, used to create synthetic data. default = 0.5, type = float
- ifbasefeat: Whether base features are available. default = False
Method Variables
- methodname: The name of the method (model). default = "nb3", choices = ["nb3", "fae", "olvf", "ocds", "ovfm", "dynfo", "orf3v", "auxnet", "auxdrop"]
- initialbuffer: The storage size of the initial buffer training. default = 0
- ifimputation: Whether some features need to be imputed. default = False
- imputationtype: The type of imputation technique used to create base features. default = 'forwardfill', choices = ['forwardfill', 'forwardmean', 'zerofill']
- nimputefeat: The number of imputed features. default = 2
- ifdummyfeat: Whether some dummy features need to be created. default = False
- dummytype: The type of technique used to create dummy base features. default = 'standardnormal'
- ndummyfeat: The number of dummy features to create. default = 1
- ifAuxDropNoAssumpArchChange: Whether the Aux-Drop architecture needs to be changed to handle the no-assumption case. default = False
- nruns: The number of times a method should run (for naive Bayes it is 1, because the method is deterministic). default = 5
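These variables correspond to command-line flags of main.py. The sketch below illustrates how a few of them might be declared with argparse, based on the defaults and choices listed above (not necessarily the exact code in main.py):

```python
# Illustrative argparse declarations mirroring the documented flags;
# consult Code/main.py for the authoritative definitions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=2023)
parser.add_argument("--type", type=str, default="noassumption",
                    choices=["noassumption", "basefeatures", "bufferstorage"])
parser.add_argument("--dataname", type=str, default="wpbc")
parser.add_argument("--probavailable", type=float, default=0.5)
parser.add_argument("--methodname", type=str, default="nb3")
parser.add_argument("--nruns", type=int, default=5)
args = parser.parse_args()
```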
For the read_results.py file, the following arguments are available:
- type: The type of the experiment. default = "noassumption", choices = ["noassumption", "basefeatures", "bufferstorage"]
- dataname: The name of the dataset. default = "wpbc", choices = ["synthetic", "real", "crowdsense_c5", "crowdsense_c3", "spamassasin", "imdb", "diabetes_us", "higgs", "susy", "a8a", "magic04", "spambase", "krvskp", "svmguide3", "ipd", "german", "diabetes_f", "wbc", "australian", "wdbc", "ionosphere", "wpbc"]
- probavailable: The probability of each feature being available, used to create synthetic data. default = 0.5
- methodname: The name of the method. default = "nb3", choices = ["nb3", "fae", "olvf", "ocds", "ovfm", "dynfo", "orf3v", "auxnet", "auxdrop"]
The code depends on the following Python packages:
- numpy
- torch
- pandas
- tqdm
- tdigest (version == 0.5.2.2)
- statsmodels (version == 0.14.0)
random, os, and pickle are also used, but they ship with the Python standard library.
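Since the pinned versions are captured in Code/requirements.txt (see the repository tree above), the third-party packages can be installed in one step:
pip install -r Code/requirements.txt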
To run the models, change the control parameters accordingly in the main.py file and run
python Code/main.py
Example: To run the model nb3 on the wpbc dataset with a feature-availability probability of 0.75, use the command below:
python Code/main.py --dataname wpbc --probavailable 0.75 --methodname nb3
Note: For auxnet, set either --ifimputation True or --ifdummyfeat True, and for auxdrop set --ifAuxDropNoAssumpArchChange True (these models were modified from their original implementations to support the absence of previously required base features):
python Code/main.py --dataname ionosphere --probavailable 0.75 --methodname auxnet --ifimputation True
or
python Code/main.py --dataname ionosphere --probavailable 0.75 --methodname auxnet --ifdummyfeat True
and
python Code/main.py --dataname synthetic --probavailable 0.75 --methodname auxdrop --ifAuxDropNoAssumpArchChange True
To read the results and save them in .csv format, run read_results.py with the appropriate control parameters.
python Code/read_results.py
Example: To read the results of nb3 on the wpbc dataset with a feature-availability probability of 0.75, use the command below:
python Code/read_results.py --dataname wpbc --probavailable 0.75 --methodname nb3