
Code and datasets for our comprehensive review of online learning models that handle dynamic, missing, or evolving input features. Includes implementations and tools for evaluation across multiple datasets.


bioailab/HaphazardInputsReview-Archive


Online Learning under Haphazard Input Conditions: A Comprehensive Review and Analysis

Citation

If you use the code provided in this repository, please consider citing the paper below.

@article{agarwal2024online,
  title={Online Learning under Haphazard Input Conditions: A Comprehensive Review and Analysis},
  author={Agarwal, Rohit and Das, Arijit and Horsch, Alexander and Agarwal, Krishna and Prasad, Dilip K},
  journal={arXiv preprint arXiv:2404.04903},
  year={2024}
}

Overview

This repository contains the datasets and implementation code of the different models for the paper titled "Online Learning under Haphazard Input Conditions: A Comprehensive Review and Analysis".

File Structure of the Directory

HaphazardInputsReview/
┣ Code/
┃ ┣ AnalyseResults/
┃ ┣ Config/
┃ ┣ DataCode/
┃ ┣ Models/
┃ ┣ main.py
┃ ┣ read_results.py
┃ ┗ requirements.txt
┣ Data/
┣ Results/
┣ .gitignore
┗ README.md

Datasets

We use 20 different datasets for this project. Links to all the datasets can be found below. Moreover, some of the datasets are also given in their respective folders inside the Data/ directory. To run them, please download the dataset files from the links given below and place them inside their respective directories (see instructions for each dataset below...).

Small Datasets


Note: Two more small datasets used for analysis, namely crowdsense(c3) and crowdsense(c5), are not provided because they are not publicly available.

Medium Datasets


Large Datasets


Raw Data Transformation

Some of the datasets need to be cleaned and processed before they can be used in the models for inference. Details on how to process those datasets are given below.

  • Spamassasin

    • Download the files from the link provided and unzip them.
    • Use the script Code\DataStorage\DataPreparation\data_spamassasin_conversion.py to clean the data.
    • Modify the 'path' variable at line 13 to the path of the directory where the unzipped files are located.
    • The data will automatically be saved in the appropriate directory.
  • IMDB

    • Download the files from the link provided and unzip them.
    • Use the script Code\DataStorage\DataPreparation\data_imdb_conversion.py to clean the data.
    • Modify the 'data_path' variable at line 10 to the path of the directory where the unzipped files are located.
    • The data will automatically be saved in the appropriate directory.
  • Diabetes_us

Dataset Preparation

Variable P

For synthetic datasets, we varied the availability of each auxiliary input feature independently by a uniform distribution of probability $p$, i.e., each auxiliary feature is available for $100p\%$ of the instances. For more information about this, see the Aux-Net paper (https://link.springer.com/chapter/10.1007/978-3-031-30105-6_46).
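As an illustration of the variable-$p$ scheme, each auxiliary feature of each instance can be dropped independently with probability $1-p$. The sketch below is a minimal example of this masking, not the repository's exact implementation; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def mask_auxiliary_features(X, p, seed=2023):
    """Simulate haphazard inputs: keep each auxiliary feature of each
    instance independently with probability p (variable-p scheme).

    Returns the masked data (NaN where unavailable) and the binary mask.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < p        # True where the feature is observed
    X_masked = np.where(mask, X, np.nan)  # unavailable entries become NaN
    return X_masked, mask

# Example: with p = 0.75, roughly 75% of entries remain available.
X = np.arange(12, dtype=float).reshape(4, 3)
X_masked, mask = mask_auxiliary_features(X, p=0.75)
```

With `p = 0.75` each feature is observed for roughly 75% of instances, matching the `--probavailable 0.75` runs shown later in this README.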

Files

To run the models, see Code/main.py. All the comparison models can be run from this file.
After running a model on a certain dataset, run Code/read_results.py to display and save the evaluation in csv format.

Control Parameters

For main.py file,

  1. seed : Seed value
    default = 2023

  2. type: The type of the experiment
    default="noassumption", type=str,
    choices = ["noassumption", "basefeatures", "bufferstorage"]

Data Variables


  1. dataname: The name of the dataset
    default = "wpbc"
    choices = ["all", "synthetic", "crowdsense_c5", "crowdsense_c3", "spamassasin", "imdb", "diabetes_us", "higgs", "susy", "a8a", "magic04", "spambase", "krvskp", "svmguide3", "ipd", "german", "diabetes_f", "wbc", "australian", "wdbc", "ionosphere", "wpbc"]

  2. syndatatype: The type to create suitable synthetic dataset
    default = "variable_p"

  3. probavailable: The probability of each feature being available to create synthetic data
    default = 0.5, type = float,

  4. ifbasefeat: If base features are available
    default = False

Method Variables


  1. methodname: The name of the method (model)
    default = "nb3"
    choices = ["nb3", "fae", "olvf", "ocds", "ovfm", "dynfo", "orf3v", "auxnet", "auxdrop"]

  2. initialbuffer: The storage size of the initial buffer for training
    default = 0

  3. ifimputation: If some features need to be imputed
    default = False

  4. imputationtype: The type of imputation technique to create base features
    default = 'forwardfill'
    choices = ['forwardfill', 'forwardmean', 'zerofill']

  5. nimputefeat: The number of imputation features
    default = 2

  6. ifdummyfeat: If some dummy features need to be created
    default = False

  7. dummytype: The type of technique to create dummy base features
    default = 'standardnormal'

  7. ndummyfeat: The number of dummy features to create
    default = 1

  9. ifAuxDropNoAssumpArchChange: If the Aux-Drop architecture needs to be changed to handle the no-assumption case
    default = False

  10. nruns: The number of times a method should run (for naive Bayes, it is 1 because the method is deterministic)
    default = 5
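To make the imputation and dummy-feature options concrete, here is a hedged sketch of how the three imputation types (`forwardfill`, `forwardmean`, `zerofill`) and a standard-normal dummy base feature could be realized for a single incoming instance. The function names and signatures are illustrative, not the repository's API.

```python
import numpy as np

def impute(x, history, imputationtype="forwardfill"):
    """Fill missing (NaN) entries of instance x using past observations.

    history: list of previously seen instances (NaN where unavailable).
    """
    x = x.copy()
    missing = np.isnan(x)
    if imputationtype == "zerofill":
        x[missing] = 0.0  # simply substitute zero
    elif imputationtype == "forwardfill":
        # use the most recently observed value of each feature, else 0
        for j in np.where(missing)[0]:
            past = [h[j] for h in reversed(history) if not np.isnan(h[j])]
            x[j] = past[0] if past else 0.0
    elif imputationtype == "forwardmean":
        # use the mean of all previously observed values of each feature
        for j in np.where(missing)[0]:
            past = [h[j] for h in history if not np.isnan(h[j])]
            x[j] = float(np.mean(past)) if past else 0.0
    return x

def dummy_features(ndummyfeat, rng):
    """Draw dummy base features (dummytype='standardnormal')."""
    return rng.standard_normal(ndummyfeat)
```

Such imputed or dummy features can stand in for the base features that Aux-Net originally required, which is what the `--ifimputation` and `--ifdummyfeat` flags enable.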

For read_results.py file,

  1. type: The type of the experiment
    default ="noassumption"
    choices = ["noassumption", "basefeatures", "bufferstorage"]

  2. dataname: The name of the dataset
    default = "wpbc"
    choices = ["synthetic", "real", "crowdsense_c5", "crowdsense_c3", "spamassasin", "imdb", "diabetes_us", "higgs", "susy", "a8a", "magic04", "spambase", "krvskp", "svmguide3", "ipd", "german", "diabetes_f", "wbc", "australian", "wdbc", "ionosphere", "wpbc"]

  3. probavailable: The probability of each feature being available to create synthetic data
    default = 0.5

  4. methodname: The name of the method
    default = "nb3"
    choices = ["nb3", "fae", "olvf", "ocds", "ovfm", "dynfo", "orf3v", "auxnet", "auxdrop"]

Dependencies

  1. numpy
  2. torch
  3. pandas
  4. tqdm
  5. tdigest (version == 0.5.2.2)
  6. statsmodels (version == 0.14.0)

(random, os, and pickle are also used but are part of the Python standard library and need no installation.)

Running the code

To run the models, change the control parameters accordingly in the main.py file and run

python Code/main.py

Example: To run model nb3 on the wpbc dataset, with probability of available features 0.75, use the command below

python Code/main.py --dataname wpbc --probavailable 0.75 --methodname nb3

Note: For auxnet, set either --ifimputation True or --ifdummyfeat True; for auxdrop, set --ifAuxDropNoAssumpArchChange True (these models were modified from their original implementations to support the absence of the previously required base features)

python Code/main.py --dataname ionosphere --probavailable 0.75 --methodname auxnet --ifimputation True

or

python Code/main.py --dataname ionosphere --probavailable 0.75 --methodname auxnet --ifdummyfeat True

and

python Code/main.py --dataname synthetic --probavailable 0.75 --methodname auxdrop --ifAuxDropNoAssumpArchChange True

To read the results and save them in .csv format, run read_results.py with appropriate control parameters.

python Code/read_results.py

Example: To read the results of nb3 on the wpbc dataset, with probability of available features 0.75, use the command below

python Code/read_results.py --dataname wpbc --probavailable 0.75 --methodname nb3

