chopin2

Supervised Classification with Hyperdimensional Computing.

Originally forked from https://github.com/moimani/HD-Permutaion

This repository includes some Python 3.8 utilities to build a Hyperdimensional Computing classification model according to the architecture originally introduced in https://doi.org/10.1109/DAC.2018.8465708
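
As a rough illustration of what a Hyperdimensional Computing classifier does, the following minimal sketch encodes numeric samples into hypervectors, bundles them into one prototype per class, and predicts by cosine similarity. It is not the chopin2 implementation: every function name here is hypothetical, and the encoding (level hypervectors rotated by feature position) is only one common scheme, assumed for the sake of the example.

```python
import random

def random_hypervector(dim, rng):
    """Return a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def make_level_hypervectors(levels, dim, rng):
    """Build level hypervectors so that neighbouring levels stay similar.

    Each level flips a fresh batch of positions of the previous one, so the
    first and last levels end up nearly orthogonal.
    """
    order = list(range(dim))
    rng.shuffle(order)
    flips_per_level = dim // (2 * max(levels - 1, 1))
    vectors = [random_hypervector(dim, rng)]
    for level in range(1, levels):
        vec = list(vectors[-1])
        start = (level - 1) * flips_per_level
        for pos in order[start:start + flips_per_level]:
            vec[pos] = -vec[pos]
        vectors.append(vec)
    return vectors

def quantize(value, lo, hi, levels):
    """Map a numeric value in [lo, hi] to a level index in [0, levels - 1]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * levels)
    return min(max(idx, 0), levels - 1)

def rotate(vec, shift):
    """Cyclically rotate a vector to the right by `shift` positions."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)

def encode(sample, level_hvs, lo, hi):
    """Sum the level hypervector of each feature, rotated by its position."""
    acc = [0] * len(level_hvs[0])
    for position, value in enumerate(sample):
        hv = rotate(level_hvs[quantize(value, lo, hi, len(level_hvs))], position)
        acc = [a + h for a, h in zip(acc, hv)]
    return acc

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def train(samples, labels, level_hvs, lo, hi):
    """Bundle the encoded samples of each class into a class hypervector."""
    model = {}
    for sample, label in zip(samples, labels):
        encoded = encode(sample, level_hvs, lo, hi)
        proto = model.get(label, [0] * len(encoded))
        model[label] = [p + e for p, e in zip(proto, encoded)]
    return model

def predict(sample, model, level_hvs, lo, hi):
    """Return the label whose class hypervector is most similar to the sample."""
    query = encode(sample, level_hvs, lo, hi)
    return max(model, key=lambda label: cosine(query, model[label]))

# Toy data, made up for this example only.
rng = random.Random(0)
level_hvs = make_level_hypervectors(levels=10, dim=1000, rng=rng)
X = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.9, 0.8, 0.9], [0.8, 0.9, 0.8]]
y = ["low", "low", "high", "high"]
model = train(X, y, level_hvs, lo=0.0, hi=1.0)
print(predict([0.15, 0.15, 0.15], model, level_hvs, 0.0, 1.0))  # low
```

In this sketch, `dim` and `levels` play the roles that the hypervector dimensionality and the number of level hypervectors play in an HD model: a larger dimensionality makes unrelated hypervectors closer to orthogonal, and more levels give a finer quantization of the feature values.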

The src/generators folder contains two Python 3.8 scripts that create training and test datasets with randomly selected samples from:

BRCA, KIRP, and THCA DNA-Methylation data from the paper Classification of Large DNA Methylation Datasets for Identifying Cancer Drivers by Fabrizio Celli, Fabio Cumbo, and Emanuel Weitschek;
Gene-expression quantification and Methylation Beta Value experiments provided by OpenGDC for all the 33 different types of tumors of the TCGA program.

Due to their size, the datasets are not included in this repository but can be retrieved from:

ftp://bioinformatics.iasi.cnr.it/public/bigbiocl_dna-meth_data/
http://geco.deib.polimi.it/opengdc/ and https://github.com/cumbof/OpenGDC/

The isolet dataset comes from the original forked repository and has been kept to provide a simple toy model for testing purposes only.

Install

We deployed chopin2 as a Python 3.8 package that can be installed through pip and conda, as well as a Docker image.

Please use one of the following commands to start playing with chopin2:

# Install chopin2 with pip
pip install chopin2

# Install chopin2 with conda
conda install -c conda-forge chopin2

# Initialise the Docker image
docker build -t chopin2 .
docker run -it chopin2

Please note that chopin2 is also available as a Galaxy tool. Its wrapper is available under the official Galaxy ToolShed at https://toolshed.g2.bx.psu.edu/view/fabio/chopin2

Usage

Once installed, you are ready to start playing with chopin2.

Run the following command to try chopin2 on the isolet dataset:

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --nproc 4 \
        --verbose
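
The --psplit_training 80 option in the command above reserves 80% of the observations for training and leaves the remaining 20% for testing. A minimal sketch of such a percentage split follows. It assumes a plain seeded shuffle, so the function name and the sampling strategy are illustrative and may differ from what chopin2 actually does.

```python
import random

def split_observations(observations, psplit_training, seed=0):
    """Shuffle observations reproducibly and split them by percentage."""
    if not 0 < psplit_training < 100:
        raise ValueError("psplit_training must be between 0 and 100 (exclusive)")
    shuffled = list(observations)
    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * psplit_training / 100)
    return shuffled[:cut], shuffled[cut:]

training, test = split_observations(range(10), psplit_training=80, seed=0)
print(len(training), len(test))  # 8 2
```

Calling the function twice with the same seed returns the same partition, which is the purpose of a seed argument in this kind of sampling.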

In order to run it on Spark, other arguments must be specified:

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --spark \
        --slices 10 \
        --master local \
        --memory 2048m \
        --verbose

List of standard arguments:

--dimensionality    -- Dimensionality of the HD model (default 10000)
--levels            -- Number of level hypervectors (default 2)
--retrain           -- Number of retraining iterations (default 0)
--stop              -- Stop retraining if the error rate does not change (default False)
--dataset           -- Path to the dataset file
--fieldsep          -- Field separator (default ",")
--psplit_training   -- Percentage of observations that will be used to train the model. 
                       The remaining percentage will be used to test the classification model
--crossv_k          -- Number of folds for cross validation.
                       Cross validate HD models if --crossv_k is greater than 1
--seed              -- Seed for reproducing random sampling of the observations in the dataset 
                       and build both the training and test set (default 0)
--pickle            -- Path to the pickle file. If specified, "--dataset", "--fieldsep", and "--training" parameters are not used
--dump              -- Build a summary and log files (default False)
--cleanup           -- Delete the classification model as soon as it produces the prediction accuracy (default False)
--keep_levels       -- Do not delete the level hypervectors. It works in conjunction with --cleanup only (default True)
--nproc             -- Number of parallel jobs for the creation of the HD model.
                       This argument is ignored if --spark is enabled (default 1)
--verbose           -- Print results in real time (default False)
--cite              -- Print references and exit
-v, --version       -- Print the current chopin2.py version and exit
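
To make the interplay between --retrain and --stop more concrete, here is a minimal sketch of a retraining loop as it is commonly described for HD classifiers: each misclassified training hypervector is added to its correct class and subtracted from the wrongly predicted one. The function names, the tiny hand-made hypervectors, and the deliberately poor starting model are assumptions made for the example and are not taken from the chopin2 code.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    na, nb = dot(a, a) ** 0.5, dot(b, b) ** 0.5
    return dot(a, b) / (na * nb) if na and nb else 0.0

def classify(hv, model):
    """Return the label of the most similar class hypervector."""
    return max(model, key=lambda label: cosine(hv, model[label]))

def retrain(model, encoded, labels, iterations, stop=False):
    """Refine class hypervectors in place; return the error rate of each pass."""
    history = []
    for _ in range(iterations):
        errors = 0
        for hv, label in zip(encoded, labels):
            guess = classify(hv, model)
            if guess != label:
                errors += 1
                # Pull the correct class towards the sample, push the wrong one away.
                model[label] = [m + x for m, x in zip(model[label], hv)]
                model[guess] = [m - x for m, x in zip(model[guess], hv)]
        rate = errors / len(encoded)
        unchanged = bool(history) and rate == history[-1]
        history.append(rate)
        if stop and unchanged:
            break  # the error rate did not change since the previous pass
    return history

# Already-encoded training hypervectors and a deliberately poor starting model.
encoded = [[1, 1, -1, -1], [-1, -1, 1, 1]]
labels = ["a", "b"]
model = {"a": [0, 0, 1, 1], "b": [1, 1, 0, 0]}
print(retrain(model, encoded, labels, iterations=10, stop=True))  # [0.5, 0.0, 0.0]
```

With `stop=True` the loop ends after three passes even though ten were requested, because the error rate is the same in the second and third pass.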

List of arguments to enable backward variable selection:

--features                     -- Path to a file with a single column containing the whole set or a subset of features
--select_features              -- This triggers the backward variable selection method for the identification of the most significant features.
                                  Warning: computationally intense!
--group_min                    -- Minimum number of features among those specified with the --features argument (default 1)
--accuracy_threshold           -- Stop the execution if the best accuracy achieved during the previous group of runs is lower than this number (default 60.0)
--accuracy_uncertainty_perc    -- Also take a run into account if its accuracy is lower than the best accuracy achieved in the same group by no more than "accuracy_uncertainty_perc" percent
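
The arguments above can be read as the knobs of a backward elimination loop. The sketch below is one plausible interpretation, not the chopin2 algorithm itself: it assumes that each group of runs removes one feature from every surviving subset, that runs within the uncertainty percentage of the group's best accuracy survive, and that the search ends when the best accuracy of a group falls below the threshold. The `evaluate` callback, the default of 5.0 for the uncertainty, and the toy scoring function are all hypothetical.

```python
def backward_selection(features, evaluate, group_min=1,
                       accuracy_threshold=60.0, accuracy_uncertainty_perc=5.0):
    """Return the last group of feature subsets that met the accuracy threshold.

    `evaluate` is a hypothetical callback that trains and tests a model on a
    subset of features and returns its accuracy as a percentage.
    """
    survivors = [tuple(features)]
    while len(survivors[0]) > group_min:
        # Next group: every surviving subset with exactly one feature removed.
        candidates = {s[:i] + s[i + 1:] for s in survivors for i in range(len(s))}
        scored = {subset: evaluate(subset) for subset in candidates}
        best = max(scored.values())
        if best < accuracy_threshold:
            break  # this group is not accurate enough: keep the previous one
        cutoff = best - best * accuracy_uncertainty_perc / 100.0
        survivors = sorted(s for s, acc in scored.items() if acc >= cutoff)
    return survivors

# Toy scoring function: only f1 and f3 carry signal (made up for the example).
informative = {"f1", "f3"}

def toy_accuracy(subset):
    return 50.0 + 20.0 * len(informative & set(subset))

result = backward_selection(["f1", "f2", "f3", "f4"], toy_accuracy,
                            group_min=1, accuracy_threshold=80.0)
print(result)  # [('f1', 'f3')]
```

Every group requires one model per candidate subset, which is why this kind of search becomes computationally intense as the number of features grows.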

List of arguments for the execution of the classifier on a Spark distributed environment:

--spark     -- Build the classification model in an Apache Spark distributed environment
--slices    -- Number of slices in case --spark argument is enabled. 
               This argument is ignored if --gpu is enabled
--master    -- Master node address
--memory    -- Executor memory

List of arguments for the execution of the classifier on NVIDIA-powered GPUs:

--gpu       -- Build the classification model on an NVIDIA-powered GPU. 
               This argument is ignored if --spark is specified
--tblock    -- Number of threads per block in case --gpu argument is enabled. 
               This argument is ignored if --spark is enabled
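
For readers unfamiliar with GPU launch parameters, the number of threads per block determines how many blocks are needed to cover a given amount of work. The following sketch shows the usual ceiling division. It assumes one thread per hypervector component, which is an illustrative choice and not a documented detail of how chopin2 maps work to the GPU; the function name is hypothetical.

```python
def blocks_per_grid(n_elements, tblock):
    """Number of blocks needed so that n_elements threads fit in blocks of size tblock."""
    if n_elements < 0 or tblock <= 0:
        raise ValueError("n_elements must be >= 0 and tblock must be > 0")
    # Integer ceiling division: the last block may be only partially used.
    return (n_elements + tblock - 1) // tblock

print(blocks_per_grid(10000, 32))  # 313
```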

Credits

Please credit our work in your manuscript by citing:

Fabio Cumbo, Eleonora Cappelli, and Emanuel Weitschek, "A brain-inspired hyperdimensional computing approach for classifying massive DNA methylation data of cancer", MDPI Algorithms, 2020 https://doi.org/10.3390/a13090233

Fabio Cumbo, Emanuel Weitschek, and Daniel Blankenberg, "hdlib: A Python library for designing Vector-Symbolic Architectures", Journal of Open Source Software, 2023 https://doi.org/10.21105/joss.05704

Do not forget to also cite the following paper from which this work takes inspiration:

Mohsen Imani, Chenyu Huang, Dequian Kong, Tajana Rosing, "Hierarchical Hyperdimensional Computing for Energy Efficient Classification", IEEE/ACM Design Automation Conference (DAC), 2018 https://doi.org/10.1109/DAC.2018.8465708
