This project is part of the Pilot implementation of annotation of single-cell RNA-seq data guided by AI project from the Swiss Institute of Bioinformatics (SIB), which aims to develop and implement a machine learning approach to guide cell type classification for large datasets in the ASAP (Automated Single-cell Analysis Portal) and Bgee (SIB's gene expression database) resources. The goal is to benchmark both single-label and path (multi-label) annotation pipelines on scRNA-seq datasets from the Bgee and ASAP resources. We compared four preprocessing methods covering covariate removal and dimension reduction, then calibrated and compared five flat models, two local models, and two global models. A report with detailed explanations can be found here on Overleaf.
The project is currently being built on the Euler cluster. The configuration steps are as follows.
- Log in to your server via `ssh` and `git clone` the repository to `$HOME`.
Load the modules by running the command
source /cluster/apps/local/env2lmod.sh
set_software_stack.sh new
module load gcc/8.2.0
module load python/3.10.4
module load hdf5/1.10.1
- Create a new virtual environment by running:

  ```bash
  python -m venv ~/sib_ai_benchmark/.venv
  ```
- Activate the virtual environment by running the following command. This will change your shell prompt to indicate that you are now working inside the virtual environment. Run `deactivate` to exit the virtual environment.

  ```bash
  source ~/sib_ai_benchmark/.venv/bin/activate
  ```
- Install and update required packages by running:

  ```bash
  pip3 install --upgrade -r ~/sib_ai_benchmark/requirements.txt
  ```
- Submit the application to the batch system with the following command. Replace `~/sib_ai_benchmark/src/app.py` with your own script if you only want to test the configuration.

  ```bash
  bsub -n 10 -W 24:00 -o log -R "rusage[mem=2048]" python ~/sib_ai_benchmark/src/app.py
  ```
- Enable the environment upon the next login session and set an alias by running:

  ```bash
  cat <<EOF >> ~/.bashrc
  module load gcc/8.2.0
  module load python/3.10.4
  module load hdf5/1.10.1
  source ~/sib_ai_benchmark/.venv/bin/activate
  alias prun='bsub -n 4 -W 24:00 -o log -R "rusage[mem=4096]" python'
  EOF
  ```
- Reload `.bashrc` by running:

  ```bash
  source ~/.bashrc
  ```
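To verify that the configuration is picked up correctly, a small sanity check like the sketch below can be run inside the activated environment (this is only an illustrative check, not part of the project's scripts):

```python
import sys

# The interpreter path should point into the project's virtual environment
# (e.g. ~/sib_ai_benchmark/.venv), and the version should match the loaded
# python/3.10.4 module.
print(sys.executable)
print(sys.version)
```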
The benchmark architecture is introduced in a report, which can be found here on Overleaf.
Processed data for each pre-processing method, as well as the hierarchy information, are stored here on Google Drive. gDrive can be used for downloading. The data should be placed under the `data-raw` folder as listed below.
```
data-raw
├── pca_
├── scanvi_bcm
├── ...
└── sib_cell_type_hierarchy.tsv
```
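As a quick sanity check that the downloaded data landed in the expected layout, a small sketch like the following can be run from the `sib_ai_benchmark` folder (the listed names are the ones shown above; extend the list with any other pre-processing folders you downloaded):

```python
from pathlib import Path

# Check that the expected folders/files exist under data-raw.
data_raw = Path("data-raw")
for name in ["pca_", "scanvi_bcm", "sib_cell_type_hierarchy.tsv"]:
    status = "found" if (data_raw / name).exists() else "MISSING"
    print(f"{name}: {status}")
```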
The following commands should be run from the `sib_ai_benchmark` folder.
Compare the preprocessing methods on flat models. The commands below should be run separately.
```bash
python src/app.py -e scanvi_bcm -m flat
python src/app.py -e scanvi_b -m flat
python src/app.py -e scanvi_ -m flat
python src/app.py -e pca_ -m flat
```
Compare annotation labels across flat, local, and global models.
```bash
python src/app.py -e scanvi_bcm -m flat local global
```
To reuse the flat-model results from the run above, the command can be simplified so that the flat models are not run again.
```bash
python src/app.py -e scanvi_bcm -m local global
```
Compare models on path evaluation.
```bash
python src/app.py -e scanvi_bcm -m local global -p
```
Run a single model or multiple models with a pre-processing method.
```bash
python src/app.py -e scanvi_bcm -m NeuralNet
python src/app.py -e scanvi_bcm -m NeuralNet LinearSVM
```
Upon completion, the app generates a pickle object and a PDF file. The pickle object encapsulates a dictionary containing the detailed benchmarking results, while the PDF file visualizes a selected metric through a plot. Additionally, a log file is created when the app launches, enabling real-time monitoring of results.
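The results pickle can be inspected interactively; a minimal sketch is shown below (the file name is only a placeholder, since the actual output name depends on the run):

```python
import pickle

# "results.pkl" is a placeholder; use the pickle file produced by your run.
with open("results.pkl", "rb") as fh:
    results = pickle.load(fh)

# The object is a dictionary of detailed benchmarking results;
# listing the top-level keys shows what was recorded.
print(list(results.keys()))
```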
A new model can be added to one of the folders under the parent folder `~/sib_ai_benchmark/src/models`: `flatModels`, `globalModels`, or `localModels`. It should be wrapped under a specific name, which can then be referenced when running the app. Please check an existing model file for a usage example.
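As an illustration only, a new flat model wrapper might look like the sketch below; the file name, class layout, and method names are assumptions, so copy the actual wrapper contract from an existing file under `src/models` rather than from this sketch.

```python
# Hypothetical file: src/models/flatModels/my_svm.py
# The interface sketched here (a named, scikit-learn style estimator) is an
# assumption; follow an existing model file for the real wrapper contract.
from sklearn.svm import LinearSVC


class MySVMWrapper:
    name = "MySVM"  # assumed: the name referenced with -m when running the app

    def __init__(self):
        self.model = LinearSVC()

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)
```

If the wrapper is registered under that name, it could then be selected with, e.g., `python src/app.py -e scanvi_bcm -m MySVM`.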
To add a new dataset, add an entry to the `experiments` dictionary in the `cfg.py` file under `~/sib_ai_benchmark/src/config`. The dictionary key can then be used to run a specific method with the new dataset, as in the previous section.
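For illustration, such an entry might look like the sketch below; the field names are assumptions and should be copied from an existing entry in `cfg.py`.

```python
# In src/config/cfg.py -- sketch only; mirror an existing entry for the
# actual fields expected by the app.
experiments = {
    # ... existing entries such as "scanvi_bcm", "pca_", ...
    "my_new_dataset": {
        # Hypothetical fields pointing to the files placed under data-raw
        "data_path": "data-raw/my_new_dataset",
        "hierarchy": "data-raw/sib_cell_type_hierarchy.tsv",
    },
}
```

The new key can then be used on the command line as in the previous section, e.g. `python src/app.py -e my_new_dataset -m flat`.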
The analysis of the benchmarking results is organized as a report which can be found here on Overleaf.