This project is part of the Pilot implementation of annotation of single-cell RNA-seq data guided by AI project from the Swiss Institute of Bioinformatics (SIB), which aims to develop and implement a machine learning approach to guide cell type classification for large datasets in the ASAP (Automated Single-cell Analysis Portal) and Bgee (SIB's gene expression database) resources. The goal is to benchmark both single-label and path (multi-label) annotation pipelines on scRNA-seq datasets from the Bgee and ASAP resources. We compared four preprocessing methods covering covariate removal and dimension reduction, then calibrated and compared five flat models, two local models, and two global models. A report with detailed explanations can be found here on Overleaf.
The project is currently being built on the Euler cluster. The configuration steps are as follows.
- Log in to your server via `ssh` and `git clone` the repository to `$HOME`.
Load the modules by running the command
source /cluster/apps/local/env2lmod.sh
set_software_stack.sh new
module load gcc/8.2.0
module load python/3.10.4
module load hdf5/1.10.1
- Create a new virtual environment by running:

  ```bash
  python -m venv ~/sib_ai_benchmark/.venv
  ```
- Activate the virtual environment by running the following command. This will change your shell prompt to indicate that you are now working inside the virtual environment. Run `deactivate` to exit the virtual environment.

  ```bash
  source ~/sib_ai_benchmark/.venv/bin/activate
  ```
- Install and update required packages by running:

  ```bash
  pip3 install --upgrade -r ~/sib_ai_benchmark/requirements.txt
  ```
- Submit the application to the batch system with the following command. Replace `~/sib_ai_benchmark/src/app.py` with your own script if you only want to test the configuration.

  ```bash
  bsub -n 10 -W 24:00 -o log -R "rusage[mem=2048]" python ~/sib_ai_benchmark/src/app.py
  ```
- Enable the environment upon the next login session and set an alias by running:

  ```bash
  cat <<EOF >> ~/.bashrc
  module load gcc/8.2.0
  module load python/3.10.4
  module load hdf5/1.10.1
  source ~/sib_ai_benchmark/.venv/bin/activate
  alias prun='bsub -n 4 -W 24:00 -o log -R "rusage[mem=4096]" python'
  EOF
  ```
- Reload `.bashrc` by running:

  ```bash
  source ~/.bashrc
  ```
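To verify that the configuration is picked up correctly, a small sanity check like the sketch below can be run inside the activated environment (this is only an illustrative check, not part of the project's scripts):

```python
import sys

# The interpreter path should point into the project's virtual environment
# (e.g. ~/sib_ai_benchmark/.venv), and the version should match the loaded
# python/3.10.4 module.
print(sys.executable)
print(sys.version)
```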
The benchmark architecture is introduced in a report, which can be found here on Overleaf.
Processed data for each pre-processing method, as well as the hierarchy information, are stored here on Google Drive. gDrive can be used for downloading. The data should be placed under the `data-raw` folder as listed below.
```
data-raw
├── pca_
├── scanvi_bcm
├── ...
└── sib_cell_type_hierarchy.tsv
```
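As a quick sanity check that the downloaded data landed in the expected layout, a small sketch like the following can be run from the `sib_ai_benchmark` folder (the listed names are the ones shown above; extend the list with any other pre-processing folders you downloaded):

```python
from pathlib import Path

# Check that the expected folders/files exist under data-raw.
data_raw = Path("data-raw")
for name in ["pca_", "scanvi_bcm", "sib_cell_type_hierarchy.tsv"]:
    status = "found" if (data_raw / name).exists() else "MISSING"
    print(f"{name}: {status}")
```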
The following commands should be run from the `sib_ai_benchmark` folder.
Compare the preprocessing methods on flat models. The commands below should be run separately.
```bash
python src/app.py -e scanvi_bcm -m flat
python src/app.py -e scanvi_b -m flat
python src/app.py -e scanvi_ -m flat
python src/app.py -e pca_ -m flat
```
Compare annotation labels across flat, local, and global models.
```bash
python src/app.py -e scanvi_bcm -m flat local global
```
To reuse the flat-model results from the run above, the command can be simplified so that the flat models are not run again.
```bash
python src/app.py -e scanvi_bcm -m local global
```
Compare models on path evaluation.
```bash
python src/app.py -e scanvi_bcm -m local global -p
```
Run a single model or multiple models with a pre-processing method.
```bash
python src/app.py -e scanvi_bcm -m NeuralNet
python src/app.py -e scanvi_bcm -m NeuralNet LinearSVM
```
Upon completion, the app generates a pickle object and a PDF file. The pickle object encapsulates a dictionary containing the detailed benchmarking results, while the PDF file visualizes a selected metric through a plot. Additionally, a log file is created when the app launches, enabling real-time monitoring of results.
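The results pickle can be inspected interactively; a minimal sketch is shown below (the file name is only a placeholder, since the actual output name depends on the run):

```python
import pickle

# "results.pkl" is a placeholder; use the pickle file produced by your run.
with open("results.pkl", "rb") as fh:
    results = pickle.load(fh)

# The object is a dictionary of detailed benchmarking results;
# listing the top-level keys shows what was recorded.
print(list(results.keys()))
```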
A new model can be added to one of the folders under the parent folder `~/sib_ai_benchmark/src/models`: `flatModels`, `globalModels`, or `localModels`. It should be wrapped under a specific name, which can then be referenced when running the app. Please check an existing model file for a usage example.
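As an illustration only, a new flat model wrapper might look like the sketch below; the file name, class layout, and method names are assumptions, so copy the actual wrapper contract from an existing file under `src/models` rather than from this sketch.

```python
# Hypothetical file: src/models/flatModels/my_svm.py
# The interface sketched here (a named, scikit-learn style estimator) is an
# assumption; follow an existing model file for the real wrapper contract.
from sklearn.svm import LinearSVC


class MySVMWrapper:
    name = "MySVM"  # assumed: the name referenced with -m when running the app

    def __init__(self):
        self.model = LinearSVC()

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)
```

If the wrapper is registered under that name, it could then be selected with, e.g., `python src/app.py -e scanvi_bcm -m MySVM`.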
To add a new dataset, add an entry to the `experiments` dictionary in the `cfg.py` file under `~/sib_ai_benchmark/src/config`. The dictionary key can then be used to run a specific method with the new dataset, as in the previous section.
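For illustration, such an entry might look like the sketch below; the field names are assumptions and should be copied from an existing entry in `cfg.py`.

```python
# In src/config/cfg.py -- sketch only; mirror an existing entry for the
# actual fields expected by the app.
experiments = {
    # ... existing entries such as "scanvi_bcm", "pca_", ...
    "my_new_dataset": {
        # Hypothetical fields pointing to the files placed under data-raw
        "data_path": "data-raw/my_new_dataset",
        "hierarchy": "data-raw/sib_cell_type_hierarchy.tsv",
    },
}
```

The new key can then be used on the command line as in the previous section, e.g. `python src/app.py -e my_new_dataset -m flat`.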
The analysis of the benchmarking results is organized as a report which can be found here on Overleaf.