scAEGAN: Unification of Single-Cell Genomics Data by Adversarial Learning of Latent Space Correspondences
This repository contains the scAEGAN code and online data for single-omics and multi-omics integration. It also contains code to evaluate and visualize the integration results, including metrics for quantifying output quality.
scAEGAN is a Python-based deep learning model designed for single-cell-omics and multi-omics integration. It first uses an autoencoder to learn a low-dimensional embedding of each experiment independently, respecting each sample's uniqueness and protocol. A cycleGAN then learns a non-linear mapping between the two autoencoder representations.
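To make the idea of a mapping between two latent spaces more concrete, the following toy sketch illustrates cycle consistency, the property that CycleGAN-style models are generally trained to encourage: mapping a latent vector from domain A to domain B and back should return something close to the original. The functions `g_a_to_b` and `f_b_to_a` are hypothetical affine stand-ins chosen for illustration. They are not scAEGAN's networks, and the actual losses used by scAEGAN are defined in the repository code.

```python
# Toy illustration of cycle consistency between two latent spaces.
# g_a_to_b and f_b_to_a are hypothetical stand-ins (simple affine maps),
# NOT the neural networks used by scAEGAN.

def g_a_to_b(z):
    """Map a domain-A latent vector into domain B (toy affine map)."""
    return [2.0 * v + 1.0 for v in z]


def f_b_to_a(z):
    """Map a domain-B latent vector back into domain A (inverse of the toy map)."""
    return [(v - 1.0) / 2.0 for v in z]


def cycle_loss(batch, forward, backward):
    """Mean absolute error between each vector and its A -> B -> A round trip."""
    total, count = 0.0, 0
    for z in batch:
        z_cycled = backward(forward(z))
        total += sum(abs(a - b) for a, b in zip(z, z_cycled))
        count += len(z)
    return total / count


# Two made-up 3-dimensional latent vectors standing in for autoencoder output.
latents_a = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]

# A perfectly inverse pair of mappings gives zero cycle loss.
print(cycle_loss(latents_a, g_a_to_b, f_b_to_a))

# A mismatched backward map leaves a non-zero reconstruction error.
print(cycle_loss(latents_a, g_a_to_b, lambda z: [v / 2.0 for v in z]))
```

In a real model the two mappings are learned networks, and a term of this kind is combined with adversarial losses from discriminators on each domain.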
The required libraries are listed in the environment file (environment.yml). To install them, follow these steps:
- Create the conda environment with the following command. This creates the environment and installs the libraries listed in environment.yml that are needed to train scAEGAN.
conda env create --prefix ./env --file environment.yml --force
- Activate the conda environment.
conda activate ./env
- Optionally, run R_libraries.R (tested with R version 4.1.1) to automatically install the libraries and dependencies required by the scripts in the Evaluation folder.
- scAEGAN itself is installed simply by cloning the repository.
git clone https://github.com/sumeer1/scAEGAN.git
cd scAEGAN/
Simulated data: Two datasets, each containing 600 cells from 5 populations and 3000 genes, were simulated using SymSim (Zhang, X et al.).
Real data: The pre-processed mouse hematopoietic stem cell dataset of young and old individuals was downloaded from https://github.com/quon-titative-biology/scalign.
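Before training, it can be useful to confirm that an input file has the expected cell-by-gene dimensions (for example, 600 cells by 3000 genes for each simulated dataset). The sketch below is a minimal check using only the standard library. It assumes the CSV has one header row of gene names and one leading column of cell identifiers; that layout is an assumption, so adjust `has_header` and `has_index` to match the actual files. The helper name `matrix_shape` is hypothetical.

```python
import csv
import io


def matrix_shape(handle, has_header=True, has_index=True):
    """Return (n_cells, n_genes) for a cell-by-gene CSV read from an open file handle.

    Assumes (adjust if your files differ) a header row of gene names and a
    first column of cell identifiers.
    """
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    if has_header:
        rows = rows[1:]
    n_cells = len(rows)
    n_genes = (len(rows[0]) - (1 if has_index else 0)) if rows else 0
    return n_cells, n_genes


# Tiny in-memory example with 2 cells and 3 genes.
demo = io.StringIO("cell,gene1,gene2,gene3\nc1,0,3,1\nc2,2,0,0\n")
print(matrix_shape(demo))
```

For a file on disk, the same function can be called as `matrix_shape(open(path, newline=""))`, where `path` points to one of the input matrices.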
Basic usage consists of two steps, run after activating the conda environment.
- Train the autoencoder with the given parameters to obtain the latent representations by running:
python AE.py --input_file1 <Specifies the domainA input file (cell by gene matrix in csv format)> \
--input_file2 <Specifies the domainB input file (cell by gene matrix in csv format)> \
--output_file1 <Specifies the low dimensional representation of the input1 from the autoencoder> \
--output_file2 <Specifies the low dimensional representation of the input2 from the autoencoder> \
--batch_size <Specifies the batch size to train the autoencoder, default=16> \
--epochs <Specifies the number of epochs for which autoencoder is trained, default=200> \
--dropout <Specifies the dropout rate used to train the autoencoder, default=0.2> \
--learning_rate <Specifies the learning rate, default=0.0001>
- Train the cycleGAN with the given parameters on the latent representations obtained from the autoencoder by running:
python cGANtrain.py --data_path <Specifies the folder path to the training and testing data> \
--train_file <Specifies the training files for training the cGAN for both domains (A and B) that are to be integrated. For instance --train_file domain_A.csv domain_B.csv> \
--test_file <Specifies the testing files. For instance --test_file domain_A.csv domain_B.csv> \
--save_path <Specifies the folder path where the output from the cGAN in the csv format will be saved> \
--input_shape <Specifies the shape of the input, default=50> \
--batch_size <Specifies the batch size, default=4> \
--epochs <Specifies the number of epochs for training cGAN, default=400>
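The two steps above can also be scripted. The sketch below only assembles the command lines from the flags and default values listed above and prints them. All file and folder names are hypothetical placeholders, and it is an assumption that the names passed to `--train_file` and `--test_file` are resolved relative to `--data_path`. The helper names `build_ae_command` and `build_cgan_command` are not part of the repository.

```python
import shlex
import sys


def build_ae_command(input_a, input_b, latent_a, latent_b,
                     batch_size=16, epochs=200, dropout=0.2, learning_rate=0.0001):
    """Argument list for step 1 (AE.py), using the flags and defaults listed above."""
    return [
        sys.executable, "AE.py",
        "--input_file1", input_a,
        "--input_file2", input_b,
        "--output_file1", latent_a,
        "--output_file2", latent_b,
        "--batch_size", str(batch_size),
        "--epochs", str(epochs),
        "--dropout", str(dropout),
        "--learning_rate", str(learning_rate),
    ]


def build_cgan_command(data_path, train_files, test_files, save_path,
                       input_shape=50, batch_size=4, epochs=400):
    """Argument list for step 2 (cGANtrain.py), one file per domain (A and B)."""
    if len(train_files) != 2 or len(test_files) != 2:
        raise ValueError("expected exactly one file per domain (A and B)")
    return [
        sys.executable, "cGANtrain.py",
        "--data_path", data_path,
        "--train_file", *train_files,
        "--test_file", *test_files,
        "--save_path", save_path,
        "--input_shape", str(input_shape),
        "--batch_size", str(batch_size),
        "--epochs", str(epochs),
    ]


# Hypothetical file and folder names, chosen only for illustration.
ae_cmd = build_ae_command(
    "domain_A_counts.csv", "domain_B_counts.csv",
    "latent/domain_A.csv", "latent/domain_B.csv",
)
cgan_cmd = build_cgan_command(
    "latent",
    ["domain_A.csv", "domain_B.csv"],
    ["domain_A.csv", "domain_B.csv"],
    "results",
)

for cmd in (ae_cmd, cgan_cmd):
    # Print the shell-quoted command so it can be inspected before running.
    print(shlex.join(cmd))
    # To launch it from the repository root: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to confirm that the latent files written by the autoencoder step are the ones passed to the cycleGAN step.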
- This tutorial shows how to run scAEGAN on the simulated data. The required input datasets are provided in the Simulated_Data folder.
- We created a command-line interface for scAEGAN that allows it to be run in a high-performance computing environment. Because scAEGAN is built with TensorFlow/Keras, we recommend running it on GPUs to significantly reduce run time. It has been tested on Linux and OS X platforms.
- The experiments were performed on a Linux server using an Intel Xeon CPU E5-2680 v4 @ 2.40GHz processor with 128 GB RAM and an NVIDIA Tesla V100 GPU.
- For model training and evaluation, a vignette presents an example of how to run scAEGAN and carry out the benchmarking using the scripts in the Evaluation folder.
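Since GPU training is recommended above, a quick check that TensorFlow can see a GPU may save time before a long run. This is a minimal sketch that assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available; the version pinned in environment.yml may differ. The helper name `visible_gpus` is hypothetical.

```python
def visible_gpus():
    """Return the names of GPUs visible to TensorFlow, or [] if TensorFlow is not installed."""
    try:
        import tensorflow as tf  # imported lazily so the check also runs without TensorFlow
    except ImportError:
        return []
    # Assumes TensorFlow >= 2.1; older releases expose this under tf.config.experimental.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus:
    print(f"{len(gpus)} GPU(s) visible to TensorFlow: {', '.join(gpus)}")
else:
    print("No GPU visible to TensorFlow (or TensorFlow is not installed)")
```

If no GPU is reported, training can still run on the CPU, but it is likely to be considerably slower.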