GitHub - munozdjp/scAEGAN: scAEGAN- Data integration by learning non-linear mappings between distinct latent spaces while respecting different technologies, data-modalities, and experimental samples

scAEGAN- Unification of Single-Cell Genomics Data by Adversarial Learning of Latent Space Correspondences

This repository contains the scAEGAN code and online data for single-omics and multi-omics integration. It also contains code to evaluate and visualize the integration results, including metrics for quantifying output quality.

Summary
Installation Requisites
Datasets
Usage
Running Example

Summary

scAEGAN is a Python-based deep learning model designed for single-cell omics and multi-omics integration. It first uses an autoencoder to learn a low-dimensional embedding of each experiment independently, which respects each sample's uniqueness and protocol. A cycleGAN then learns a non-linear mapping between the two autoencoder representations.
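To make the second stage more concrete, here is a minimal sketch of the cycle-consistency idea that cycleGAN-style models rely on: a latent vector mapped from domain A to domain B and back again should land close to where it started. The 2-D linear maps `g_ab`, `f_good`, and `f_bad` and the toy embeddings are hypothetical stand-ins for the learned generators and autoencoder outputs. They are not taken from the scAEGAN code, whose actual networks and losses may differ.

```python
# Toy illustration of cycle consistency between two latent spaces.
# The 2x2 linear maps below are hypothetical stand-ins for the cycleGAN
# generators; the real model learns non-linear neural networks.

def apply(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cycle_loss(g_ab, f_ba, latents_a):
    """Mean absolute error between each a and f_ba(g_ab(a))."""
    total = 0.0
    count = 0
    for a in latents_a:
        a_back = apply(f_ba, apply(g_ab, a))
        total += sum(abs(x - y) for x, y in zip(a, a_back))
        count += len(a)
    return total / count

latents_a = [[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]]  # made-up embeddings
g_ab = [[2.0, 0.0], [0.0, 0.5]]    # maps domain A to domain B
f_good = [[0.5, 0.0], [0.0, 2.0]]  # exact inverse of g_ab
f_bad = [[1.0, 0.0], [0.0, 1.0]]   # identity, does not undo g_ab

loss_good = cycle_loss(g_ab, f_good, latents_a)
loss_bad = cycle_loss(g_ab, f_bad, latents_a)
print(loss_good, loss_bad)
```

A backward map that undoes the forward map gives zero cycle loss, while a mismatched pair is penalized. Training pushes both generators toward the first situation.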

scAEGAN Workflow

Installation Requisites

The required libraries are listed in the environment file. To install them, follow these steps:

First, create the conda environment with the following command. It installs the libraries listed in the environment.yml file, which are needed to train scAEGAN.

conda env create --prefix ./env --file environment.yml --force

The second step is to activate the conda environment.

conda activate ./env

Optionally, run R_libraries.R (tested with R version 4.1.1) to automatically install the libraries and dependencies required by the scripts in the Evaluation folder.
scAEGAN itself is installed simply by cloning the repository.

git clone https://github.com/sumeer1/scAEGAN.git

cd scAEGAN/

Datasets

Simulated data: Two datasets, each with 600 cells from 5 populations and 3000 genes, were simulated using SymSim (Zhang, X. et al.).

Real data: The pre-processed mouse hematopoietic stem cell dataset of young and old individuals was downloaded from https://github.com/quon-titative-biology/scalign.

Usage

Basic usage consists of two steps, both run after activating the conda environment.

First, train the autoencoder with the given parameters to obtain the latent representation of each input:

python AE.py --input_file1 <Specifies the domainA input file (cell by gene matrix in csv format)> \
             --input_file2 <Specifies the domainB input file (cell by gene matrix in csv format)>  \
             --output_file1 <Specifies the low dimensional representation of the input1 from the autoencoder> \
             --output_file2 <Specifies the low dimensional representation of the input2 from the autoencoder> \
             --batch_size <Specifies the batch size to train the autoencoder, default=16>  \
             --epochs <Specifies the number of epochs for which autoencoder is trained, default=200> \
             --dropout <Specifies the dropout rate used to train the autoencoder, default=0.2> \
             --learning_rate <Specifies the learning rate, default=0.0001>
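
When the autoencoder step is launched from a script (for example in a batch job), it can help to assemble the command programmatically. The sketch below builds the argument list from the flags and defaults documented above. The helper `build_ae_command` and the file names are hypothetical, introduced here only for illustration, and the command is assumed to be run from the directory that contains AE.py.

```python
import shlex

# Optional flags and their defaults, as documented for AE.py above.
AE_DEFAULTS = {
    "batch_size": 16,
    "epochs": 200,
    "dropout": 0.2,
    "learning_rate": 0.0001,
}

def build_ae_command(input_file1, input_file2, output_file1, output_file2,
                     **overrides):
    """Return the AE.py invocation as an argument list for subprocess.run."""
    unknown = sorted(set(overrides) - set(AE_DEFAULTS))
    if unknown:
        raise ValueError(f"unknown AE.py option(s): {unknown}")
    options = {**AE_DEFAULTS, **overrides}
    cmd = [
        "python", "AE.py",
        "--input_file1", input_file1,
        "--input_file2", input_file2,
        "--output_file1", output_file1,
        "--output_file2", output_file2,
    ]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_ae_command(
    "domain_A.csv", "domain_B.csv",
    "latent_A.csv", "latent_B.csv",
    epochs=50,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Rejecting unknown option names early avoids launching a long training run with a mistyped flag.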

Second, train the cycleGAN with the given parameters on the latent representations obtained from the autoencoder:

python cGANtrain.py --data_path <Specifies the folder path to the training and testing data> \
                    --train_file <Specifies the training files for training the cGAN for both domains (A and B) that are to be integrated. For instance --train_file domain_A.csv domain_B.csv> \
                    --test_file <Specifies the testing files. For instance --test_file domain_A.csv domain_B.csv> \
                    --save_path <Specifies the folder path where the output from the cGAN in the csv format will be saved> \
                    --input_shape <Specifies the shape of the input, default=50> \
                    --batch_size <Specifies the batch size, default=4> \
                    --epochs <Specifies the number of epochs for training cGAN, default=400>
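
Before starting a long cGAN run, it may be worth checking that the latent files match the value passed to `--input_shape`. The sketch below assumes that `--input_shape` is the number of latent features per cell and that each latent CSV holds one cell per row. Neither assumption is confirmed by this README, and the helper `latent_width` is hypothetical.

```python
import csv
import io

def latent_width(lines):
    """Count numeric fields in the first row that contains any number.

    Rows without numbers (e.g. a text header) are skipped, and
    non-numeric fields (e.g. a cell-ID column) are ignored. A numeric
    index column WOULD be counted, so drop it first if the file has one.
    """
    for row in csv.reader(lines):
        numeric = 0
        for field in row:
            try:
                float(field)
                numeric += 1
            except ValueError:
                pass  # not a number, e.g. a cell ID or column name
        if numeric:
            return numeric
    raise ValueError("no numeric row found")

# Toy in-memory file: a header, a cell-ID column, and 50 latent values.
header = ["cell"] + [f"z{i}" for i in range(50)]
row = ["cell_1"] + ["0.1"] * 50
toy = io.StringIO(",".join(header) + "\n" + ",".join(row) + "\n")

input_shape = 50  # documented default of --input_shape
width = latent_width(toy)
print(width == input_shape)
```

For a file on disk, open it with `open(path, newline="")` and pass the file object to `latent_width`.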

Running Example

In this tutorial we show how to run scAEGAN on the simulated data. The required input datasets are provided in the Simulated_Data folder.
scAEGAN has a command-line interface so that it can be run in a high-performance computing environment. Because scAEGAN is built with tensorflow/keras, we recommend running it on GPUs to significantly reduce run time. It has been tested on Linux and OS X platforms.
The experiments were performed on a Linux server with an Intel Xeon CPU E5-2680 v4 @ 2.40GHz processor, 128 GB RAM, and an NVIDIA Tesla V100 GPU.
For model training and evaluation, a vignette shows an example of how to run scAEGAN and carry out the benchmarking using the scripts in the Evaluation folder.
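
Since GPU training is recommended, a quick check of whether TensorFlow can see a GPU may save time before launching a run. The sketch below assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available. The version pinned in environment.yml is not stated here, and the helper `visible_gpu_count` is hypothetical.

```python
def visible_gpu_count():
    """Return how many GPUs TensorFlow can see, or None if it is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return len(tf.config.list_physical_devices("GPU"))

count = visible_gpu_count()
if count is None:
    print("TensorFlow is not installed in this environment")
elif count == 0:
    print("No GPU visible; training will run on the CPU")
else:
    print(f"{count} GPU(s) visible")
```

Run it inside the activated conda environment so that it reports on the same TensorFlow installation that scAEGAN will use.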

