ArtificialGenes

This repository contains code for using generative deep learning to create sequences of artificial DNA
The work in this repository is expanding on the work in the paper below and reuses some code: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009303

To run the original vanilla GAN from that paper:

slurm/tf_gan.sh

To run a pytorch implementation of the vanilla GAN:

slurm/run_gan.sh input_file output_dir 200

To run a Wasserstein GAN with Gradient Penalty:

slurm/run_wgan.sh input_file output_dir 200

To run a Conditional GAN:

slurm/run_cgan.sh populations populations_output 200

To parse data from the 1000 genomes dataset (consult script for all parsing options):

python3 src/vcf_data_extractor.py --chrom_list 14,15,16 --odir multi_chrom_output --population YRI

To convert simulated data to .csv data:

python3 src/simulated_data_extractor.py --idir simulated_data --odir unzipped_data

To format simulated .csv data to your preferred specifications:

python3 src/simulated_data_processing.py --idir unzipped_data --length 1000 --ofile full_simulated_data.csv

Ideas for next steps:

Find highly divergent data and train a CGAN to successful generate samples of each class
Train Controllable GAN to convert input sequence dataset to dataset with selective sweep
Use Transformer model to convert input sequence data to data with selective sweep
Explore Progressive Growing as a strategy to stably generate long, high-quality sequence data (this might not work)
Use Sparse Transformers to generate extremely long sequence data

Disclaimer: Some files in the "1000G_real_genomes" directory were taken directly from the code corresponding to this paper: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009303

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
1000G_real_genomes		1000G_real_genomes
outputs		outputs
slurm		slurm
src		src
README.md		README.md
converter.sh		converter.sh

Provide feedback