Hepatocystis genome (Aunin et al.)

Data and code relating to the Hepatocystis sp. ex Piliocolobus tephrosceles genome and transcriptome paper.

Primary genome assembly and annotation files

PRJEB32891_scaffolds.fasta.gz - assembly sequence in fasta format

PRJEB32891_union.embl.gz - assembly and annotation in single EMBL record (union)

PRJEB32891_scaffolds.embl.gz - assembly and annotation in EMBL format

PRJEB32891_proteins.faa - protein sequences of predicted genes in fasta format

PRJEB32891_transcripts.fa - spliced nucleotide sequences of predicted genes in fasta format

Alignments for phylogenetic trees

S1_dataset_genes_for_phylogenetic_trees.xlsx - Table of the IDs of sequences that were used to generate the phylogenetic trees

apicoplast_concat_alignments.faa - alignment of apicoplast sequences

cytochrome_b_alignment.fa - cytochrome b alignment

11_genes_with_hepatocystis_epomophori_concat_alignments.fa - 11 nuclear gene alignment

mitoch_proteins_concat_alignments.faa - alignment of mitochondrial sequences

nuclear_genome_proteins_concat_alignments.faa - alignment of nuclear genes

concatenate_fasta_alignments.py - Script for concatenating protein FASTA alignments of different genes in order to make a species phylogeny tree
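
As a rough illustration of what concatenating per-gene alignments for a species tree involves, here is a minimal Python sketch. It is not the code in concatenate_fasta_alignments.py: the function names `read_fasta` and `concatenate_alignments` are hypothetical, and padding a species that is missing from a gene with gap characters is an assumption about how such gaps would be handled.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} mapping (insertion ordered)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def concatenate_alignments(alignments, gap="-"):
    """Join per-gene alignments; species absent from a gene are padded with gaps."""
    species = []
    for aln in alignments:
        for name in aln:
            if name not in species:
                species.append(name)
    combined = {name: "" for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop()
        for name in species:
            combined[name] += aln.get(name, gap * width)
    return combined


# Toy alignments with hypothetical species names sp1, sp2, sp3.
gene_a = read_fasta(">sp1\nMK-L\n>sp2\nMKAL\n")
gene_b = read_fasta(">sp1\nGG\n>sp3\nGA\n")
for name, seq in concatenate_alignments([gene_a, gene_b]).items():
    print(f">{name}\n{seq}")
```

Every species ends up with a sequence of the same total length, which is what tree-building software expects from a concatenated alignment.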

Deconvolution of Hepatocystis bulk RNA-seq using Malaria Cell Atlas data and CIBERSORT

merge_htseq-count_output_files.py - Script for merging htseq-count output files of multiple samples from same study into one table

MCA_pseudobulk.R - R code describing how to generate stage-specific pseudobulk samples as a reference for bulk RNA-seq deconvolution

generate_mixtures.py - generate mixtures of pseudobulk life stages at known percentages to test CIBERSORT for accuracy

mca_pseudobulk_meroringschizfilt_cpm.dat - Output from running generate_mixtures.py, i.e. pseudobulk for MCA life stages
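
The idea of mixing pseudobulk life stages at known percentages can be sketched as a weighted sum of stage profiles. The Python below is only an illustration under stated assumptions, not the logic of generate_mixtures.py: the function name `mix_profiles`, the stage labels and the gene names are hypothetical, and the expression values are made up.

```python
def mix_profiles(profiles, fractions):
    """Return a weighted sum of stage expression profiles.

    profiles:  {stage: {gene: expression}}
    fractions: {stage: fraction}; the fractions must sum to 1
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    genes = sorted({gene for profile in profiles.values() for gene in profile})
    return {
        gene: sum(
            fractions[stage] * profiles[stage].get(gene, 0.0)
            for stage in fractions
        )
        for gene in genes
    }


# Hypothetical pseudobulk profiles (illustrative values only).
profiles = {
    "ring": {"geneA": 100.0, "geneB": 0.0},
    "schizont": {"geneA": 20.0, "geneB": 80.0},
}
mixture = mix_profiles(profiles, {"ring": 0.25, "schizont": 0.75})
print(mixture)
```

Because the true fractions are known, the proportions estimated by a deconvolution tool for such a mixture can be compared directly with the inputs.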

Examination of missing genes in Hepatocystis relative to Plasmodium

heps_per_mca_cluster.py - Find orthologue groups that are shared between P. berghei and either P. ovale or P. vivax but absent from Hepatocystis, and check whether Malaria Cell Atlas gene clusters are enriched for these groups.

python heps_per_mca_cluster.py hepatocystis_orthomcl.out mca_gene_clusters.dat Pberghei.prot.desc

hepatocystis_orthomcl.out - orthoMCL clusters for protein sequences from Hepatocystis DNA assembly (Hepatocystis_DNA), RNA assembly (Hepatocystis_RNA) and various Plasmodium species.
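
The group selection described above can be sketched in Python as follows. This is an assumption-laden illustration rather than the code in heps_per_mca_cluster.py: it assumes each line of the OrthoMCL output looks like `GROUP(n genes,m taxa): gene1(taxon1) gene2(taxon2) ...`, and the Plasmodium taxon labels `Pberghei`, `Povale` and `Pvivax` are hypothetical (only `Hepatocystis_DNA` and `Hepatocystis_RNA` are named in this README). The enrichment test itself is not shown.

```python
import re

# One "gene(taxon)" member; assumes gene IDs contain no parentheses.
MEMBER = re.compile(r"(\S+?)\(([^()\s]+)\)")


def parse_groups(text):
    """Map each group ID to its list of (gene, taxon) members."""
    groups = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        header, members = line.split(":", 1)
        group_id = header.split("(")[0].strip()
        groups[group_id] = MEMBER.findall(members)
    return groups


def missing_from_hepatocystis(groups, hep_taxa, anchor, partners):
    """Groups containing `anchor` and at least one partner taxon, but no Hepatocystis taxon."""
    selected = []
    for group_id, members in groups.items():
        taxa = {taxon for _, taxon in members}
        if anchor in taxa and taxa & partners and not taxa & hep_taxa:
            selected.append(group_id)
    return selected


example = "\n".join([
    "ORTHOMCL1(3 genes,3 taxa): g1(Pberghei) g2(Povale) g3(Hepatocystis_DNA)",
    "ORTHOMCL2(2 genes,2 taxa): g4(Pberghei) g5(Pvivax)",
    "ORTHOMCL3(2 genes,2 taxa): g6(Povale) g7(Pvivax)",
])
groups = parse_groups(example)
missing = missing_from_hepatocystis(
    groups,
    hep_taxa={"Hepatocystis_DNA", "Hepatocystis_RNA"},
    anchor="Pberghei",
    partners={"Povale", "Pvivax"},
)
print(missing)
```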

Evolutionary analysis of genes

codeml_batch.py - Wrapper script for running codeml (http://envgen.nox.ac.uk/bioinformatics/docs/codeml.html) in batch mode. Requires codeml to be installed and available on the PATH.

gatk_count_variants_per_sample.py - Script for finding the average number of variants per 10 kb of reference genome in a VCF file (derived using GATK by merging GVCF files of multiple samples)

gatk_count_variants_per_sample_sliding_window.py - Script for counting the number of variants per fixed-length segment of the reference genome (default: 100 kb) in a VCF file using a sliding window. Input: a VCF file derived using GATK by merging GVCF files of multiple samples. The script assumes that the assembly to which the reads were mapped for variant calling was concatenated into one pseudochromosome before mapping. Output: a CSV file in which the rows correspond to samples and the columns correspond to genome sequence bins.

mca_gene_clusters.dat - File describing which genes are in which Malaria Cell Atlas clusters

hep_pberghei_povale_3-way_codeml_dn_results.txt - Results of codeml dN analysis

hepatocystis_dn_and_mca_clusters.Rmd - This R Markdown file contains the R code for generating Figure S12 from the manuscript on the draft genome of Hepatocystis sp. ex Piliocolobus tephrosceles

crop_translatorx_alignments.py - Script for processing TranslatorX alignments to remove badly aligned parts (similarly to what Gblocks does)

hepatocystis_pfam_domains_in_top_dn.py - Script for checking for the enrichment of PFAM domains in the Hepatocystis proteins with top dN values
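
For the per-sample, per-bin variant counts described for gatk_count_variants_per_sample_sliding_window.py, a minimal Python sketch of the binning step might look like this. It is not the script itself. It assumes non-overlapping bins, a single pseudochromosome, and that a site counts as a variant for a sample when its genotype (GT, the first FORMAT field) contains any non-reference allele; the function name and the toy VCF are hypothetical.

```python
import math


def count_variants_per_bin(vcf_text, genome_length, bin_size=100_000):
    """Return {sample: [variant count per bin]} for a single-chromosome VCF."""
    n_bins = math.ceil(genome_length / bin_size)
    samples, counts = [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            counts = {sample: [0] * n_bins for sample in samples}
            continue
        bin_index = (int(fields[1]) - 1) // bin_size  # POS is 1-based
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[0]
            alleles = genotype.replace("|", "/").split("/")
            # Count the site if any allele is called and non-reference.
            if any(allele not in ("0", ".") for allele in alleles):
                counts[sample][bin_index] += 1
    return counts


# Toy VCF with two hypothetical samples, S1 and S2.
vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "pseudo\t5\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0",
    "pseudo\t12\t.\tC\tT\t50\tPASS\t.\tGT\t1/1\t./.",
    "pseudo\t18\t.\tG\tA\t50\tPASS\t.\tGT\t0/0\t0|1",
])
counts = count_variants_per_bin(vcf, genome_length=25, bin_size=10)
for sample, row in counts.items():
    print(",".join([sample] + [str(n) for n in row]))
```

The printed lines follow the layout described above: one row per sample and one column per genome bin.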

Analysis of Haemoproteus tartakovskyi genome

hepatocystis_orthomcl_with_haemoproteus.out - Output file of an OrthoMCL run that includes the proteome of Haemoproteus tartakovskyi in addition to the proteomes of Hepatocystis and Plasmodium species

htartakovskyi_proteins_companion.faa - Proteins of Haemoproteus tartakovskyi, annotated using Companion software

haemoproteus_tartakovskyi_companion_annotations.embl - Sequence and Companion annotation of Haemoproteus tartakovskyi genome

Processing of 10X Chromium sequencing reads

chromium_reads_remove_barcodes.cpp - C++ program to remove barcodes and linkers from FASTQ files of Chromium reads

chromium_reads_barcode_frequencies.cpp - C++ program for finding frequencies of barcodes in FASTQ files of Chromium reads

chromium_reads_extract_barcodes.cpp - C++ program for extracting Chromium barcode sequences of Chromium reads that have been selected for assembly
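
To make the barcode-frequency step concrete, here is a short Python sketch of the general idea. It does not reproduce the C++ programs listed above. It assumes standard four-line FASTQ records with the barcode at the start of each read; the default barcode length of 16 bases is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def barcode_frequencies(fastq_text, barcode_length=16):
    """Count how often each leading barcode occurs in four-line FASTQ records."""
    counts = Counter()
    lines = fastq_text.splitlines()
    # The sequence is the second line of every four-line record.
    for i in range(1, len(lines), 4):
        sequence = lines[i].strip()
        if len(sequence) >= barcode_length:
            counts[sequence[:barcode_length]] += 1
    return counts


# Toy reads with a 4-base barcode for readability.
fastq = "\n".join([
    "@read1", "AAAATTGCA", "+", "IIIIIIIII",
    "@read2", "AAAAGGCTA", "+", "IIIIIIIII",
    "@read3", "CCCCTTGCA", "+", "IIIIIIIII",
])
freqs = barcode_frequencies(fastq, barcode_length=4)
for barcode, n in freqs.most_common():
    print(barcode, n)
```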

Sequence analysis

fasta_f.py - Multitool script for processing FASTA files. It collects various functions for performing different operations: breaking scaffolds into contigs, filtering sequences by length, finding the frequency of stop codons, checking the completeness of transcripts, extracting or removing sequences by name, batch editing of headers, getting the GC and length distribution of sequences and truncating or deduplicating the sequences

protein_motif_search.py - Script for detecting motifs in protein FASTA sequences

split_union_embl.py - Script for splitting a union EMBL file (made from a scaffolds EMBL file using 'EMBOSS union') back into EMBL files of individual scaffolds

detect_low_complexity_contaminant_contigs.R - Script that was used to detect low complexity contaminant contigs in the Hepatocystis genome assembly

hep_proteins_pfam_domains.txt - List of PFAM domains in Hepatocystis proteins
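
Two of the operations listed for fasta_f.py, breaking scaffolds into contigs and computing GC content, can be illustrated with the Python sketch below. These functions are hypothetical and are not taken from fasta_f.py. Splitting scaffolds at runs of N and ignoring ambiguous bases when computing GC are assumptions.

```python
import re


def scaffold_to_contigs(sequence, min_gap=1):
    """Split a scaffold into contigs at runs of at least `min_gap` N characters."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(sequence) if piece]


def gc_fraction(sequence):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


contigs = scaffold_to_contigs("ACGTNNNNGGCCNNA", min_gap=2)
for contig in contigs:
    print(contig, len(contig), gc_fraction(contig))
```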

Other

hep_project_shared_functions.py - File for functions that are shared between many scripts in the Hepatocystis project

adamjamesreid/hepatocystis-genome
