Finding LncRNAs in Aspergillus fumigatus
INTERSECT: LncRNAs have to be deemed non-coding by both the CPC2 an FEELnc algorithms
Requires usage of: section_5_for_intersect_pipe_variation/CPC2andFEELNC.py
In place of: section_5_coding_potential_assessment /CPC2andFEELNC_UNION.py
UNION: LncRNAs have to be deemed non-coding by either the CPC2 an FEELnc algorithms
A1163 candidate LncRNAs were extracted from 44 paired-end RNA-seq runs of Aspergillus fumigatus (strain A1163) exposed to various drug regimens from the Prof. M. Bromley’s Group (University of Manchester).
AF293 samples were retrieved from the Sequence Read Archive (SRA) https://www.ncbi.nlm.nih.gov/sra
Scheme makes four different StringTie merged assemblies from the same sample set using different initial StringTie assembly parameters and on all or half the data set, and then merges. This seemed the most conducive to LncRNA retrieval. Novel protein-coding genes were recovered simultaneously.
Workflow implementation capitalises on code and concepts used to uncover LncRNAs in Candida which used the merge of many runs to reveal the RNAPolII LncRNAs.
Hovhannisyan, H., Gabaldón, T. The long non-coding RNA landscape of Candida yeast pathogens. Nat Commun 12, 7317 (2021).
https://doi.org/10.1038/s41467-021-27635-4 https://github.com/Gabaldonlab/lncRNAs
I would like to thank Harry Chown for all his help on this project
GTF and Fasta files from before expression cut-off and after for all pipelines for both candidate LncRNAs and novel protein coding.
Sections referred to are described in this document
python(3.9.9), pandas(1.4.2), argparse(1.1),numpy(1.22.3), csv, sys(3.9.9)
hisat2_extract_splice_sites.py https://github.com/DaehwanKimLab/hisat2/blob/master/hisat2_extract_splice_sites.py non_cod.py https://github.com/Gabaldonlab/lncRNAs/blob/main/mapping_assembly_lncRNA_prediction/non_cod.py select_longer_200.py https://github.com/Gabaldonlab/lncRNAs/blob/main/mapping_assembly_lncRNA_prediction/select_longer_200.py
2. - Quality checks of mapped reads and StringTie transcriptome Analysed in: A1163_analysis_StringTie_Transcriptomes-GH.pdf
section_2.1_polycistron:
get_longest_pt1_v2.sh
get_longest_pt2.sh
get_longest_A1163_pt3.sh
get_longest.py
section_2.2_sensitivity_precision
gff_compare_polycistronc_A1163.sh
section_2.3_Intron_Overlaps
bed_intersect_intron.sh get_intron_beds.A1163.sh get_intron_overlap_stats.sh intron_bed_making_get_stats_A1163.py intron_bed_making_just_beds1.py
section_5_coding_potential_assessment
CPC2andFEELNC_UNION.py sortgtf_no_coding_noRFAM_23_7.py Removing_BLAST_HITS_for_new_filter_revamped_4_8.py
section_6_ LncRNA retrieval
non_codU.py non_codX.py collapse_to_longest.py
section_7_novel_coding_retrieval
Making_Final_bits_for_pot_protein_coding.py
Jupyter notebooks used for analysis:
A1163_analysis_StringTie_Transcriptomes-GH.pdf
A1163_GC_content_features_GH.pdf
A1163_genome_coverage_GH.pdf
A1163_LNCRNA_cluster_genome_50000_GH.pdf
A1163_UNION_after_TPM_GC-CONTENT_Lengths_GH.pdf
A1163_UNION_exon_counting.pdf
A1163_UNION_TPM_GH.pdf