# A. fumigatus LncRNAs
Finding LncRNAs in Aspergillus fumigatus

## Workflow for finding LncRNAs in A. fumigatus (strains A1163 and AF293)
Two workflows are implemented:<br>
### INTERSECT: LncRNAs must be deemed non-coding by both the CPC2 and FEELnc algorithms
Requires usage of: section_5_for_intersect_pipe_variation/CPC2andFEELNC.py<br>
in place of: section_5_coding_potential_assessment/CPC2andFEELNC_UNION.py<br>
### UNION: LncRNAs must be deemed non-coding by either the CPC2 or the FEELnc algorithm
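
The difference between the two workflows comes down to set logic over the per-tool calls. The sketch below illustrates that idea only; it is not the code in CPC2andFEELNC.py or CPC2andFEELNC_UNION.py. The function name `combine_noncoding_calls` and the transcript IDs are hypothetical, and it assumes each tool's result has already been reduced to a set of transcript IDs called non-coding.

```python
def combine_noncoding_calls(cpc2_noncoding, feelnc_noncoding, mode="union"):
    """Combine per-tool non-coding calls (hypothetical helper, not from the repository).

    mode="intersect": keep transcripts called non-coding by both tools.
    mode="union":     keep transcripts called non-coding by at least one tool.
    """
    cpc2 = set(cpc2_noncoding)
    feelnc = set(feelnc_noncoding)
    if mode == "intersect":
        return cpc2 & feelnc
    if mode == "union":
        return cpc2 | feelnc
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up transcript IDs, for illustration only.
cpc2_calls = {"MSTRG.1.1", "MSTRG.2.1", "MSTRG.3.1"}
feelnc_calls = {"MSTRG.2.1", "MSTRG.3.1", "MSTRG.4.1"}

print(sorted(combine_noncoding_calls(cpc2_calls, feelnc_calls, "intersect")))
# ['MSTRG.2.1', 'MSTRG.3.1']
print(sorted(combine_noncoding_calls(cpc2_calls, feelnc_calls, "union")))
# ['MSTRG.1.1', 'MSTRG.2.1', 'MSTRG.3.1', 'MSTRG.4.1']
```

INTERSECT is therefore the stricter filter: its candidate set is always a subset of the UNION candidate set.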
## Samples:
**A1163** candidate LncRNAs were extracted from 44 paired-end RNA-seq runs of *Aspergillus fumigatus* (strain A1163) exposed to various drug regimens, provided by Prof. M. Bromley's group (University of Manchester).<br>
**AF293** samples were retrieved from the Sequence Read Archive (SRA) https://www.ncbi.nlm.nih.gov/sra<br>
<br>
The scheme builds four StringTie merged assemblies from the same sample set, using different initial StringTie assembly parameters and either all or half of the data set, and then merges them. This approach seemed the most conducive to LncRNA retrieval. Novel protein-coding genes were recovered at the same time.<br>The workflow implementation builds on code and concepts used to uncover LncRNAs in *Candida*, where merging many runs was used to reveal the RNA Pol II LncRNAs:<br>
Hovhannisyan, H., Gabaldón, T. The long non-coding RNA landscape of Candida yeast pathogens. Nat Commun 12, 7317 (2021). https://doi.org/10.1038/s41467-021-27635-4 https://github.com/Gabaldonlab/lncRNAs
### Outputs of Pipelines: **Data_From_PIPES**
### Detailed workflow for the A1163-UNION pipeline: LncRNA_retrieval_Pipeline.pdf
The sections referred to below are described in this document.
### Python programs require:
- python 3.9.9
- pandas 1.4.2
- argparse 1.1 (standard library)
- numpy 1.22.3
- csv (standard library)
- sys (standard library)
<br>
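
To compare a local environment against the versions listed above, a small check along the following lines may help. It is a minimal sketch, not part of the repository: the names `PINNED`, `installed_versions` and `report` are hypothetical, and the listed versions are treated as the configuration stated in this README rather than as verified minimum requirements.

```python
import sys
from importlib import metadata

# Third-party versions as listed in this README (assumed to be the tested
# configuration, not verified minimum requirements).
PINNED = {"pandas": "1.4.2", "numpy": "1.22.3"}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(pinned=PINNED):
    """Build one line for the interpreter and one line per pinned package."""
    v = sys.version_info
    lines = [f"python {v.major}.{v.minor}.{v.micro} (README lists 3.9.9)"]
    for name, have in installed_versions(pinned).items():
        lines.append(f"{name}: installed={have}, README lists {pinned[name]}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The standard-library modules (argparse, csv, sys) ship with the interpreter, so only pandas and numpy need to be installed separately.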
### External supplementary programs:
- hisat2_extract_splice_sites.py https://github.com/DaehwanKimLab/hisat2/blob/master/hisat2_extract_splice_sites.py
- non_cod.py https://github.com/Gabaldonlab/lncRNAs/blob/main/mapping_assembly_lncRNA_prediction/non_cod.py
- select_longer_200.py https://github.com/Gabaldonlab/lncRNAs/blob/main/mapping_assembly_lncRNA_prediction/select_longer_200.py
<br>
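
As an illustration of the kind of length filter implied by the name select_longer_200.py, the sketch below sums exon lengths per transcript from GTF lines and keeps transcripts longer than 200 nt, the conventional minimum length for LncRNAs. It is not the external script itself, whose exact behaviour is not described here; the function names, the threshold default and the example records are assumptions made for illustration.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id from tab-separated GTF lines."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to spliced length
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            start, end = int(fields[3]), int(fields[4])
            lengths[match.group(1)] += end - start + 1  # GTF is 1-based, inclusive
    return dict(lengths)


def longer_than(lengths, min_length=200):
    """Return the IDs of transcripts whose spliced length exceeds min_length."""
    return {tid for tid, n in lengths.items() if n > min_length}


# Made-up records, for illustration only.
example = [
    'chr1\tStringTie\texon\t100\t250\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    'chr1\tStringTie\texon\t400\t500\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    'chr1\tStringTie\texon\t100\t250\t.\t+\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]
lengths = transcript_lengths(example)
print(lengths)               # {'MSTRG.1.1': 252, 'MSTRG.2.1': 151}
print(longer_than(lengths))  # {'MSTRG.1.1'}
```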
### 2. - Quality checks of mapped reads and StringTie transcriptome
- **Analysed in: A1163_analysis_StringTie_Transcriptomes-GH.pdf**
- **section_2.1_polycistron**
  - get_longest_pt1_v2.sh
  - get_longest_pt2.sh
  - get_longest_A1163_pt3.sh
  - get_longest.py
- **section_2.2_sensitivity_precision**
  - gff_compare_polycistronc_A1163.sh
- **section_2.3_Intron_Overlaps**
  - bed_intersect_intron.sh
  - get_intron_beds.A1163.sh
  - get_intron_overlap_stats.sh
  - intron_bed_making_get_stats_A1163.py
  - intron_bed_making_just_beds1.py
- **section_5_coding_potential_assessment**
  - CPC2andFEELNC_UNION.py
  - sortgtf_no_coding_noRFAM_23_7.py
  - Removing_BLAST_HITS_for_new_filter_revamped_4_8.py
- **section_6_ LncRNA retrieval**
  - non_codU.py
  - non_codX.py
  - collapse_to_longest.py
- **section_7_novel_coding_retrieval**
  - Making_Final_bits_for_pot_protein_coding.py
- **Jupyter notebooks** used for analysis:
  - A1163_analysis_StringTie_Transcriptomes-GH.pdf
  - A1163_GC_content_features_GH.pdf
  - A1163_genome_coverage_GH.pdf
  - A1163_LNCRNA_cluster_genome_50000_GH.pdf
  - A1163_UNION_after_TPM_GC-CONTENT_Lengths_GH.pdf
  - A1163_UNION_exon_counting.pdf
  - A1163_UNION_TPM_GH.pdf