The first python script I write, to implememt the length summary and statistics of a fasta file (genome).
can be used like :
python N50.py fasta [10|20|30..|90] ## if the last argument is missing, then a whole summary is generated.
Pseudochromosome assembly with a chromosome-level genome reference of a very close related species.
This pipeline is based a pipeline which graphically display the homogenicity between two genomes. I went one step further to improove the genome whose average length is shorter according the genome whose is longer.
This pipe require a usable MUMmer and works in this order:
nucmer -p prefix ref query ### ref and query represents two genome sequnce files, the only two oringal inputs
perl delta2list.pl prefix.delta ref.size query.size > ref_query.list
perl mcl_one_cluster.pl ref_query.list > ref_query.list.mcl
perl cat_pseudochromosome.pl ref_query.list.mcl ref.size query improoved.results
There are three steps.
- Mapping Query genome to reference genome using nucmer in MUMmer package, and extract information from DELTA format outputs.
- Mergeing homologous blocks. The rule is below and three involved threshold need to be customerized in mcl_one_cluster.pl is descripted:
a. sort original alignment blocks acoording to ref coordinates and then query coordinates.
b. merge blocks in a cluster if d and D is smaller than threshold [-dis_cutof].
c. if both ΣR and ΣQ of this cluster is bigger than threshold [-cover_cutof], and number of alignments in this block (n) is bigger than threshold [-cluster],then this whole cluster passes this filtering. - Output a svg file which display these clusters in physical order. In this SVG.pl a parameter can be set to adjust canvas size. Genome size/parameter=canvas width or length in pixel.
shell commands should run like this example. Please download involved perl scripts in this folder by hand.
## mapping, make sure you have nucmer ready to go.
nucmer -c 100 -p prefix ref_genome query_genome
perl delta2list.pl prefix.delta ref.size query.size > ref_query.list
## merge cluster and filtering
perl mcl_one_cluster.pl [options] ref_query.list > ref_query.list.mcl
## sort according to coordinates and draw SVG.
perl order.pl ref_query.list.mcl ref.size query.size > ref_query.list.mcl.order
perl SVG.pl ref_query.list ref.size ref_query.list.mcl.order ref_query.list.mcl.svg 50000
Correct small InDels in a genome, using cDNA, clean reads in fastq format or reliable contigs assembled using NGS data
Third generation sequencing tech facilitates complex genome assemble, and improoved the quality of many genome.
However genomes assembled from these erronuos long reads inherited the drawbacks of long reads.
We observed the small InDel occurency is about one in every 100-1000 bp.
Many error correction applications are developed, mostly based on mapping NGS data to genome, which may cost weeks.
This script calls small InDels from blast results and output a corrected results.
This scripts can also be used to generate a consensus sequence from multi similiar sequences.
Properly alter blast2consensus.config before you run:
sh blast2consensus.sh <1|2|3|4> ### 1,2,3,4 represents four steps of this script
This scripts lists candidate sgRNAs. Further selection is needed to shorten the list, according to GC content and other restrictions. An example is given below, sg_RNA_length is the length of sgRNA, while query and genome is the interested gene sequence and the reference genome sequence respectively. A file named query_sg_RNA_length.sg_RNA.targs will be generated, storing all possible sgRNAs.
perl crispr.sgRNA.finder.pl query genome sg_RNA_length