Python scripts for manipulating various genomics-related file formats.
Some definitions are duplicated across scripts so that each script is as standalone
as possible.
bedIntersect2percentOverlap.py
: output the proportion of features participating in an overlap, as found by bedtools intersect -wa -wb
gff2bed.py
: convert a GFF3 file to a BED format file
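
For orientation, the core of any GFF3-to-BED conversion is a coordinate shift: GFF3 uses 1-based, inclusive coordinates, while BED uses 0-based, half-open ones. The following is a minimal sketch of that idea, not the code in gff2bed.py. The function name `gff3_line_to_bed` and the choice of the GFF3 `ID` attribute as the BED name are assumptions made for illustration.

```python
def gff3_line_to_bed(line):
    """Convert one GFF3 feature line to a BED6 line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        return None  # skip comments, directives, and malformed lines
    seqid, _source, ftype, start, end, score, strand, _phase, attrs = fields
    # Parse "key=value;key=value" attributes.
    attr = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
    # Assumption: use the ID attribute as the BED name, else the feature type.
    name = attr.get("ID", ftype)
    # GFF3 is 1-based inclusive; BED is 0-based half-open, so only start shifts.
    bed_start = int(start) - 1
    # Assumption: an undefined GFF3 score (".") becomes 0 in BED.
    bed_score = "0" if score == "." else score
    return "\t".join([seqid, str(bed_start), end, name, bed_score, strand])
```

A feature spanning 100..200 in GFF3 therefore becomes 99..200 in BED, and its length is simply end minus start.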
blast2gff
: convert blastn, blastp, etc. tabular output to GFF3 format
blastBestHit.py
: output the highest-scoring hit from blastn, blastp, etc. tabular output (-outfmt 6 or 7)
blastFilter.py
: output lines from blastn, blastp, etc. tabular output (-outfmt 6 or 7) where the percent identity satisfies constraints
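
As a rough illustration of working with BLAST tabular output, the sketch below keeps the highest-bit-score hit for each query and optionally applies a minimum percent identity. It assumes the default 12-column layout of -outfmt 6 (query ID in column 1, percent identity in column 3, bit score in column 12) and treats "best" as "highest bit score per query". The scripts above may define these differently, and `best_hits` and `min_pident` are hypothetical names.

```python
def best_hits(lines, min_pident=0.0):
    """Return {query: columns} for the highest-bit-score hit per query."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and -outfmt 7 comment lines
        cols = line.rstrip("\n").split("\t")
        # Default column order: qseqid, sseqid, pident, ..., evalue, bitscore
        query, pident, bitscore = cols[0], float(cols[2]), float(cols[11])
        if pident < min_pident:
            continue  # fails the percent identity constraint
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, cols)
    return {query: cols for query, (_, cols) in best.items()}
```

Because comment lines are skipped, the same function accepts both -outfmt 6 and -outfmt 7 input.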
coverage2circosLine.py
: calculate average depth of coverage from the output of bedtools genomecov -ibam -d and output a Circos line track
fasta2GCcontentCircosHeatmap.py
: calculate GC content for each window in each sequence in a FASTA file and output a Circos heatmap track
fixTrackLabels.py
: replace labels in a Circos track file with the integer label from the associated Circos karyotype file
gff2circosHeatmap.py
: convert feature coordinates in a GFF3 file to Circos heatmap track format with a specified bin size
gff2circosTile.py
: convert features in a GFF3 file to Circos tile track format
vcfSNPrate2circosLine.py.untested
: takes a VCF file, with or without a GFF3 file whose feature (gene) coordinates are represented in the VCF file, and outputs the SNP rate per gene or a Circos heatmap track of SNP rate/bin size
fasta2circosIdeograms
: output sequence lengths as a Circos ideogram file
fasta2GCcontentCircosHeatmap.py
: calculate GC content for each window in each sequence in a FASTA file and output a Circos heatmap track
fastaExtractSeqs.py
: extract a subset of the sequences in a FASTA file
fastaExtractNseqs.py
: extract the first, second, third, etc. set of n sequences from a FASTA file
fastaRenameSeqs.py
: rename FASTA sequence headers according to a mapping of old to new names
fastaRenameSeqsByLength.py
: sort FASTA sequences by length in descending order and rename them sequentially
fastaSplitSeqs.py
: write a new FASTA file for each sequence in a FASTA file
gff2fasta.py
: extract sequences from a FASTA file based on coordinates in a GFF3 file, using the value from a specified key in the GFF3 attributes column as the sequence name
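
The windowed GC calculation described for fasta2GCcontentCircosHeatmap.py can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual logic: windows are non-overlapping, the last window may be shorter than the rest, and every character in a window (including N) counts toward the denominator. The name `gc_windows` and its `window` parameter are hypothetical.

```python
def gc_windows(seq, window=1000):
    """Yield (start, end, gc_fraction) for consecutive non-overlapping windows.

    Coordinates are 0-based, half-open offsets into the sequence.
    """
    seq = seq.upper()  # treat soft-masked (lowercase) bases like any other
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        # Assumption: divide by the full window length, ambiguous bases included.
        yield start, start + len(chunk), gc / len(chunk)
```

Each yielded tuple corresponds to one bin of a heatmap track once the sequence name is attached.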
blast2gff
: convert blastn, blastp, etc. tabular output to GFF3 format
gff2bed.py
: convert a GFF3 file to a BED format file
gff2circosHeatmap.py
: convert feature coordinates in a GFF3 file to Circos heatmap track format with a specified bin size
gff2circosTile.py
: convert features in a GFF3 file to Circos tile track format
gff2fasta.py
: extract sequences from a FASTA file based on coordinates in a GFF3 file, using the value from a specified key in the GFF3 attributes column as the sequence name. Depends on BEDTools and Biopython
gff2introns.py
: create a GFF3 with intron features from a GFF3 with gene and exon features, or output a list of intron lengths
gff3line.py
: contains the GFF3_class
gffAddAttribute.py
: add a key-value pair to the attributes column of a GFF3 file
gffFilter.py
: remove or retain GFF3 features on specified scaffolds or with specified values for the ID attribute
gffMergeOverlaps.py
: merge overlapping features in a GFF3 file
gffRemoveScafPart.py
: remove features in a GFF3 file whose coordinates
gffRenameScafs.py
: rename scaffolds in a GFF3 file per a two-column map
gffSubset.py
: extract a subset of a GFF3 file based on values of a chosen attribute key
gffSubsetLTRdigest.py
: extract feature blocks from an LTRharvest/LTRdigest GFF3
gffv2Exonerate2gff3.py
: convert an Exonerate-generated GFF2 file to GFF3 format
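
To make the intron derivation mentioned for gff2introns.py concrete: introns can be inferred as the gaps between consecutive exons of one transcript. The sketch below is a simplified illustration, not the script's implementation. It assumes 1-based inclusive GFF3 coordinates and exons that all belong to a single transcript and are not nested inside one another, and `introns_from_exons` is a hypothetical name.

```python
def introns_from_exons(exons):
    """Return introns as (start, end) gaps between consecutive sorted exons.

    `exons` is a list of (start, end) tuples in 1-based inclusive coordinates.
    """
    introns = []
    ordered = sorted(exons)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip overlapping or abutting exons
            introns.append((prev_end + 1, next_start - 1))
    return introns


def intron_lengths(exons):
    """Lengths of the inferred introns; inclusive coordinates need the +1."""
    return [end - start + 1 for start, end in introns_from_exons(exons)]
```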
vcfSNPrate2circosLine.py.untested
: takes a VCF file, with or without a GFF3 file whose feature (gene) coordinates are represented in the VCF file, and outputs the SNP rate per gene or a Circos heatmap track of SNP rate/bin size
meanMedianMinMax.py
: takes a list of numbers as input and outputs the mean, median, minimum value, maximum value, and sum total
repeatMaskerGFFsubset
: takes a RepeatMasker GFF as input and writes lines from several categories each into their own file
repeatMaskerGFFsummarize
: writes tables with summarized counts and lengths of features in a RepeatMasker-derived GFF3