-
Notifications
You must be signed in to change notification settings - Fork 133
BioAlcidae
##Motivation
javascript version of awk for bioinformatics
##Compilation
- java 1.8 http://www.oracle.com/technetwork/java/index.html (NOT the old java 1.7 or 1.6) . Please check that this java is in the
${PATH}
. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 ) - GNU Make > 3.81
- curl/wget
- git
- apache ant is only required to compile htsjdk
- xsltproc http://xmlsoft.org/XSLT/xsltproc2.html
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make bioalcidae
by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes
on the command line:
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make bioalcidae standalone=yes
The required libraries will be downloaded and installed in the dist
directory.
The a file local.mk can be created edited to override/add some paths.
For example it can be used to set the HTTP proxy:
http.proxy.host=your.host.com
http.proxy.port=124567
##Synopsis
$ java -jar dist/bioalcidae.jar [options] (stdin|file)
- -o|--output (OUTPUT-FILE) Output file. Default:stdout
- -f|--jsfile (SCRIPT.JS) Javascript file. Use either javascript file or javascript expression.
- -e|--jsexpr (SCRIPT) Javascript expression. Use either javascript file or javascript expression.
- -F|--format (FORMAT) force format: one of VCF BAM SAM FASTQ FASTA
- -h|--help print help
- -version|--version show version and exit
##Source Code
Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/bioalcidae/BioAlcidae.java
Bioinformatics file javascript-based reformatter ( java engine http://openjdk.java.net/projects/nashorn/ ). Something like awk for VCF, BAM, SAM, FASTQ, FASTA etc...
The program injects the following variables:
- out a java.io.PrintWriter ( https://docs.oracle.com/javase/7/docs/api/java/io/PrintWriter.html ) for output
- FILENAME a string, the name of the current input
- format a string, the format of the current input ("VCF"...)
for VCF , the program injects the following variables:
- header a htsjdk.variant.vcf.VCFHeader https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFHeader.html
- iter a java.util.Iterator<htsjdk.variant.variantcontext.VariantContext> https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html
- iter a java.util.Iterator
public class Fasta
{
public String getSequence();
public String getName();
public void print();
public int getSize();
public char charAt(int i);
}
- header a htsjdk.samtools.SAMFileHeader http://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMFileHeader.html
- iter a htsjdk.samtools.SAMRecordIterator https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMRecordIterator.html
- iter a java.util.Iterator<htsjdk.samtools.fastq.FastqRecord> https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/fastq/FastqRecord.html
getting an histogram of the length of the reads
L={};
while(iter.hasNext()) {
var rec=iter.next();
if( rec.getReadUnmappedFlag() || rec.isSecondaryOrSupplementary()) continue;
var n= rec.getReadLength();
if(n in L) {L[n]++;} else {L[n]=1;}
}
for(var i in L) {
out.println(""+i+"\t"+L[i]);
}
"Creating a consensus based on 'x' number of fasta files" ( https://www.biostars.org/p/185162/#185168)
$ echo -e ">A_2795\nTCAGAAAGAACCTC\n>B_10\nTCAGAAAGCACCTC\n>C_3\nTCTGAAAGCACTTC" |\
java -jar ~/src/jvarkit-git/dist/bioalcidae.jar -F fasta -e 'var consensus=[];while(iter.hasNext()) { var seq=iter.next();out.printlnseq.name+"\t"+seq.sequence);for(var i=0;i< seq.length();++i) {while(consensus.length <= i) consensus.push({}); var b = seq.charAt(i);if(b in consensus[i]) {consensus[i][b]++;} else {consensus[i][b]=1;} } } out.print("Cons.\t"); for(var i=0;i< consensus.length;i++) {var best=0,base="N"; for(var b in consensus[i]) {if(consensus[i][b]>best) { best= consensus[i][b];base=b;}} out.print(base);} out.println();'
A_2795 TCAGAAAGAACCTC
B_10 TCAGAAAGCACCTC
C_3 TCTGAAAGCACTTC
Cons. TCAGAAAGCACCTC
Reformating a VCF we want to reformat a VCF with header
CHROM POS REF ALT GENOTYPE_SAMPLE1 GENOTYPE_SAMPLE2 ... GENOTYPE_SAMPLEN
we use the following javascript file:
var samples = header.sampleNamesInOrder;
out.print("CHROM\tPOS\tREF\tALT");
for(var i=0;i< samples.size();++i)
{
out.print("\t"+samples.get(i));
}
out.println();
while(iter.hasNext())
{
var ctx = iter.next();
if(ctx.alternateAlleles.size()!=1) continue;
out.print(ctx.chr +"\t"+ctx.start+"\t"+ctx.reference.displayString+"\t"+ctx.alternateAlleles.get(0).displayString);
for(var i=0;i< samples.size();++i)
{
var g = ctx.getGenotype(samples.get(i));
out.print("\t");
if(g.isHomRef())
{
out.print("0");
}
else if(g.isHomVar())
{
out.print("2");
}
else if(g.isHet())
{
out.print("1");
}
else
{
out.print("-9");
}
}
out.println();
}
$ curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" | \
gunzip -c | java -jar ./dist/bioalcidae.jar -f jeter.js -F vcf | head -n 5 | cut -f 1-10
CHROM POS REF ALT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
22 16050075 A G 0 0 0 0 0 0
22 16050115 G A 0 0 0 0 0 0
22 16050213 C T 0 0 0 0 0 0
22 16050319 C T 0 0 0 0 0 0
for 1000 genome data, print CHROM/POS/REF/ALT/AF(europe):
$ curl "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf.gz" | gunzip -c |\
java -jar dist/bioalcidae.jar -F VCF -e 'while(iter.hasNext()) {var ctx=iter.next(); if(!ctx.hasAttribute("EUR_AF") || ctx.alternateAlleles.size()!=1) continue; out.println(ctx.chr+"\t"+ctx.start+"\t"+ctx.reference.displayString+"\t"+ctx.alternateAlleles.get(0).displayString+"\t"+ctx.getAttribute("EUR_AF"));}'
1 10177 A AC 0.4056
1 10235 T TA 0
1 10352 T TA 0.4264
1 10505 A T 0
1 10506 C G 0
1 10511 G A 0
1 10539 C A 0.001
1 10542 C T 0
1 10579 C A 0
1 10616 CCGCCGTTGCAAAGGCGCGCCG C 0.994
(...)
- https://github.com/lh3/bioawk
- https://www.biostars.org/p/152016/
- https://www.biostars.org/p/152720/
- https://www.biostars.org/p/152820/
- https://www.biostars.org/p/153060/
- https://www.biostars.org/p/183197
- https://www.biostars.org/p/185162/#185168
- Issue Tracker: http://github.com/lindenb/jvarkit/issues
- Source Code: http://github.com/lindenb/jvarkit
The project is licensed under the MIT license.
Should you cite bioalcidae ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030