-
Notifications
You must be signed in to change notification settings - Fork 133
VcfDerby01
##Motivation
Insert similar VCFs into an Apache Derby Database
##Compilation
- java 1.8 http://www.oracle.com/technetwork/java/index.html (NOT the old java 1.7 or 1.6) . Please check that this java is in the
${PATH}
. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 ) - GNU Make > 3.81
- curl/wget
- git
- apache ant is only required to compile htsjdk
- xsltproc http://xmlsoft.org/XSLT/xsltproc2.html
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make vcfderby01
by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes
on the command line:
$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make vcfderby01 standalone=yes
The required libraries will be downloaded and installed in the dist
directory.
The a file local.mk can be created edited to override/add some paths.
For example it can be used to set the HTTP proxy:
http.proxy.host=your.host.com
http.proxy.port=124567
##Synopsis
$ java -jar dist/vcfderby01.jar [options] (stdin|file)
- -o|--output (OUTPUT-FILE) Output file. Default:stdout.
- -d|--derby (VALUE) REQUIRED. path to Derby database storage directory. Default value : "".
- -a|--action (VALUE) REQUIRED. action to perform. 'read': read a zip or a concatenated stream of vcf files and insert it into a derby database. 'list': list the available vcf. 'dump' dump one or more VCF. 'dumpall' dump all VCFs. Default value : "".
- -h|--help print help
- -version|--version show version and exit
##Source Code
Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/burden/VcfDerby01.java
Input VCF The tool is optimized for storing very similar VCF files into an apache derby database , for example a big VCF file which would have been splitted into one VCF per transcript.
Schema At the time of writing this document, the current schema is:
CREATE TABLE ROWCONTENT(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,MD5SUM CHAR(32) UNIQUE,CONTENT CLOB,CONTIG VARCHAR(20),START INT,STOP INT,ALLELE_REF VARCHAR(50));
CREATE TABLE VCF(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,NAME VARCHAR(255));
CREATE TABLE VCFROW(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,VCF_ID INTEGER CONSTRAINT row2vcf REFERENCES VCF,ROW_ID INTEGER CONSTRAINT row2content REFERENCES ROWCONTENT);
The database is created the first time the database is created. It can be a slow process. The embedded database is local and can be removed by a simple
rm -rf database.db
Inserting VCFs into the database Inserting one VCF:
$ java -jar dist/vcfderby01.jar -a read -d database.db input.vcf input2.vcf.gz
#ID NAME
1 input.vcf
2 input2.vcf.gz
You can insert a VCF any number of times:
$ java -jar dist/vcfderby01.jar -a read -d database.db input.vcf input.vcf input2.vcf.gz
#ID NAME
3 input.vcf
4 input2.vcf.gz
The program also accepts concatenated VCF files:
$ gunzip -c input2.vcf.gz input2.vcf.gz input2.vcf.gz input2.vcf.gz | java -jar dist/vcfderby01.jar -a read -d database.db 2> /dev/null
#ID NAME
5 vcf1461860749175
6 vcf1461860749176
7 vcf1461860749177
8 vcf1461860749178
Listing the available VCFs
$ java -jar dist/vcfderby01.jar -d database.db -a list
#ID NAME COUNT_VARIANTS
1 input.vcf 35
2 input2.vcf.gz 35
3 input.vcf 35
4 input2.vcf.gz 35
5 vcf1461860749175 35
6 vcf1461860749176 35
7 vcf1461860749177 35
8 vcf1461860749178 35
Export one or more VCFs by ID if more that one ID is given, the output is a stream of concatenated VCF.
$ java -jar dist/vcfderby01.jar -d database.db -a dump 5 8 3 2> /dev/null | grep "CHROM"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
if the output name ends with '*.zip', each VCF is saved in the zip as a new entry.
$ java -jar dist/vcfderby01.jar -d database.db -a dump -o out.zip 5 8 3
$ unzip -t out.zip
Archive: out.zip
testing: input.vcf.vcf OK
testing: vcf1461860749175.vcf OK
testing: vcf1461860749178.vcf OK
No errors detected in compressed data of out.zip.
Dumping all VCFs In a zip:
$ java -jar dist/vcfderby01.jar -d database.db -a dumpall -o out.zip
$ unzip -t out.zip
Archive: out.zip
testing: input.vcf.vcf OK
testing: input2.vcf.gz.vcf OK
testing: 00001.ID3.vcf OK
testing: 00001.ID4.vcf OK
testing: vcf1461860749175.vcf OK
testing: vcf1461860749176.vcf OK
testing: vcf1461860749177.vcf OK
testing: vcf1461860749178.vcf OK
No errors detected in compressed data of out.zip.
as a concatenated stream of VCFs:
$ java -jar dist/vcfderby01.jar -d database.db -a dumpall|\
grep CHROM
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1.variant S2.variant4 S3.variant2 S4.variant3
Accessing the database using ij ij is an interactive SQL scripting tool that comes with Derby. The libraries for ij is available in http://mvnrepository.com/artifact/org.apache.derby/derbytools or you can use something like:
sudo apt-get install derby-tools
echo " connect 'jdbc:derby:database.db'; select count(*) from VCF; exit" |
java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar org.apache.derby.tools.ij
ij version 10.12
ij> ij> 1
-----------
8
1 row selected
$ echo " connect 'jdbc:derby:database.db'; select ID,NAME from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar org.apache.derby.tools.ij
ij version 10.12
ij> ij> ID |NAME
--------------------------------------------------------------------------------------------------------------------------------------------
1 |input.vcf
2 |input2.vcf.gz
3 |input.vcf
4 |input2.vcf.gz
5 |vcf1461860749175
6 |vcf1461860749176
7 |vcf1461860749177
8 |vcf1461860749178
8 rows selected
ij>>
Adding a column in the VCF:
$ echo " connect 'jdbc:derby:database.db'; ALTER TABLE VCF ADD COLStatValue DOUBLE DEFAULT -1.0; select * from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar org.apache.derby.tools.ij
ij version 10.12
ij> ij> 0 rows inserted/updated/deleted
ij> ID |NAME |MYSTATVALUE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 |input.vcf |-1.0
2 |input2.vcf.gz |-1.0
3 |input.vcf |-1.0
4 |input2.vcf.gz |-1.0
5 |vcf1461860749175 |-1.0
6 |vcf1461860749176 |-1.0
7 |vcf1461860749177 |-1.0
8 |vcf1461860749178 |-1.0
8 rows selected
Updating the new column MYSTATVALUE:
echo " connect 'jdbc:derby:database.db'; UPDATE VCF SET MyStatValue =999 WHERE ID=1 OR ID=5 OR ID=8; select ID,MyStatValue from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar org.apache.derby.tools.ij
ij version 10.12
ij>
ij> 3 rows inserted/updated/deleted
ij> ID |MYSTATVALUE
------------------------------------
1 |999.0
2 |-1.0
3 |-1.0
4 |-1.0
5 |999.0
6 |-1.0
7 |-1.0
8 |999.0
8 rows selected
When a new column is added in a VCF, it is handled by vcderby01 and the action: 'list'
$ java -jar dist/vcfderby01.jar -d database.db -a list
#ID MYSTATVALUE NAME COUNT_VARIANTS
1 999.0 input.vcf 35
2 -1.0 input2.vcf.gz 35
3 -1.0 input.vcf 35
4 -1.0 input2.vcf.gz 35
5 999.0 vcf1461860749175 35
6 -1.0 vcf1461860749176 35
7 -1.0 vcf1461860749177 35
8 999.0 vcf1461860749178 35
- Issue Tracker: http://github.com/lindenb/jvarkit/issues
- Source Code: http://github.com/lindenb/jvarkit
The project is licensed under the MIT license.
Should you cite vcfderby01 ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md
The current reference is:
http://dx.doi.org/10.6084/m9.figshare.1425030
Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030