Skip to content
Pierre Lindenbaum edited this page Apr 28, 2016 · 6 revisions

##Motivation

Insert similar VCFs into an Apache Derby Database

##Compilation

Requirements / Dependencies

Download and Compile

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make vcfderby01

by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes on the command line:

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make vcfderby01 standalone=yes

The required libraries will be downloaded and installed in the dist directory.

edit 'local.mk' (optional)

The a file local.mk can be created edited to override/add some paths.

For example it can be used to set the HTTP proxy:

http.proxy.host=your.host.com
http.proxy.port=124567

##Synopsis

$ java -jar dist/vcfderby01.jar  [options] (stdin|file) 

Options

  • -o|--output (OUTPUT-FILE) Output file. Default:stdout.
  • -d|--derby (VALUE) REQUIRED. path to Derby database storage directory. Default value : "".
  • -a|--action (VALUE) REQUIRED. action to perform. 'read': read a zip or a concatenated stream of vcf files and insert it into a derby database. 'list': list the available vcf. 'dump' dump one or more VCF. 'dumpall' dump all VCFs. Default value : "".
  • -h|--help print help
  • -version|--version show version and exit

##Source Code

Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/burden/VcfDerby01.java

Input VCF The tool is optimized for storing very similar VCF files into an apache derby database , for example a big VCF file which would have been splitted into one VCF per transcript.

Schema At the time of writing this document, the current schema is:


CREATE TABLE ROWCONTENT(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,MD5SUM CHAR(32) UNIQUE,CONTENT CLOB,CONTIG VARCHAR(20),START INT,STOP INT,ALLELE_REF VARCHAR(50));
CREATE TABLE VCF(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,NAME VARCHAR(255));
CREATE TABLE VCFROW(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,VCF_ID INTEGER CONSTRAINT row2vcf REFERENCES VCF,ROW_ID INTEGER CONSTRAINT row2content REFERENCES ROWCONTENT);

The database is created the first time the database is created. It can be a slow process. The embedded database is local and can be removed by a simple

rm -rf database.db 

Inserting VCFs into the database Inserting one VCF:

$ java -jar dist/vcfderby01.jar -a read -d database.db input.vcf input2.vcf.gz

#ID	NAME
1	input.vcf
2	input2.vcf.gz

You can insert a VCF any number of times:

$ java -jar dist/vcfderby01.jar -a read -d database.db input.vcf input.vcf input2.vcf.gz
#ID	NAME
3	input.vcf
4	input2.vcf.gz

The program also accepts concatenated VCF files:

$ gunzip -c input2.vcf.gz input2.vcf.gz input2.vcf.gz input2.vcf.gz | java -jar dist/vcfderby01.jar -a read -d database.db 2> /dev/null 
#ID	NAME
5	vcf1461860749175
6	vcf1461860749176
7	vcf1461860749177
8	vcf1461860749178

Listing the available VCFs

$ java -jar dist/vcfderby01.jar -d database.db -a list
#ID	NAME	COUNT_VARIANTS
1	input.vcf	35
2	input2.vcf.gz	35
3	input.vcf	35
4	input2.vcf.gz	35
5	vcf1461860749175	35
6	vcf1461860749176	35
7	vcf1461860749177	35
8	vcf1461860749178	35

Export one or more VCFs by ID if more that one ID is given, the output is a stream of concatenated VCF.

$ java -jar dist/vcfderby01.jar -d database.db -a dump 5 8 3 2> /dev/null | grep "CHROM"
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3

if the output name ends with '*.zip', each VCF is saved in the zip as a new entry.

$ java -jar dist/vcfderby01.jar -d database.db -a dump -o out.zip 5 8 3
$ unzip -t out.zip

Archive:  out.zip
    testing: input.vcf.vcf            OK
    testing: vcf1461860749175.vcf     OK
    testing: vcf1461860749178.vcf     OK
No errors detected in compressed data of out.zip.

Dumping all VCFs In a zip:

$ java -jar dist/vcfderby01.jar -d database.db -a dumpall -o out.zip
$ unzip -t out.zip
Archive:  out.zip
    testing: input.vcf.vcf            OK
    testing: input2.vcf.gz.vcf        OK
    testing: 00001.ID3.vcf            OK
    testing: 00001.ID4.vcf            OK
    testing: vcf1461860749175.vcf     OK
    testing: vcf1461860749176.vcf     OK
    testing: vcf1461860749177.vcf     OK
    testing: vcf1461860749178.vcf     OK
No errors detected in compressed data of out.zip.

as a concatenated stream of VCFs:

$ java -jar dist/vcfderby01.jar -d database.db -a dumpall|\
  grep CHROM
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3

Accessing the database using ij ij is an interactive SQL scripting tool that comes with Derby. The libraries for ij is available in http://mvnrepository.com/artifact/org.apache.derby/derbytools or you can use something like:

sudo apt-get install derby-tools

 echo " connect 'jdbc:derby:database.db'; select count(*) from VCF; exit" |
 java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij> ij> 1          
-----------
8          

1 row selected

$ echo " connect 'jdbc:derby:database.db'; select ID,NAME from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij> ij> ID         |NAME                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------
1          |input.vcf                                                                                                                       
2          |input2.vcf.gz                                                                                                                   
3          |input.vcf                                                                                                                       
4          |input2.vcf.gz                                                                                                                   
5          |vcf1461860749175                                                                                                                
6          |vcf1461860749176                                                                                                                
7          |vcf1461860749177                                                                                                                
8          |vcf1461860749178                                                                                                                

8 rows selected
ij>>

Adding a column in the VCF:


$ echo " connect 'jdbc:derby:database.db'; ALTER TABLE VCF ADD COLStatValue DOUBLE DEFAULT -1.0; select * from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij> ij> 0 rows inserted/updated/deleted
ij> ID         |NAME                                                                                                                            |MYSTATVALUE             
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
1          |input.vcf                                                                                                                       |-1.0                    
2          |input2.vcf.gz                                                                                                                   |-1.0                    
3          |input.vcf                                                                                                                       |-1.0                    
4          |input2.vcf.gz                                                                                                                   |-1.0                    
5          |vcf1461860749175                                                                                                                |-1.0                    
6          |vcf1461860749176                                                                                                                |-1.0                    
7          |vcf1461860749177                                                                                                                |-1.0                    
8          |vcf1461860749178                                                                                                                |-1.0                    

8 rows selected

Updating the new column MYSTATVALUE:

echo " connect 'jdbc:derby:database.db'; UPDATE VCF SET MyStatValue =999 WHERE ID=1 OR ID=5 OR ID=8; select ID,MyStatValue from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij>
ij> 3 rows inserted/updated/deleted
ij> ID         |MYSTATVALUE             
------------------------------------
1          |999.0                   
2          |-1.0                    
3          |-1.0                    
4          |-1.0                    
5          |999.0                   
6          |-1.0                    
7          |-1.0                    
8          |999.0                   

8 rows selected

When a new column is added in a VCF, it is handled by vcderby01 and the action: 'list'

$ java -jar dist/vcfderby01.jar -d database.db -a list
#ID	MYSTATVALUE	NAME	COUNT_VARIANTS
1	999.0	input.vcf	35
2	-1.0	input2.vcf.gz	35
3	-1.0	input.vcf	35
4	-1.0	input2.vcf.gz	35
5	999.0	vcf1461860749175	35
6	-1.0	vcf1461860749176	35
7	-1.0	vcf1461860749177	35
8	999.0	vcf1461860749178	35

Contribute

License

The project is licensed under the MIT license.

Citing

Should you cite vcfderby01 ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

http://dx.doi.org/10.6084/m9.figshare.1425030

Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030

Clone this wiki locally