VcfDerby01

##Motivation

Insert similar VCFs into an Apache Derby Database

##Compilation

Requirements / Dependencies

Since 2016-05-30 the compilation of the "Java API for high-throughput sequencing data (HTS) formats" (htsjdk) library requires gradle http://gradle.org

java compiler SDK 1.8 http://www.oracle.com/technetwork/java/index.html (NOT the old java 1.7 or 1.6) . Please check that this java is in the ${PATH}. Setting JAVA_HOME is not enough : (e.g: https://github.com/lindenb/jvarkit/issues/23 )
GNU Make > 3.81
curl/wget
git
gradle http://gradle.org is only required to compile the "Java API for high-throughput sequencing data (HTS) formats" (htsjdk)
xsltproc http://xmlsoft.org/XSLT/xsltproc2.html

Download and Compile

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make vcfderby01

by default, the libraries are not included in the jar file, so you shouldn't move them (https://github.com/lindenb/jvarkit/issues/15#issuecomment-140099011 ). You can create a bigger but standalone executable jar by addinging standalone=yes on the command line:

$ git clone "https://github.com/lindenb/jvarkit.git"
$ cd jvarkit
$ make vcfderby01 standalone=yes

The required libraries will be downloaded and installed in the dist directory.

edit 'local.mk' (optional)

The a file local.mk can be created edited to override/add some definitions.

For example it can be used to set the HTTP proxy:

http.proxy.host=your.host.com
http.proxy.port=124567

to set the gradle user home ( https://docs.gradle.org/current/userguide/build_environment.html#sec:gradle_properties_and_system_properties )

gradle.user.home=/dir1/dir2/gradle_user_home

##Synopsis

$ java -jar dist/vcfderby01.jar  [options] (stdin|file)

Options

-o|--output (OUTPUT-FILE) Output file. Default:stdout.
-d|--derby (VALUE) REQUIRED. path to Derby database storage directory. Default value : "".
-a|--action (VALUE) REQUIRED. action to perform. 'read': read a zip or a concatenated stream of vcf files and insert it into a derby database. 'list': list the available vcf. 'dump' dump one or more VCF. 'dumpall' dump all VCFs. 'delete' : delete one or more VCF by ID Default value : "".
-t|--title (TITLE) Try to find ##(TITLE)=abcdefghijk in the VCF header and use it as the name of the inserted VCF file Default value : "".
-h|--help print help
-version|--version show version and exit

##Source Code

Main code is: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/burden/VcfDerby01.java

Input VCF

The tool is optimized for storing very similar VCF files into an apache derby database , for example a big VCF file which would have been splitted into one VCF per transcript.

Schema

At the time of writing this document, the current schema is:


CREATE TABLE ROWCONTENT(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,MD5SUM CHAR(32) UNIQUE,CONTENT CLOB,CONTIG VARCHAR(20),FILTERED SMALLINT NOT NULL,START INT,STOP INT,ALLELE_REF VARCHAR(50));
CREATE TABLE VCF(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,NAME VARCHAR(255));
CREATE TABLE VCFROW(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1) PRIMARY KEY,VCF_ID INTEGER CONSTRAINT row2vcf REFERENCES VCF,ROW_ID INTEGER CONSTRAINT row2content REFERENCES ROWCONTENT);

The database is created the first time the database is created. It can be a slow process. Whole VCF lines are stored in a CBLOB. The embedded database is local and can be removed by a simple

rm -rf database.db

Inserting VCFs into the database

Inserting one VCF:

$ java -jar dist/vcfderby01.jar -a read -d database.db input.vcf input2.vcf.gz

#ID	NAME
1	input.vcf
2	input2.vcf.gz

You can insert a VCF any number of times:

$ java -jar dist/vcfderby01.jar -a read -d database.db input.vcf input.vcf input2.vcf.gz
#ID	NAME
3	input.vcf
4	input2.vcf.gz

The program also accepts concatenated VCF files:

$ gunzip -c input2.vcf.gz input2.vcf.gz input2.vcf.gz input2.vcf.gz | java -jar dist/vcfderby01.jar -a read -d database.db 2> /dev/null 
#ID	NAME
5	vcf1461860749175
6	vcf1461860749176
7	vcf1461860749177
8	vcf1461860749178

Listing the available VCFs

$ java -jar dist/vcfderby01.jar -d database.db -a list
#ID	NAME	COUNT_VARIANTS
1	input.vcf	35
2	input2.vcf.gz	35
3	input.vcf	35
4	input2.vcf.gz	35
5	vcf1461860749175	35
6	vcf1461860749176	35
7	vcf1461860749177	35
8	vcf1461860749178	35

Export one or more VCFs by ID

if more that one ID is given, the output is a stream of concatenated VCF.

$ java -jar dist/vcfderby01.jar -d database.db -a dump 5 8 3 2> /dev/null | grep "CHROM"
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3

if the output name ends with '*.zip', each VCF is saved in the zip as a new entry.

$ java -jar dist/vcfderby01.jar -d database.db -a dump -o out.zip 5 8 3
$ unzip -t out.zip

Archive:  out.zip
    testing: input.vcf.vcf            OK
    testing: vcf1461860749175.vcf     OK
    testing: vcf1461860749178.vcf     OK
No errors detected in compressed data of out.zip.

Dumping all VCFs

In a zip:

$ java -jar dist/vcfderby01.jar -d database.db -a dumpall -o out.zip
$ unzip -t out.zip
Archive:  out.zip
    testing: input.vcf.vcf            OK
    testing: input2.vcf.gz.vcf        OK
    testing: 00001.ID3.vcf            OK
    testing: 00001.ID4.vcf            OK
    testing: vcf1461860749175.vcf     OK
    testing: vcf1461860749176.vcf     OK
    testing: vcf1461860749177.vcf     OK
    testing: vcf1461860749178.vcf     OK
No errors detected in compressed data of out.zip.

as a concatenated stream of VCFs:

$ java -jar dist/vcfderby01.jar -d database.db -a dumpall|\
  grep CHROM
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1.variant	S2.variant4	S3.variant2	S4.variant3

Delete some VCFs by ID

$ java -jar dist/vcfderby01.jar -d database.db -a delete 1 8 10

Dumping all VCFs

Accessing the database using ij

ij is an interactive SQL scripting tool that comes with Derby. The libraries for ij is available in http://mvnrepository.com/artifact/org.apache.derby/derbytools or you can use something like:

sudo apt-get install derby-tools


 echo " connect 'jdbc:derby:database.db'; select count(*) from VCF; exit" |
 java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij> ij> 1          
-----------
8          

1 row selected

$ echo " connect 'jdbc:derby:database.db'; select ID,NAME from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij> ij> ID         |NAME                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------
1          |input.vcf                                                                                                                       
2          |input2.vcf.gz                                                                                                                   
3          |input.vcf                                                                                                                       
4          |input2.vcf.gz                                                                                                                   
5          |vcf1461860749175                                                                                                                
6          |vcf1461860749176                                                                                                                
7          |vcf1461860749177                                                                                                                
8          |vcf1461860749178                                                                                                                

8 rows selected
ij>>

Adding a column in the VCF:


$ echo " connect 'jdbc:derby:database.db'; ALTER TABLE VCF ADD COLStatValue DOUBLE DEFAULT -1.0; select * from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij> ij> 0 rows inserted/updated/deleted
ij> ID         |NAME                                                                                                                            |MYSTATVALUE             
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
1          |input.vcf                                                                                                                       |-1.0                    
2          |input2.vcf.gz                                                                                                                   |-1.0                    
3          |input.vcf                                                                                                                       |-1.0                    
4          |input2.vcf.gz                                                                                                                   |-1.0                    
5          |vcf1461860749175                                                                                                                |-1.0                    
6          |vcf1461860749176                                                                                                                |-1.0                    
7          |vcf1461860749177                                                                                                                |-1.0                    
8          |vcf1461860749178                                                                                                                |-1.0                    

8 rows selected

Updating the new column MYSTATVALUE:

echo " connect 'jdbc:derby:database.db'; UPDATE VCF SET MyStatValue =999 WHERE ID=1 OR ID=5 OR ID=8; select ID,MyStatValue from VCF; exit" | java -cp ./lib/org/apache/derby/derbytools/10.12.1.1/derbytools-10.12.1.1.jar:./lib/org/apache/derby/derby/10.12.1.1/derby-10.12.1.1.jar:./lib/org/apache/derby/derbyclient/10.12.1.1/derbyclient-10.12.1.1.jar  org.apache.derby.tools.ij
ij version 10.12
ij>
ij> 3 rows inserted/updated/deleted
ij> ID         |MYSTATVALUE             
------------------------------------
1          |999.0                   
2          |-1.0                    
3          |-1.0                    
4          |-1.0                    
5          |999.0                   
6          |-1.0                    
7          |-1.0                    
8          |999.0                   

8 rows selected

When a new column is added in a VCF, it is handled by vcderby01 and the action: 'list'

$ java -jar dist/vcfderby01.jar -d database.db -a list
#ID	MYSTATVALUE	NAME	COUNT_VARIANTS
1	999.0	input.vcf	35
2	-1.0	input2.vcf.gz	35
3	-1.0	input.vcf	35
4	-1.0	input2.vcf.gz	35
5	999.0	vcf1461860749175	35
6	-1.0	vcf1461860749176	35
7	-1.0	vcf1461860749177	35
8	999.0	vcf1461860749178	35

Contribute

Issue Tracker: http://github.com/lindenb/jvarkit/issues
Source Code: http://github.com/lindenb/jvarkit

License

The project is licensed under the MIT license.

Citing

Should you cite vcfderby01 ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

http://dx.doi.org/10.6084/m9.figshare.1425030

Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030

Main page | All Wiki pages | Compilation | Citing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VcfDerby01

Requirements / Dependencies

Download and Compile

edit 'local.mk' (optional)

Options

Input VCF

Schema

Inserting VCFs into the database

Listing the available VCFs

Export one or more VCFs by ID

Dumping all VCFs

Delete some VCFs by ID

Dumping all VCFs

Accessing the database using ij

Contribute

License

Citing

Clone this wiki locally