3.3.1
- Fixed alignment concat where results could be truncated if several empty slices followed one another (e.g., when
concatenating A, B, C where A and B are empty, goby ca could yield an empty alignment, completely omitting the
alignments in part C).
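The class of bug fixed in 3.3.1 can be illustrated with a minimal Python sketch. This is not Goby's Java implementation: the function name `concat_slices` and the list-of-lists representation of alignment parts are hypothetical, chosen only to show that an empty part must be skipped rather than treated as the end of the input.

```python
from typing import Iterable, Iterator, List


def concat_slices(parts: Iterable[List[str]]) -> Iterator[str]:
    """Yield entries from each part in order, skipping empty parts."""
    for part in parts:
        if not part:
            # An empty part must not end the iteration: later parts may
            # still contain entries.
            continue
        yield from part


# Parts A and B are empty, part C is not (hypothetical entries).
a, b, c = [], [], ["entry1", "entry2"]
merged = list(concat_slices([a, b, c]))
```

With this structure, the entries of part C are still produced when A and B are empty.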
3.3.0
- Substantially reduced memory utilization for discover-sequence-variant (all modes).
- discover-sequence-variant could in some rare cases output the same base twice (when indels were extending prior to
the beginning of the read after equivalent indel region calculation). This fix improved indel performance when
training models with variationanalysis 1.3.3+.
- Initial work to develop models for genomic segments (see .ssi format and concurrent work in variationanalysis).
This is work in progress. Protobuf schema is in goby-io/protobuf/SegmentInformationRecords.proto
Models are developed in parallel with Keras (in goby3/python/dl) and DL4J (in variationanalysis).
- Updated genotyping model to state of the art (models/genotyping/1510204519948/, see evaluation results in the folder)
3.2.7
- Somatic output format: report predicted somatic allele in VCF.
- Variant.FromTo: defined SerializeID. This requires regenerating varmaps.
- sbi output: Set position and reference base on list copy. Fix for reference base being '\0' in sbi files.
- vcf-to-genotype-map: Fix VCF to varmap. Incorrect genotypes added prior to this commit (since refactoring
to use VCF reader from HTSJDK in version 3.2.6). Show better statistics when creating the map. Fix for
indels not imported in varmap.
- GenotypesOutputFormat: Complete rewrite. Fix VCF coding of het sites. Also, when using a model, now we check sampleCount,
in case the model does not use the matchesRef feature, because such models may return a default non-reference
base for sites with no coverage.
- Add usage to goby wrapper. Do not attempt to configure R unless the variable GOBY_USE_RJAVA is configured.
3.2.6
- Updated models for compatibility with latest code: genotyping model and somatic models are updated.
- Tested that models produced with variationanalysis (genotype and somatic) load in Goby and can be used
with the modes to generate VCF.
- Various bug fixes to last-to-compact mode. Bugs were triggered by output from more recent versions of Last than
tested previously.
- Discover-sequence-variations mode: fix VCF output for indels. Genotypes format mostly rewritten.
Was previously writing incorrect indels. Latest code produces VCF files tested for compatibility with RTG vcfeval.
- Discover-sequence-variations mode: Add minimum-P and stringent-P options to Genotypes output format.
- Rewrote VCFToGenotypeMapMode to use HTSJDK VCF parser. This should enable using BCF files as input as well.
- Fix for count of indels. The first equivalent indel region did not increment the count.
Counts on forward and reverse now match the number of supporting entries on each strand.
- Add supporting entry the first time an indel is created in a SampleCountInfo. The supporting entry was not set
on the first one.
- Apply count fixer to remove bases matching ref from list, when the mandatory filter has determined the base should
be removed. Previously was only removed from counts, but not from list of bases. One possible candidate for indel
performance problems we have tried to fix for a while.
3.2.5
- Fix issue with toProto that prevented using more than one sample for genotyping with goby.
- alignment conversion to goby: ignore missing MD tags (it is possible only some reads are missing them and we
still need to convert the other aligned reads).
- Upgrade goby to DL4J 0.8.0.
- fasta-to-compact: Do not use an assertion, but instead reset read index to zero and explain how to avoid the
problem.
- SBI format: add distance from start of read and end of read. Will be mapped to a density in next genotype mapper.
Should help variationanalysis models detect cases where end of alignment is fully contained within homopolymer region.
3.2.4
- Fix tally-reads mode.
- Some fixes to realignment of SNPs around indels.
- improvements to barcode remover (to trim bases from 5' end before removing barcode).
- Goby version now reports the commit that produced the distribution.
- Goby version, including commit now written to generated .sbi files.
- Introduce CommitPropertyHelper to record the specific commit that produced the version of Goby being used.
3.2.3
- Fix SNP bug in realignment around read insertion.
- Add queryPosition field to SBI output.
- Prevent the writing of sbi entries when AddTrueGenotypeHelper indicated the entry should not be added.
3.2.2
- Fix frequency of bases when indels are also present. Now correctly removes bases that
support the flanking sequence of the indel and do not double count.
- Many changes to how we store varmaps introduced to support indels (vcf-to-varmap).
The serialization format is incompatible with previous versions, so make sure you regenerate
varmaps from VCF.
- Adjust VCF output for compatibility with REF/ALT conventions. This makes it possible to measure
performance with standard tools such as RTG vcfeval (http://realtimegenomics.com/products/rtg-tools/).
- Keep counts of indels separately for forward and reverse strand.
- vcf-to-varmap mode: improved semantic of --chromosome-prefix option allows removing (e.g., -chr)
or adding (+chr) prefix to chromosome name.
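As an illustration of the --chromosome-prefix semantics described for 3.2.2, the following Python sketch interprets a `-chr` or `+chr` specification. It is an assumption-laden sketch, not the vcf-to-varmap code: the function name `apply_chromosome_prefix` is hypothetical, and the choice to leave a name unchanged when the prefix is already present (for `+`) or absent (for `-`) is an assumption.

```python
def apply_chromosome_prefix(name: str, spec: str) -> str:
    """Apply a '+prefix' or '-prefix' specification to a chromosome name."""
    if len(spec) < 2 or spec[0] not in "+-":
        raise ValueError(f"expected '+prefix' or '-prefix', got {spec!r}")
    op, prefix = spec[0], spec[1:]
    if op == "+":
        # Add the prefix unless it is already present (assumed behaviour).
        return name if name.startswith(prefix) else prefix + name
    # '-' removes the prefix when present, otherwise leaves the name as is.
    return name[len(prefix):] if name.startswith(prefix) else name


removed = apply_chromosome_prefix("chr1", "-chr")
added = apply_chromosome_prefix("1", "+chr")
```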
3.2.1
- fasta-to-compact: fix a bug introduced on 10/6/2016 which created negative read entries.
- Catch a number of exceptions that can be thrown by HTSJDK when processing BAM files. Exceptions
are caught so that an error on one alignment does not interrupt processing of an entire alignment.
Errors are shown in log.
- vcf-to-genotype-map mode now supports (b)gzipped vcf input.
- vcf-to-genotype-map: fix bug that manifested itself when the vcf had a single genotype field.
- vcf-to-genotype-map: add chromosome-prefix argument to help import VCF where the chr prefix is missing.
3.2
- Remove memory leak when reading SAM/BAM files. This was the likely cause for running out of memory error in
compression benchmarks (had nothing to do with compression but with the conversion of SAM/BAM to goby representation).
- Disabled tests that could not succeed anymore (because of choices we made in Goby 3, such as lack of auto-upgrade
for alignments produced with Goby 1 and 2.)
- BAM/CRAM support. Added an option to bypass the header check on SO:COORDINATE. Use
-x HTSJDKReaderImpl:force-sorted=true to force Goby to consider an alignment sorted.
- SBI format: add ability to add true labels while writing the file. Add support for downsampling sites without
variants.
- Genotype format: reorganization to support calling with deep learning models trained with variation analysis.
3.1
- Reorganize model prediction to facilitate installing new versions of the variationAnalysis jars.
Goby 3.1 is now compatible with variationanalysis 1.1.1.
- Replace models with versions trained with variationanalysis 1.1.1.
- Add somatic mutation models trained with whole genome data (ICGC GoldSet).
3.0.0
- Support reading BAM alignments directly with Goby APIs.
- Support probabilistic models for calling somatic variations, trained with deep learning.
2.3.6
- Improve performance of realignment around indels when processing RNA-Seq reads. Previous versions of Goby had
scalability issues and kept data around from previous chromosomes. This was OK when processing DNA-Seq inside GobyWeb,
which splits data into genomic slices, but not when trying to process one or more RNA-Seq alignment files.
Performance has also been dramatically improved by fixing a bug on indel equality.
2.3.5
- Add a mode to infer sex of samples from data (tested on exome data). Useful as quality control to check that the
data you get are consistent with what is known about the samples. See --mode infer-sex. Works
faster on sorted alignments where the index is used to jump quickly to the human sex chromosomes.
- Prevent AbstractAlignmentToCompactMode from printing more than 10 warnings if quality scores are not available in
an alignment.
- suggest-position-slices: fix a bug that caused some slices to overlap. Found with a job with hundreds of
alignments, so not common.
2.3.4.1
- Add an option to the fasta-to-compact mode that will convert a set of files and concatenate the result
to a single compact-reads file (see new --concat option).
- Add a mode to test that the connection from Goby to R is working (requires JRI and R built
with shared library support). The mode is called test-r-connection (tcr).
- Restore STRICT_SOMATIC filter.
- Close files opened when loading Goby Alignment header and index files. This fixes a "too many open files" error
that could occur when loading hundreds of alignments simultaneously.
- Allow lenient import mode for TSV files. This makes it possible to convert TSV files to lucene.index when
they have been created with Goby in the past with a \t character as last character of the column line.
- Fix a bug that caused some slices to occur within annotations, despite the --annotation option being given
on the command line. The problem was that the chromosome index was not obtained from the genome and was set
to zero, always.
2.3.4
- Optimize the speed of genotyping when some sites have very high coverage (>500M bases).
Now sub-sampling to keep a random set of 10,000 bases for such sites. Expose the default
sub-sample size with a dynamic option called sub-sample-size in IterateSortedAlignmentsListImpl.
(-x IterateSortedAlignmentsListImpl:sub-sample-size <int>)
- LastToCompact mode now supports the import of paired end alignments produced by Last's last-pair-probs.sh.
- LastToCompact mode now supports the import of quality scores (lastal must be done with -Q1 since the
import assumes Phred quality scores on the q lines).
- Add two methods to AlignmentReader to determine the minimum and maximum genomic locations represented
in the reader. This is useful when suggesting slices to split a set of alignments. This commit includes
a fix for possible null start or end positions in slices generated with suggest-position-slices.
- Fix a problem with run-in-parallel where some threads would never finish when they do not detect
the keyword. Now indicate that the thread finished so that others can start when the processing
completes.
- reads-file-stats: remove any path from basename in the output.
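The sub-sampling of very high coverage sites introduced in 2.3.4 can be sketched as follows. This is a minimal Python illustration, not the code of IterateSortedAlignmentsListImpl: the function name `sub_sample`, the use of uniform sampling without replacement, and the optional `rng` parameter are assumptions. Only the default of 10,000 comes from the entry above.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sub_sample(bases: Sequence[T], sub_sample_size: int = 10_000,
               rng: Optional[random.Random] = None) -> List[T]:
    """Return all bases, or a uniform random subset when there are too many."""
    if len(bases) <= sub_sample_size:
        # Sites at or below the limit are kept unchanged.
        return list(bases)
    if rng is None:
        rng = random.Random()
    # Sampling without replacement keeps exactly sub_sample_size bases.
    return rng.sample(bases, sub_sample_size)


# A hypothetical site with 50,000 observed bases is reduced to 10,000.
site = ["A"] * 30_000 + ["C"] * 20_000
kept = sub_sample(site, rng=random.Random(42))
```

Passing a seeded `random.Random` makes the selection reproducible, which can help when comparing runs.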
2.3.3
- IterateSortedAlignmentsListImpl: Use a WarningCounter to limit warnings to 10 instances. This is needed to
avoid writing Gb of log output when the threshold is met.
- discover-sequence-variants somatic output: Make it possible to run a simple trio design by removing the
requirement for a germline sample.
- discover-sequence-variants somatic output: Earlier versions were reporting somatic variation candidates
when two parents were homozygous and the somatic sample was Het (the Fisher p-value with each parent is
very significant in this case, but does not indicate a somatic change). This also improves q-values because
there are fewer results that need to be corrected.
- discover-sequence-variants somatic output: Add an error message when a sample is mis-spelled in the covariates
file.
- Refactor code base to keep base counts for forward and reverse strands separately in SampleCountInfo.
- Normalize somatic priority score by number of mapped reads, and number of parents and germline samples used in
the calculation.
- Add a StrandBiasFilter in somatic analyses. The filter rejects variations that are not represented on both
strands when at least j reads support the variation. The value of j is set to 9 by default, so a variation with
10 bases needs to have at least the two strands represented.
- Remove candidate somatic variations that can occur when the germline samples have less coverage than the
somatic sample. Now require at least twice as much coverage in the somatic sample as the minimum coverage
in the germline samples.
- Add a STRICT_SOMATIC filter that flags genomic sites where some bases appear in support of the variation
in the parents or germline samples. Please note the VCF spec semantic: PASS indicates that all filters passed.
This means that lines with the STRICT_SOMATIC value in the FILTER column failed that test.
- Fix a bug in FDR mode that would not handle vcf files with non default FILTER values.
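The StrandBiasFilter rule described for 2.3.3 can be sketched in Python as below. This is a hypothetical illustration, not the Goby filter: the function name `fails_strand_bias` and its arguments are invented, and the boundary is read literally from "at least j reads" (support >= j), which is an assumption about the exact comparison used.

```python
def fails_strand_bias(forward_count: int, reverse_count: int, j: int = 9) -> bool:
    """Return True when a well-supported variation is seen on one strand only."""
    support = forward_count + reverse_count
    if support < j:
        # Too few supporting reads to judge strand representation.
        return False
    # With enough support, both strands must contribute at least one read.
    return forward_count == 0 or reverse_count == 0


one_strand_only = fails_strand_bias(10, 0)   # 10 bases, forward strand only
both_strands = fails_strand_bias(6, 4)       # 10 bases, both strands
```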
2.3.2
- run-parallel-mode now supports paired input files.
- fasta-to-compact: add --force-quality-encoding option to force the quality values within the specified
encoding range.
- suggest-position-slices: fix problem where first slice of genome was omitted from output (with new split
by number of bytes option introduced in 2.3).
2.3.1
- Fix for https://github.com/CampagneLaboratory/goby/issues/3
- Upgrade commons-io and dsiutils to latest jar versions. Log messages when scanning reads file with cfs mode.
- DistinctValueCounterBitSet: now grows to biggest size at construction time.
- Fixed a performance problem. When reading a large reads file (>10GB), performance of ReadsReader would degrade
over time. This was due to caching of data in static protobuf methods of ReadCollection. We now create a
builder instance that gets garbage collected when it is no longer used. This fixes a subtle performance
problem. The same fix has been applied to alignment readers.
2.3
- concatenate-alignments mode: add ability to restrict output to a genomic slice (see -s and -e options).
- API change: AlignmentSliceHelper makes it easier to parse and process genomic slices for sets of alignments.
- concatenate-alignments mode: now transfers read groups to output in the same way that non-sorted concat does.
- concatenate-alignments mode: Add a mechanism to override/define read groups/read origin info on the fly when
reading alignments that did not include them. Coupled with changes to compact-to-sam, this makes it possible
to get BAM files with read groups directly from Goby alignments.
- compact-to-sam mode: fixed output of read groups, which were not correctly written for platform, platform unit,
and library.
- suggest-position-slices: add --restrict-per-chromosome option. When this switch is provided, slices will be
restricted to start and end on the same chromosome. This is useful to produce intervals to give Mutect,
for instance.
- Trim mode: add --trim-left --trim-right parameters to control trimming of specific sequence extremities.
- Trim mode: add --verbose flag.
2.2.1
- FDR mode: add ability to read groups from VCF file and adjust columns/fields marked as p-value. Mark adjusted
columns with group q-value.
- Somatic variation output format: annotate somatic p-value column with 'p-value' group. Fix the type of the p-value
column to be a number (was String in release 2.2).
- Somatic variation output format: handle unrecognized sample-ids in the parents column.
- discover-sequence-variants mode: add assertion to give hint to user that syntax is incorrect for the -s and -e options.
- compact-file-stats mode: print progress when scanning reads files. Use a buffered reader to improve read file
parsing performance.
- discover-sequence-variants: adjust multiplier for left-over filter for somatic variations output format.
- discover-sequence-variants: Add a new filter to remove indels at a site where a sample shows lots of distinct
possible indels. Indels at these sites are very likely to be artefactual. We count the number of samples where
three distinct indel genotypes are seen. If more than 1/4 of the samples have likely indel artifacts, we remove
all indel candidates at the site. maxIndelPerSite:Maximum number of distinct indels at a given genomic site.:1
Additional filter: fractionOfSamples: Maximum fraction of samples that can have an indel candidate for the indel
to be considered (indel candidates that occur in many samples are more likely to be spurious).:0.25
This filter is added to the somatic variations output format. See dynamic options for this filter with --x-help
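The per-site indel filter added in 2.2.1 can be sketched as follows. This is a Python illustration under stated assumptions, not the Goby filter: the function name, the dictionary representation of indels per sample, and the parameter name `distinct_indel_threshold` are hypothetical. The threshold of three distinct indels and the 1/4 fraction follow the description above.

```python
from typing import Dict, Set


def remove_indels_at_site(indels_by_sample: Dict[str, Set[str]],
                          distinct_indel_threshold: int = 3,
                          fraction_of_samples: float = 0.25) -> bool:
    """Return True when all indel candidates at the site should be removed."""
    if not indels_by_sample:
        return False
    # Count samples that show many distinct indels at this site.
    suspicious = sum(1 for indels in indels_by_sample.values()
                     if len(indels) >= distinct_indel_threshold)
    # Remove everything when more than the given fraction looks artefactual.
    return suspicious / len(indels_by_sample) > fraction_of_samples


# Hypothetical site: two of four samples show three distinct indels.
site = {
    "s1": {"-A", "-AA", "+T"},
    "s2": {"-A", "-AA", "-AAA"},
    "s3": {"-A"},
    "s4": set(),
}
drop_all = remove_indels_at_site(site)
```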
2.2
- Remove threshold effects when calling genotypes in several samples. Modified the filters to not remove bases in
specific samples when the genotype survived filters in at least another sample (previous versions reported these
threshold edge effects as differences, which could be confusing, this version simply shows the marginal raw base
counts in samples where the genotype could have been filtered by a filter, which makes it easier to compare the
strength of the genotype support across samples). This adjustment was done for both base genotype and indel genotypes.
- LeftOverFilter: now uses minVariationSupport as minimum threshold.
- Mode suggest-position-slices: add option number-of-bytes to suggest slices with a uniform number of compressed
bytes. This option aims to provide more balanced slices in cases where the genome has very non-uniform coverage
by position. With this option, the number of slices is determined to yield slices that need to decompress about
the number of bytes indicated on the command line.
- Framework API change: introduce class PositionToBasesMap<T> to use as type for positionToBases. The class provides
methods to get the range of positions described in the map. This unfortunately requires changes to all clients/
implementations of IterateSortedAlignments<T>.
- Mode discover-sequence-variants: Fix various problems that prevented reporting genotypes for deletions (i.e., C/-).
- Fix a potential NPE in GroupAssociations when samples are null.
- Fix for issue #2, see https://github.com/CampagneLaboratory/goby/issues/2
- Expose comparator in SortedAnnotations.
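The idea behind the number-of-bytes option of suggest-position-slices can be sketched as a greedy grouping of consecutive compressed chunks. This Python sketch is an assumption, not the algorithm used by Goby: the function name `slices_by_bytes` and the representation of an alignment as a list of compressed chunk sizes are hypothetical.

```python
from typing import List, Tuple


def slices_by_bytes(chunk_sizes: List[int], number_of_bytes: int) -> List[Tuple[int, int]]:
    """Group consecutive chunks into slices of roughly number_of_bytes each.

    Returns (first_chunk, last_chunk) index pairs, both inclusive.
    """
    if number_of_bytes <= 0:
        raise ValueError("number_of_bytes must be positive")
    slices: List[Tuple[int, int]] = []
    start, accumulated = 0, 0
    for index, size in enumerate(chunk_sizes):
        accumulated += size
        if accumulated >= number_of_bytes:
            # Close the current slice once it reaches the byte target.
            slices.append((start, index))
            start, accumulated = index + 1, 0
    if start < len(chunk_sizes):
        # Remaining chunks form a final, possibly smaller, slice.
        slices.append((start, len(chunk_sizes) - 1))
    return slices


suggested = slices_by_bytes([40, 60, 100, 30, 30, 50], number_of_bytes=100)
```

Slices built this way cover more positions where coverage is low and fewer where it is high, which is the balancing effect the option aims for.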
2.1.2
- Upgrade xstream to version 1.4.3. This fixes the compatibility problem seen when running goby 2.1.1 with java 1.7+.
Goby 2.1.2 should run with Java 1.7+, but more testing will be needed to rule out other migration problems. If you
are running JDK 1.7+ please let us know any issues you encounter.
- Fix VCFParser issue https://github.com/CampagneLaboratory/goby/issues/1. The issue could be triggered when the FORMAT
column changed from line to line.
- VCFWriter: improve support for VCF group associations. The Goby VCF parser makes it possible to associate columns
to groups (these associations are written in a ##FieldGroupAssociations field).
- Methylation rate VCF output: mark the context column with group 'indexed'.
- Do not try to upgrade alignments when reading the header to concatenate permutations. This is not necessary and can
open too many files when we are trying to concatenate alignments.
2.1.1
- Add extract-splicing-events mode. This mode is used by GobyWeb 1.9 to extract splicing events from spliced
Goby alignments (generated either by GSNAP or STAR at this time).
- Trim mode:Fix bug that caused quality scores to be duplicated (the bug triggered the assertion that checks
that sequence length equals quality length).
- Trim mode: Some sequence must remain after trimming to append to the output.
- Fix bug in alignment-to-annotation-counts when counts would be zero for samples whose name contained a
period '.' The code was incorrectly stripping alignment extensions twice.
- alignment-to-annotation-counts: add comparison description to t-test statistic column name (e.g. t-test[A/B] rather
than t-test). This change makes it possible to retrieve the t-test p-values when more than one comparison is
performed.
- Fix a bug where RandomAccessAnnotations could return results on a different chromosome.
- Add annotation loading test and fix for when annotation file is truncated. Goby now loads annotations up to
the truncation and logs truncated lines.
- Correct calculation for fold-change-magnitude column in goby diff exp mode. Previous calculation under-estimated
magnitude when comparing low rpkms.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension
(this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not
return 404 errors for missing content).
2.1
- Improve compression of hybrid-1 codec by about 8% on average at similar speed. You can enable this improvement with
option -x AlignmentCollectionHandler:symbol-modeling=plus. This option will be made the default in a future release.
It is not currently the default since Goby 2.1 has not been integrated into IGV and will need time to propagate from
IGV dev to production builds.
- Remove import of NH:i bam tags as read-origin-index, since the NH tag seems to contain different types of data
depending on the aligner that produced the alignment.
- compact-to-sam mode: fix bug where bam tags containing a colon character (:) would be truncated after the first
colon. Thanks to Vadim Zalunin for reporting this problem.
- compact-file-stats: Add a feature to scan only alignment headers.
- VCFParser group associations: Make it possible to lookup an INFO column by either INFO/colname or colname.
- NonAmbiguousAlignmentReader: fix an NPE when reading alignments where all entries have the ambiguity field.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension
(this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not
return 404 errors for missing content). Thanks to Jim Robinson and Helga Thorvaldsdottir for reporting this issue.
2.0.1
- Release Goby C/C++ APIs under the LGPL license version 3 to make it possible for companies to incorporate support
for Goby formats in their tools. Thanks to Collin Hercus for the suggestion. Please note that part of the Goby
Java APIs are already licensed under the LGPL (anything packaged under the Goby-io.jar file).
- C++ API: Support to set placed unmapped (i.e., mate that does not map is recorded with the read that mapped)
and clipleft/clipright with quality scores.
- Fix problem when using a genome backed by a samtools/picard faidx file. In some cases, read bases would be returned
shifted by one position. Thanks to James Bonfield for reporting this problem.
- SAM/BAM tags start at column 12, index 11. --preserve-all-tags could skip the first tag on some datasets (e.g.,
dataset where the first tag was not a MD:Z or RG:Z). Thanks to James Bonfield for reporting this problem.
- Introduce interface for ReadsWriter. Introduce mock implementation to write reads to text. This is useful to write
more intelligible JUnit tests.
- mode sam-to-compact now supports option --read-names-are-query-indices to indicate that the read names are integers
(typically produced by compact-to-fasta from a chunk of a large file).
- Fix a bug in reformat-compact-reads which did not trim quality scores for paired end reads correctly.
2.0
- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- Added a mode sam-comparison to compare a source SAM/BAM file with one that was generated after sam-to-compact then
compact-to-sam.
- Refactor AlignmentWriter to introduce an interface and make it easier to create facades that modify the behaviour
of the default writer. For instance, such a facade is BufferedSortingAlignmentWriter, which keeps a number of entries
in memory to re-sort these entries by genomic position. This feature is used when importing already sorted SAM/BAM
files to create sorted Goby alignments and the files contain spliced alignments that would cause mis-ordering during
conversion.
- Make default chunk-size dependent on the type of chunk codec used. This is useful because hybrid compression does
better with larger chunk sizes (default chunk size for hybrid is 30000, 20000 for bzip2 and 10000 for gzip). The
default chunk size can be overridden with -x MessageChunksWriter:chunk-size=int
- Add ability to preserve SAM/BAM read groups. Read groups are automatically preserved if present in the input BAM file.
The concatenate mode automatically reassigns read_origin indices (see field read_origin_index) to prevent conflicts
when Goby files from different origins are concatenated. The approach we use is to keep the most specific read origin
information, and let the client decide what origins/groups are equivalent given the type of analysis at hand.
Read groups are supported by the hybrid codec (and therefore stored very efficiently), are imported from BAM with
sam-to-compact and are exported back to SAM/BAM with the compact-to-bam mode.
- Add ability to preserve all BAM attributes during import and export. Use --preserve-all-tags in mode sam-to-compact
to enable this.
- Add ability to preserve all quality scores. Use --preserve-all-mapped-qualities in mode sam-to-compact.
- Supports bzip2 compression in fasta-to-compact mode and sam-extract-reads (use the -x MessageChunksWriter:codec=bzip2
dynamic option).
- Renamed SortMode to Sort1Mode. Renamed SortLargeMode to SortMode.
- Added SortLargeMode which can sort compact alignments of any size, multithreaded.
- Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode
throwing at it various input BAM files, sorted or not and fixed the bugs we found. For instance, the --sorted option
would not work in some 1.9 versions of Goby after samtools/picard changed the semantic of the record comparator Goby
relied upon to verify the input was indeed sorted by position. This made it impossible to convert already sorted BAM
files as sorted Goby alignments.
- Moved error messages produced when parsing the command line of a mode to after usage. This is a simple change that
will make it easier to diagnose problems on a command line without having to scroll back up the console.
- Prevent logging when the log4j system has not been configured. For some reason, LOG.isDebugEnabled can return true
when the logging system is not initialized. For SamHelper, this means calling String.format millions of times to
create debug output that is never shown. This change dramatically improves the performance of the sam-to-compact mode
when logging is not properly configured.
- Refactor dynamic options with a central registry, and make GobyDriver handle option parsing.
This removes duplication of code parsing for each mode that would need dynamic options.
- methylation region can now estimate empirical p-values. Empirical P-values require biological replicates in at least
one of the groups under analysis. Two passes over the data are required. In the first pass, the empirical null
distribution is observed by comparing pairs of samples in the same group. In the second pass, this distribution is
used to estimate the p-value of observing the between group differences. Such empirical p-values can control FWER
in the strong sense.
- Support empirical p-value for individual bases (VCF output). Write a DMR INFO field that stores how many significant
sites were found in a moving window that ends at the site (significance is judged according to a configurable
threshold on the empirical p-value).
- New empirical-p mode to estimate p-values from data in text files. This makes it easier to derive p-values for
simulated data or counts generated by other tools than Goby.
- Make it possible to open Goby alignments through HTTP. Simply specify a URL as a basename as argument to the goby
tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files
currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
- Fix issues with the barcode-decode mode. Add support for processing fasta/fastq files.
- vcf methylation format: removed space in name of C and Cm group INFO fields.
- Add a draft implementation of random access sequence interface that can read a fasta file indexed with faidx.
- Introduce chunk codecs for protocol buffer encoded collection messages (supports both reads and alignments).
- Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and
to limit the number of alignment entries to output (-n).
- The RandomAccessSequenceCache had problems with bases that weren't G/A/T/C/N. Such bases would be skipped silently,
causing rare, but potentially significant, problems (such as on human chr 3 of the 1000g genome reference where a
R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for bases immediately following
the offending character. Now bases other than G/A/T/C are stored as N and maintain the position of the following
bases. Please note that the problem was in a library used by RandomAccessSequenceCache; we updated the library in
this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
- last-to-compact: add option to substitute some bases with others in the aligned read.
- Add test and fix for a bug where iteration went back to the start of the alignment file, even though the alignment
iterator was created for a slice of the input. The problem only affected the IterateAlignments class because it was
calling reposition(0,0) and the method did not enforce slice limits.
- The code base was simplified by removing the now obsolete align mode.
- Fix a problem where sample names with several dots were stripped of too many extensions. For instance, a.b.c.entries
would be reduced to a, which could be non-unique across the remaining samples. Problem reported by Fang Fang in her
data on GobyWeb.
- DistinctIntValueCounterBitSet now uses LongArrayBitVector as its bit set implementation. The java BitSet implementation
was found to throw java.lang.ArrayIndexOutOfBoundsException for indices that should fit easily in a bit array (e.g.,
2,080,948 which can be stored with about 230 MB).
- AlignmentEntry field insertSize is now stored in protobuf with sint32 rather than uint32 since negative values can be
stored in this field.
- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- The mode sample-quality-scores now supports .sam, .sam.gz, and .bam files to make a guess at the scale of
the quality scores contained in the file.
- Added a mode sam-comparison to compare a source SAM/BAM file with one that was generated after sam-to-compact then
compact-to-sam.
- Fixed a problem with concatenate-compact-reads that previously transferred only specific fields of a read to the
output file. concatenate-compact-reads now transfers all fields (including pair sequence and quality score).
- version mode now prints an official version number if the jar contains a VERSION.txt file.
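The two-pass empirical p-value estimation described in the entries above can be illustrated with a minimal sketch.
This is not Goby code: the statistic (absolute difference), the (r + 1) / (n + 1) estimator, the function names and
the sample values are all assumptions chosen for illustration.

```python
from itertools import combinations


def null_distribution(groups):
    """Pass 1: collect |a - b| for every pair of replicates within the same group."""
    null = []
    for samples in groups.values():
        for a, b in combinations(samples, 2):
            null.append(abs(a - b))
    return sorted(null)


def empirical_p(observed, null):
    """Pass 2: fraction of null statistics at least as extreme as the observed one.

    Uses the (r + 1) / (n + 1) estimator (an assumption) so the p-value is never zero.
    """
    r = sum(1 for x in null if x >= observed)
    return (r + 1) / (len(null) + 1)


# Hypothetical methylation rates at one site: three replicates in group A, two in group B.
groups = {"A": [0.10, 0.12, 0.15], "B": [0.60, 0.58]}
null = null_distribution(groups)
between = abs(sum(groups["A"]) / 3 - sum(groups["B"]) / 2)
p = empirical_p(between, null)
```

With so few within-group pairs the smallest reachable p-value is 1 / (n + 1), which is one reason replicates are
required in at least one group.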
1.9.8.3.1
- Fix a bug related to writing paired end alignments in the Gsnap parser (C API)
1.9.8.3
- Added a methylation_region format capable of averaging methylation rates for different cytosine contexts over
arbitrarily defined regions.
- Added a diploid genotype filter to use when calling genotypes in a diploid genome.
- discover-sequence-variants format compare_groups: Write distinct fisher p-values for each comparison pair
- Fix FDR mode output for TSV format. Make option --column-selection-filter work.
- Fix bug that prevented methylation vcf output from writing any line.
1.9.8.2.1
- Fix bug in GenotypesOutputFormat that caused GenotypesOutputFormat to throw an exception when processing some sites.
1.9.8.2
- Make it possible to activate indel calling without recompilation. Mode discover-sequence-variants now accepts
the boolean argument --call-indels true/false.
- Preliminary support for calling indels with discover-sequence-variants. Candidate indels are now written
in the formats that use GenotypeOutputFormat (e.g., genotypes, compare_groups, allele_frequency).
The method of Krawitz et al is used to determine the equivalent indel region for each possible candidate.
After possible realignment, and filtering to remove possible errors, EIR are reported with their frequencies.
Please be advised that the VCF spec(s) are rather vague and as a result often interpreted differently by different
programmers. This is especially true of the parts of the specification(s) that describe how to report indels. As a
result of this situation, you might run into problems when trying to load indel-containing VCF files generated
with Goby into other tools.
- vcf-subset: Add ability to exclude positions at which all samples match the reference.
- Add a replacement for the VCF-tools VCF-subset program. The Goby tool is orders of magnitude faster.
- Improve vcf-compare mode. Now has the ability to provide random samples of the positions that differ between the
files being compared. Random samples are calculated for each kind of difference (missing from one file, missing
one allele, two alleles, different genotypes)
- vcf-compare now outputs Ti/Tv ratios for each sample in input file (in the output file only).
- Fix scalability problem with local realignment code. Local realignment around indels would slow down as more entries
were processed. This is now fixed so that speed is constant across large alignments.
- Fixed index file writing. In some conditions, parts of the alignment past the 2GB mark were not accessible
with skipTo when reading files larger than 2GB. Use the upgrade mode to fix old alignments at a specific time, or
use Goby as usual to have alignments upgraded on the fly.
- Add mechanism to upgrade/fix large alignments indices with Goby 1.9.8.2. The upgrade mechanism uses concatenate
alignment to rewrite an alignment index file if the size of the entries file exceeds 2GB. This is rather slow as
the process reads and writes large alignments, to produce the new index file. While slow, upgrading is still faster
than aligning the reads again. The process also requires approximately double the alignment size as the new alignment
files are written. Alignments smaller than 2GB are quietly ignored since they were not affected by the bug.
- Codecs: Add support to decode alignments with a codec in AlignmentReader.
- Improved ReadsReader to find a suitable decoder when several codecs exist.
- Prevents local realignment from running out of memory when processing positions where clonal reads create huge peaks.
- Make filterIndels remove from sample count info object, not just from the list of bases.
- Fix VCF genotypes that could look like 0/0/1/1 to be 0/1 (seen with indels only).
- only write allele base count in VCF BC field when the count is not zero (useful with indels).
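The Ti/Tv ratio reported by vcf-compare in the entries above is the number of transitions (A<->G, C<->T) divided by
the number of transversions. A minimal sketch of that calculation follows; the function name and the handling of an
empty denominator are assumptions, not a description of Goby's implementation.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return transitions / transversions for (ref, alt) single-base pairs.

    Returns None when no transversion was seen, to avoid dividing by zero.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


# Three transitions and one transversion.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")])
```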
1.9.8.1
- Discover-sequence-variants: add ability to describe zero, one or more group comparisons. Syntax is A/B,A/C to compare
group A to B and group A to C. Additional pairs can be described, separated by commas.
- Extend methyl-stats mode to estimate fraction of methylated cytosine observed in CpX contexts.
- Discover-sequence-variants, genotype format: Fix a bug where alleleSet was cleared in each sample, rather than before
any sample is processed. This made it possible for some positions to be ignored erroneously when samples were given
in a specific order on the command line. Specifically, positions would be ignored if they were not typed (i.e., not
enough good bases) in the last sample given on the command line.
- Optimize merging of TMH when the files are large (>100M compressed).
- Fixed a major bug where NonAmbiguousAlignmentReader would stop iterating after encountering an ambiguous alignment.
Alignments with shorter reads were much more likely to be affected.
- Fix sam-extract-reads for paired-end BAM files. Each BAM file contains both pairs. To convert to compact reads, the
input BAM file must be sorted by read name, since this is the only way we can put the pairs back together in one
Goby record.
- Mode discover-sequence-variants now limits the maximum coverage per site in order to limit the impact on peak memory
of a few very high coverage sites. The default setting is set to 500,000x and can be changed with
option --max-coverage-per-site
- Switched IndexedIdentifier to an AVLTreeMap to help scale when we have millions of elements to compare in diff exp.
- Fixed a subtle bug in IterateSortedAlignment that would cause iteration to return partial results for some alignments
when restricting results to a window. The problem would manifest more clearly for alignments against genomes where
contigs have smaller indices than chromosomes and chromosome sequences are listed in non-increasing order (e.g., chr
16 appearing before chr 10) and restricting to window from chr16 to MT (which should include chr 10 in that genome,
but returned no result on chr 10).
- Trim mode: Fix exception that could occur when trimming reads with no quality scores.
- Change goby script to request the bash shell explicitly. This is needed on systems where /bin/sh is not a synonym for
bash. Thanks to Martin Frith for catching this on Ubuntu.
- Change how targetLengths are concatenated. It turns out that last-to-compact needs alignment entries matching
the target to record the length in the alignment. We need to keep any length seen when we concat because the first
chunk may just not have the length for the remaining parts.
- Improved logic for --paired-end filename support in the fastaToCompactMode.
- Fix an NPE in suggest-position-slices that could occur with very small alignment files.
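The group comparison syntax introduced in this release (A/B,A/C) can be parsed along the following lines. This is a
sketch only: the function name and the error handling are assumptions, and Goby's own parser may behave differently
on malformed input.

```python
def parse_group_comparisons(spec):
    """Parse 'A/B,A/C' into [('A', 'B'), ('A', 'C')]; an empty string yields no comparison."""
    pairs = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        parts = token.split("/")
        if len(parts) != 2 or not all(p.strip() for p in parts):
            raise ValueError(f"expected GROUP1/GROUP2, got {token!r}")
        pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


comparisons = parse_group_comparisons("A/B,A/C")
```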
1.9.8
- The BaseStats utility was transformed into a Goby mode (base-stats). The new mode has the ability to tally occurrence
of CpX motifs in reads. Useful as a proxy to the amount of unconverted Cs in bisulfite converted reads.
- The methyl-stats mode takes a VCF file produced by Goby methylation output and a genome and calculates various
statistics about the distribution of fragment lengths between CpG interrogated by the assay.
- FDR mode now accepts --column-selection-filter to select columns matching string.
- Proof of principle that protocol buffer can seamlessly cohabit with data-specific compression schemes. The
--codec option on fasta-to-compact is introduced to activate compression of reads when writing compact reads.
The codec provided (called read-codec-1) achieves about 10-12% better compression of read files than pure
protocol-buffer encoding. This read-codec-1 codec stores bases and quality scores with an arithmetic coder in
a protocol buffer field called 'compressed_data'. Please note that we do not recommend using this option at
this stage since the C/C++ APIs cannot load data encoded with this codec at this time.
- Add ability to run alignment-to-annotation-counts on a specific genomic region (see --start-position and
--end-position).
- alignment-to-annotation mode has a new option (--remove-shared-segments). When active, this option will remove
annotation segments when they partially overlap with more than one primary annotation id. When this option is
selected and the primary id is a gene, and secondary id is an exon, the mode will remove exons that are associated
with several genes. When the option is used with transcript id as primary and exon as secondary, exons are removed
that are shared across different transcripts of the same gene.
- mode base-stats now supports multiple input files.
- VCFParser will now set column type when reading TSV files by using TabToColumnInfoMode to scan the actual values
stored in the TSV file. The first time this is done for each file, a .colinfo file will be created and then
used if the file is read again by VCFParser in the future.
- Added the mode tab-to-column-info to read the data from TSV files to determine the column types
(double/integer/string). Write a .colinfo file detailing the column names and types.
- Upgraded to SAM JDK 1.52
- Modes sam-to-compact and sam-extract-reads now set SILENT validation before reading file header. This is required
because the SAM JDK validation rules are more stringent than required by the specification. This means that
some valid SAM files (per the SAM spec) cannot be parsed without error when the strict validation is used.
- Fixed a bug with ReadsQualityStatsMode when SampleFraction == 1.0d, such as for files with a small
number of reads.
- Mode sam-extract-reads now supports extracting reads from paired samples. See the new options --paired-end
and --pair-indicator. These options work similarly to the fasta-to-compact options.
- Fix problem with suggest-position-slices that could create empty slices.
- Fix bug in discover-sequence-variants methylation format that wrote methylation rates only for up to two samples.
- Fix bug in alignment-to-counts that caused problems with large alignments.
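The column type detection performed by tab-to-column-info (double/integer/string) can be pictured as scanning every
value and keeping the widest type seen per column. The sketch below illustrates that idea only; the function names,
the widening order and the sample data are assumptions, and the .colinfo file format is not reproduced here.

```python
import csv
import io

# Widening order: a column is only as narrow as its widest value.
RANK = {"integer": 0, "double": 1, "string": 2}


def value_type(value):
    """Classify one cell as 'integer', 'double' or 'string'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def infer_column_types(tsv_text):
    """Return {column name: widest type seen}; columns without values stay 'integer'."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(rows)
    types = {name: "integer" for name in header}
    for row in rows:
        for name, value in zip(header, row):
            seen = value_type(value)
            if RANK[seen] > RANK[types[name]]:
                types[name] = seen
    return types


# Hypothetical three-column TSV file.
types = infer_column_types("id\tp-value\tgene\n1\t0.05\tTP53\n2\t1e-3\tBRCA1\n")
```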
1.9.7.3
- Fix allele frequency format to write genotype first in FORMAT per vcf spec.
- Add new INFO fields in compare group vcf format to show allele counts in each group.
- Support short versions of mode names; for example, "compact-file-stats" has the short mode
name "cfs". There is a default short mode name generation implementation in
AbstractCommandLineMode.getShortModeName() but each mode class can override this method in the case
of short mode name collisions. In the case of collisions, the command line parser will not offer/accept
ANY short mode names for the classes in question.
- SamToCompact: Generate sorted goby alignments when a sorted BAM file is provided as input (use --sorted
flag to activate this option). Thanks to Bradford Powell for the suggestion and draft implementation.
- Fixed a bug in tally-reads that was triggered by reads of different lengths. Thanks to Adrian Platts for
the bug report.
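The short mode name rule described above can be sketched as follows. The only fact taken from the entry is that
"compact-file-stats" maps to "cfs" and that colliding short names are rejected for every mode involved; deriving the
short name from the first letter of each hyphen-separated word is an assumption, and "concatenate-file-sets" is a
made-up mode name used to force a collision.

```python
from collections import defaultdict


def short_mode_name(mode_name):
    """Assumed default: first letter of each hyphen-separated word."""
    return "".join(word[0] for word in mode_name.split("-") if word)


def build_short_name_table(mode_names):
    """Map short name -> mode name, dropping every short name claimed by more than one mode."""
    claimed = defaultdict(list)
    for name in mode_names:
        claimed[short_mode_name(name)].append(name)
    return {short: names[0] for short, names in claimed.items() if len(names) == 1}


table = build_short_name_table(
    ["compact-file-stats", "concatenate-file-sets", "fasta-to-compact"]
)
```

Here "cfs" is claimed twice, so neither mode gets a short name, while "ftc" remains usable.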
1.9.7.2
- Fix realignment around indels bug that prevented reads from being realigned to the left in exome data.
Now correctly updates the start position of the moving window.
- Renamed AlignmentEntry.splicedAlignmentLink to AlignmentEntry.splicedForwardAlignmentLink and added
AlignmentEntry.splicedBackwardAlignmentLink so splice links can be both bidirectional and more than
two segments long. This change is included in the C/C++ APIs and makes it possible for GSNAP to write
splice information to Goby alignment files.
- FDR mode now supports reporting the top n hits irrespective of corrected q-value threshold (top n hits are
defined by the ranking produced by ordering the hits by increasing p-value, for the last column adjusted).
- Significantly reduced memory consumption when performing FDR BH adjustment on hundreds of millions of elements.
- VCFWriter now writes missing value '.' in ID, ALT and FILTER fields, as required by VCF 4.1 documentation
(http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41)
This change is required to read the files generated by Goby with the latest version of Tribble used in IGV EA.
- AlignmentToTextMode will now display splice information.
1.9.7.1
- alignment-to-counts now generates indexed base-level histogram files. Indexing makes it possible to jump quickly
to a new genomic location in IGV. This is especially useful when viewing coverage for tens of tracks.
- Filter out ambiguous reads from alignment-to-counts base level histogram output. Pre-1.9.7.1 behaviour can be
obtained by setting the argument --filter-ambiguous-reads to false.
- alignment-to-counts: also tried a new way to create base-level histograms from sorted alignment files.
This turns out to be about 3 times slower than the current approach. We still keep the new approach because it
should scale to any size alignment. Mode alignment-to-counts will use the new approach if an alignment is sorted
and has more than 50 million aligned reads.
- Filter out ambiguous reads from alignment-to-annotation-counts by default. Pre-1.9.7.1 behaviour can be obtained
by setting the argument --filter-ambiguous-reads to false.
- Add ability to switch off the recording of sampleIndex. This is useful when concat is just used to put pieces
of a large alignment back together after splitting reads for parallel processing.
- Do not print indices at the end of upgrade. This caused upgrade to fail on some alignments with an exception.
- Extended IterateAlignments to create alignment reader with a configurable AlignmentReaderFactory.
- Set the default normalization method for alignment-to-annotation-count to bullard normalization only.
- Fix a bug in VCFParser that affected parsing tab delimited files. Some files would be parsed with a tab in the
value of the last column, separating the values of the last two actual columns.
1.9.7
- Now using protobuf 2.4.1. Please upgrade your local version of protobuf if you are recompiling from sources.
- AlignmentWriter now correctly records Goby version in header upon close(). This fixes a problem when alignments
read from read-only files would fail upon trying a new upgrade.
- Optimized the performance of VCFParser on files with large number of columns. The VCF format seems designed
without performance in mind, so it is hard to come up with a reasonably fast implementation. The current
implementation of the Goby VCF parser can only process about 8,000 lines of compressed VCF per second on
a desktop machine.
- AlignmentEntry schema change: a new field sample_index holds the index of the alignment from which the
entry was read. This is useful when concatenating over multiple alignments and realigning reads that span
indels, to reliably track the alignment origin of each entry. The concatenation readers have been
modified to set sample_index accordingly. Please note that the activeIndex field of the sorted reader
is not a reliable way to identify the alignment of origin when realignment is active. Please use the
new sample_index field instead.
- We have added the capability to perform on the fly realignment around indels. This feature is available
in mode discover-sequence-variants and in concatenate-alignments. The feature is activated with the new
--processor realign_near_indels option. When the option is provided, a compressed reference genome must
also be given on the command line (with the --genome option). This will trigger realignment of reads in
regions where candidate indels are found by the aligner. The algorithm is very fast, in fact much faster
than previously described approaches and consumes a reasonable amount of memory (function of maximum
depth of coverage in the region where candidate indels are observed, but typically <2GB). Realignment
correctly removes artefactual SNPs that can be introduced when an aligner fails to align the read ends
properly through a read deletion. Please note that this version realigns read deletions. Realignment of
read insertions has not been implemented.
- Make it possible to open an alignment if the header file is present, but the entries file is missing.
This allows reading the header only, for instance when we need to load counts and have access to targetIds.
- Add mode to convert annotations to counts archive format.
- Add new coverage mode to calculate coverage stats over annotation regions. When annotation regions are
defined with capture regions, this mode outputs enrichment efficiency and depth of coverage for
specific proportions of captured sites.
The mode uses just .header and .count files and traverses count transitions. The algorithm used to iterate
through count transitions is very efficient (for instance it takes about 20 seconds to estimate coverage
stats for an alignment with ~20M aligned reads). Count files are produced with GobyWeb together with the
alignment or with the alignment-to-counts mode.
- Add CountBinningAdaptor, useful to bin counts on the fly at any resolution for display in IGV.
- Added ability to record total number of bases and sites seen in count archive.
- Added a new mode (file-to-attributes) to generate a sample attribute file suitable for loading in IGV.
Useful when files are named with the convention attr1-attr2-attr3.counts
1.9.6.1
- Patched VCF output for compatibility with VCF specification. Specifically, we now write . in the QUAL
field and write genotype as the first field in the methylation output format. Additionally, we only
write a VCF line if the site can be typed in at least one of the samples. These changes make Goby VCF
output compatible with the IGV 2.0 VCFTrack.
- Fix a bug in merge that could trigger an ArrayIndexOutOfBoundsException with some alignments.
1.9.6
- AlignmentReaderImpl now supports full random access to an alignment. Use reposition(ref,pos) followed
by skipTo(ref,pos) to obtain the first entry matching at location (ref,pos). Prior to 1.9.6, the
reposition method would not reposition to a location already visited forcing clients to close the
alignment reader and reopen it (this new behaviour should improve performance in IGV).
- The indexing logic used in versions of Goby up to 1.9.5 (inclusive) had subtle flaws. This could cause
the skipTo method to behave incorrectly for some alignments. For instance, if reads matched on target N
at a position larger than the length of target N+1, these reads would not be returned by skipTo.
Thanks to Alec Chapman for identifying these issues.
We have corrected the problem and added additional unit tests to check the behavior of the implementation
in various edge cases. A consequence of this change is that the new indexing logic requires recalculating
the .index data structure for alignments sorted and indexed with a version of Goby prior to 1.9.6.
We provide a new mode, goby upgrade, to perform these calculations and fix such alignments. To upgrade
alignments off-line, simply do:
goby 3g upgrade [files].
This command will upgrade each alignment corresponding to the filenames provided. It skips those alignments
produced by versions of Goby that do not require upgrading. The upgrade process creates a backup of the
files that are affected: .index and .header are backed to .index.bak and .header.bak respectively.
The upgrade process is relatively fast, in our tests we upgraded a 750Mb alignment file in 2'30".
- Version 1.9.6 will try to upgrade alignments on the fly to the new version of the index data structures.
- Detect when FastaToCompact is running in API mode versus command line. Do NOT do System.exit in API
mode and instead throw exceptions. Also, API mode doesn't run conversions in parallel but instead runs
them serially for easier exception catching.
- VCFParser now splits headers by tab instead of whitespace so column names that contain spaces
are read correctly.
1.9.5
- Determine alignment sortedness and index state from the header and by checking that the index file exists.
This allows recovering alignments when the index file was deleted. In such cases, sorting the alignment can
be done again, which is preferable to losing the alignment data.
- New mode simulate-reads will generate reads artificially against a reference sequence. We use this mode
to create simulated datasets of bisulfite converted reads or mutated reads and to test that Goby produces
the expected results.
- Show phred scores in DisplaySequenceVariants (tab + base)
- Add a QualityEncoding.PHRED in case one just wants to transfer quality scores without changing quality scale
- Rewritten sam-to-compact mode that handles sequence variations better, handles bsmap sam files better,
and handles quality score conversions more flexibly. The old mode is still around called
sam-to-compact-old for comparison. The new mode has slightly different command line parameters.
- Added a discover-sequence-variants mode format 'methylation' to estimate methylation rates for RRBS and
Methyl-Seq alignments.
- Dramatically improved TMH loading times for large alignments.
- Completely removed support for queryLength in header. This usage was deprecated in Goby 1.7, complicates
the code unnecessarily and is error prone (because we had two ways to store read length in the previous
versions of Goby). Note that versions since 1.7 had a concat mode that transferred information from the
header to the alignment entries transparently. Use this mode from a pre 1.9.4 release if you need to
migrate a 1.6- alignment to work with Goby 1.9.5+.
- Fixed a bug where merge-compact-alignments would throw an ArrayIndexOutOfBounds because a TMH
query index was smaller than the first query index in the alignment.
- Changed discover-sequence-variants mode to filter out alignment entries whose read mapped to multiple locations in the
reference (as determined by the aligner argument, e.g., -n for gsnap).
- Made AlignmentReader an interface. The previous AlignmentReader class is now called AlignmentReaderImpl.
- ConcatSortedAlignmentReader and ConcatAlignmentReader now support a configurable AlignmentReaderFactory.
The factory makes it possible to plug in alignment readers that filter entries as they are read. The default
factory returns all reads. However, if NonAmbiguousAlignmentReader factory is installed, the concatenate
reader returns only entries for which the read did not match other locations in the genome. Other filtering
behaviour can be implemented in a sub-class of AlignmentReader (see NonAmbiguousAlignmentReader for an example)
and a factory created to return instances of this class.
This mechanism is used to filter out entries whose reads match several locations on the reference sequence.
- Goby now includes a VCFParser class (see package edu.cornell.med.icb.goby.readers.vcf). VCF stands
for Variant Call Format. The VCF format is described at http://www.1000genomes.org/node/101.
The Goby VCFParser class implements a VCF 4.0+ parser. Importantly, this implementation also can be
used to parse plain TSV files, or VCF files that do not include the fixed VCF columns. It therefore supports
an extended version of the VCF format that is as generic as a TSV file, but can also provide meta-information
about the columns in the specific file. Another difference with VCF 4.0 is that we support the Group
attribute on column fields. This makes it possible to indicate that fields are part of the same group.
Such a feature can be used by user interfaces that would like to offer the ability to manipulate multiple
column fields as a group (for instance to hide or show an entire group of fields).
- FDR mode now supports VCF input files and outputs. See the option --vcf to activate processing of VCF formatted
files.
- Added a VCFWriter class to write files in the VCF4 format. This class is now used by discover-sequence-variants
when writing in genotypes format. This should make it possible to use vcf-tools on the genotype files produced.
- Fix logic for IterateSortedAlignments which, in turn, fixes sequence-variation-stats2. The issue primarily
dealt with insertions, deletions, and left and/or right padding.
- Fixed the logic for TAB_SINGLE_BASE in display-sequence-variation mode to report the correct
read_index and ref_position.
1.9.4
- The C API (used by BWA, GSNAP) has been updated to more accurately write sequence variations (this version
fixes problems in reporting of the read index). We have created examples of how sequence variations are
encoded in Goby alignment files. These examples are available at http://tinyurl.com/goby-sequence-variations
- Mode concatenate-alignments now propagates names and versions of the aligners that contributed input alignments.
- Mode sort now propagates the name and version of the aligner that produced the alignment.
- Mode compact-file-stats now reports the name and version of the aligner that produced a Goby alignment file.
- Mode discover-sequence-variants has been extended to support multiple types of outputs (see --format flag).
One output format prints genotypes (--format genotypes), while another estimates the proportion of the
reference allele in each sample (--format allele_frequencies).
- Added a mechanism to support base filters in discover-sequence-variants. To activate these filters, you must provide
the --eval option with the "filter" option. Two filters are currently active when --eval filter is used: one
filters variant bases by quality score (keeping only bases with q-phred>=30) and another is a simple and efficient
strategy to remove bases that do not quite agree across all the observations.
Future versions will make it possible to customize the set of filters and their options.
- sequence-variation-stats2 now runs in parallel up to the available number of threads when multiple alignments are
given as input.
- display-sequence-variations and sequence-variation-stats modes: Fix problems in the logic to calculate
read-index for large insertions/deletions.
1.9.3
- This release has a C API compatible with our development version of GSNAP. A version of GSNAP released
after 2011-03-11 should compile with Goby 1.9.3.
- Add new statistics for discover-sequence-variants. Notably, we now record the log odds ratio,
the estimated standard error of the log odds ratio, as well as a Z-score for the log odds.
Standard error and Z-score are only estimated if more than 10 counts exist in each cell of the contingency table.
Also added the proportion of reference allele (refCount / (refCount+varCount)).
- Fix reformat-compact-reads bug where quality scores were longer by 1 than the sequence.
- Reduce the memory needed by compact-file-stats to determine the number of reads in a compact reads file.
- Changed how the number of reads in an alignment file is determined by compact-file-stats. We now report the number
stored in the alignment header.
- Change how log2 fold change was estimated. We used to estimate as ((log2_rpkm_group_a+1)/ (log2_rpkm_group_b+1)).
This can cause problems when log2 rpkm are negative in one group and positive in the other. We now add 1 to counts
before calculating RPKMs and taking the log. Similar changes were done to the fold-change. RPKM columns now return
RPKM of (count+1).
- Mode reformat-compact-reads now takes an optional -f argument to filter reads. This option can be used to
remove redundant reads from a compact-reads file (see tally-reads mode to produce the read filter). It is no longer
necessary to do round-trips to fastq to remove redundant reads.
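The log odds ratio statistics listed for this release can be written out for a 2x2 table of reference and variant
allele counts in two groups. The sketch below uses the textbook formulas (standard error as the square root of the
sum of reciprocal cell counts); the orientation of the ratio, the function names and the treatment of empty cells are
assumptions, while the "more than 10 counts in each cell" rule and the reference allele proportion come from the
entry above.

```python
import math


def log_odds_stats(ref_a, var_a, ref_b, var_b, min_count=10):
    """Return (log odds ratio, standard error, Z-score) for a 2x2 allele count table.

    Standard error and Z-score are None unless every cell holds more than min_count counts.
    """
    cells = (ref_a, var_a, ref_b, var_b)
    if any(c == 0 for c in cells):
        return None, None, None  # the log odds ratio is undefined with an empty cell
    log_odds = math.log((ref_a * var_b) / (var_a * ref_b))
    if any(c <= min_count for c in cells):
        return log_odds, None, None
    se = math.sqrt(sum(1.0 / c for c in cells))
    return log_odds, se, log_odds / se


def reference_proportion(ref_count, var_count):
    """Proportion of reference allele: refCount / (refCount + varCount)."""
    return ref_count / (ref_count + var_count)


# Hypothetical counts: group A has 40 ref / 20 var, group B has 20 ref / 40 var.
log_odds, se, z = log_odds_stats(40, 20, 20, 40)
```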
1.9.2
- Fixed a major bug in discover-sequence-variants that sometimes could cause confusion in the group of origin of a
variation. This bug could affect between group p-values. A Junit test now checks for the error condition and
is part of regression testing.
- sam-to-extract mode: append ".compact-reads" to output filename when the extension is missing.
- Added a mode to display aligned reads for a region of the reference sequences. The reads are written
in fasta format, suitable for viewing with a sequence alignment viewer such as JalView, CINEMA, etc.
The mode is called alignment-to-pileup.
- ConcatenateAlignmentReader would consume excessive amounts of memory when several large alignments
(e.g., with >100 million reads) were concatenated. The reader was trying to allocate very large queryLength
arrays, even though each underlying reader indicated that its entries carried the queryLength.
The fix consists in detecting that all the concatenated readers support queryLength in entries, and
not allocating these arrays at all. This is a major bug fix that makes it possible to run more
instances of goby modes on the same server (i.e., differential expression and sequence variant discovery
modes have significantly improved memory usage).
- Mode sam-extract-reads now supports an optional --quality-encoding argument. Default is BAM encoding.
- QualityEncoding now supports BAM encoding (no offset or adjustment, the value of the character in
ascii is the Phred score).
- Fixed sam-extract-reads. Was not extracting sequences from BAM files.
- compact-to-fasta mode: now supports reading an arbitrary slice of input.
- sam-to-compact mode: draft support for importing SAM files produced by BSMAP.
- fixed a bug that prevented running sam-to-compact mode from command line. An assertion prevented the code
from running from the command line. Clarified the text of the assertion error and read the required parameter
from the command line argument so that the mode will run again on SAM files generated outside of Goby.
- reformat-compact-reads must trim quality scores in the same way that it trims the sequence. Quality scores
were not trimmed in previous versions. This is now fixed.
- reformat-compact-reads now correctly processes sequence pairs. Sequence pairs and quality scores can now
be trimmed in the same way as the primary sequence.
- Expose sampleFraction via API and command line for read-quality-stats mode
- Make fasta-to-compact mode more callable via API
- reformat-compact-reads during 'mutate' will no longer complain when there is no sequence pair to
mutate (mutation is neither attempted nor reported when sequence.length is zero).
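The BAM quality encoding noted above (no offset, the byte value is the Phred score) can be contrasted with offset-based encodings in a minimal sketch. The function and table names are hypothetical and are not part of the Goby API; the offsets of 33 (Sanger) and 64 (Illumina 1.3+) are the usual FASTQ conventions and are assumed here rather than taken from Goby's QualityEncoding.

```python
# Hypothetical helper, not Goby's QualityEncoding class.
# Offsets follow common FASTQ conventions (assumption): BAM stores raw Phred values.
QUALITY_OFFSETS = {"BAM": 0, "SANGER": 33, "ILLUMINA": 64}


def decode_phred(raw, encoding="BAM"):
    """Convert encoded quality bytes to a list of Phred scores."""
    offset = QUALITY_OFFSETS[encoding]
    return [value - offset for value in raw]


# The same four scores, once as Sanger FASTQ characters and once as raw BAM bytes.
sanger_scores = decode_phred(b"II5!", "SANGER")
bam_scores = decode_phred(bytes([40, 40, 20, 0]), "BAM")
```

Both calls yield the same list of scores, which is the point of treating BAM as an encoding with a zero offset.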
1.9.1
- fasta-to-compact mode: fix bug that prevented checking that quality scores are in the allowed range.
Quality scores must now be converted within the correct score range before the compact-reads file can
be written successfully.
- Parallelize the estimation of statistics. This can speed up mode alignment-to-annotation-counts.
- Introduced fields spliced_alignment_link and spliced_flags in AlignmentEntry to represent the relation
between parts of reads that span exon-exon junctions.
- Introduced insert_size in AlignmentEntry to represent the size of the insert used when making
the sequence library.
- Introduced meta-data in compact-reads files. Meta-data provide a way to document how the sample
was obtained. Suggested information to be recorded includes when the library was sequenced (useful
to detect batch effects, as suggested by a participant at the SEQC meeting at the NIH Bethesda campus),
as well as sequencing instrument. Modes fasta-to-compact, compact-file-stats and reformat-compact-reads
have been updated to define, transfer or display meta-data when appropriate.
- Mode compact-alignment-stats now prints statistics about paired-end reads.
- Removed spurious SAM header when writing alignments in plain text format.
1.9
- New fdr mode provides a tool to combine tab-delimited files where some columns contain P-values and
adjust selected P-values for multiple testing with the Benjamini-Hochberg method. The tool is efficient
in that it only keeps P-values that need to be adjusted in memory, and keeps the other columns on disk.
This strategy is expected to scale to hundreds of millions of lines of information.
- Add a way to open only a slice of an indexed alignment file by position. This feature makes it possible
to retrieve all alignment entries that start between specific position boundaries. See new constructor
in AlignmentReader and ConcatSortedAlignmentReader.
- The mode discover-sequence-variant has been updated to take advantage of the alignment position slicing
feature introduced in Goby 1.9. See the new arguments --start-position and --end-position.
- Fix a bug in skipTo that caused some alignment entries to fail to be returned (skipTo previously ignored
entries that occurred in the chunk just before where the index points). This behaviour is incorrect because
the chunk just before where the index points may contain entries with positions equal to the skipTo requested
position. The index contract is to return the chunk that starts with an entry with the requested location.
Because chunks contain multiple entries with increasing positions, the chunk immediately before the indexed
chunk must be scanned and filtered to remove entries with positions before the skipTo requested position.
A new test was written to check for this issue (TestSkipTo.testFewSkips4).
- Provide Building/Installation instructions for the Goby C++/C API.
- Implemented a fast concatenation operation for read files. The new -q flag in ConcatenateCompactReadsMode
activates the fast concatenation. Chunks of compressed data are appended without requiring decompression and
compression of the entries. This results in much faster concatenation, bounded only by available I/O.
- Add mapping_quality field to AlignmentEntry protobuf schema.
- Add aligner name and version in AlignmentHeader protobuf schema.
- Added C/C++ api methods to set aligner name and version, and alignment entry mapping quality.
- Updated the C API to be more generic, less oriented toward any one
particular 3rd party tool. The read-API is now more generic, the write-API
hasn't changed. The C API files, including the .h header files, have been renamed.
- In C_Alignments.c/.h & C_CompactHelpers.h added CSamHelper and samHelper_* methods to assist
with conversion of BWA to support CompactAlignments as the data stored in BWA just prior
to writing alignments is effectively already in SAM format. These methods make it possible
to reconstruct the aligned query and reference so data can be written in compact alignment.
- Goby C/C++ API now requires the pcre (regex) >=8.10 library. See http://www.pcre.org/
- Compact alignments now support paired-end alignments in Java / C++ / C APIs.
- In alignment-to-text mode, output support in PLAIN and SAM for Paired End alignments
- In the alignment .stats file, renamed the stat "number.aligned.reads" to the more accurate name
"number.alignment.entries" for both the Java API and the C++ API.
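The Benjamini-Hochberg adjustment used by the fdr mode in this release can be sketched as follows. This is a minimal illustration of the standard step-up procedure on an in-memory list of P-values, not Goby's implementation; the function name is hypothetical, and the on-disk handling of the other columns is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Only the selected P-value column needs to be held in a list like this, which is consistent with the memory strategy described for the fdr mode.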
1.8
- C API introduced to support native Goby support in GSNAP.
- We now distribute a subset of Goby as the Goby IO API. This subset is packaged in the goby-io.jar
file and released under the LGPL3 license. This was done to make it possible to include Goby format
input output code directly into other software licensed under the LGPL3.
- Fixed a bug that prevented Goby opening large alignment files (>3Gb).
- Fixed a bug in AlignmentIterator triggered when reading alignment files with targetIndices starting at
numbers larger than zero.
- Removed the dependency on colt (its license is not pure LGPL because it adds restrictions on military
applications).
- SGE helper scripts bz2compact.sh and keep-unique-reads.sh help process hundreds of lanes in
parallel on an SGE grid. bz2compact extracts fastq files compressed with BZip2 and converts
them to compact-reads format. keep-unique-reads.sh determines the set of reads that are unique
in each input <file>.compact-reads and writes this information to a <file>.uniqset-keep.filter file.
- Mode concatenate-compact-reads now supports read index filters. This makes it possible to
concatenate and keep only reads that are unique within each file.
- Draft helper to iterate through individual reference positions of a sorted set of alignments
(see IterateSortedAlignments).
- Alternative implementation of sequence-variation-stats mode (called sequence-variation-stats2)
that determines the number of reference bases matched at a given read index. This info is needed
to call sequence variants, but slows down the stats. The initial implementation is preserved for
compatibility.
- New mode discover-sequence-variants will either (i) identify sequence variants within a group of samples
or (ii) identify variants whose frequency is significantly enriched in one of two groups.
This mode requires sorted/indexed alignments as input.
- SamToCompact mode now populates the read quality scores for sequence variations (toQuality field).
- Update picard/samtools to version 1.25.
- In the mode "alignment-to-annotation-counts" the "--eval" option supports
a new value "counts" which will output a format specifically designed
for use with R's DESeq and notably for the R script geneDESeqAnalysis.R
which is used with GobyWeb.
- Fix bug in extract sequence variations for SAM format, where matches on the
reverse strand got a read-index one larger than the correct value.
- By default, don't use "counts" in DiffExp as it is a specialized output for preparing for DESeq.
- API interface for ReadsToWeightsMode.
- LastToCompactMode wasn't writing target lengths. Fixed.
- Read TMH in Python using Gzip.
- Fixed Python utilities so -o actually writes to a file.
- Added transcript-align.sh script to assist with aligning via transcripts.
- In MessageChunksWriter, flush logic should occur on a COMPLETELY empty file, but otherwise it
should only occur if entries have been added since the last flush(). In both C++ and Java.
- DiffAlignmentMode can better compare alignments produced by two different aligners when
targets have the same labels but not the same TargetIndex values. It builds a master
TargetIndex and translation maps for the two alignments.
Targets are now shown by label name instead of TargetIndex.
- CompactFileStats --verbose on a compact alignment shows the targetIndex -> targetIdentifier
map and also displays the targetLength for that targetIndex.
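The MessageChunksWriter flush rule described in this release can be read as the following minimal sketch. The class below is hypothetical and only models the rule (write a chunk when nothing has ever been written, otherwise only when entries are pending); it does not reproduce the Java or C++ classes.

```python
class ChunkWriterSketch:
    """Toy model of the flush rule; not the Goby MessageChunksWriter."""

    def __init__(self):
        self.pending = []  # entries added since the last flush
        self.chunks = []   # chunks already written to the "file"

    def append(self, entry):
        self.pending.append(entry)

    def flush(self):
        file_is_empty = not self.chunks
        # Write on a completely empty file, or when new entries are waiting.
        if file_is_empty or self.pending:
            self.chunks.append(list(self.pending))
            self.pending.clear()


writer = ChunkWriterSketch()
writer.flush()        # empty file: one (empty) chunk is written
writer.flush()        # nothing new: no additional chunk
writer.append("entry-1")
writer.flush()        # pending entry: a second chunk is written
writer.flush()        # nothing new again
```

Under this reading, repeated calls to flush() never add empty chunks after the first one.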
1.7
- Extended fasta-to-compact and compact-to-fasta to handle paired end runs. See new command
line arguments --paired-end and pair-indicator arguments in fasta-to-compact and
--pair-output argument in compact-to-fasta.
- Draft support for paired sequence runs. The compact file format is extended to store
sequence, sequence length and quality scores for the paired run. This extension makes
it possible to store both paired end runs in a single compact file. This should help
keep the data together.
- Implemented translation to and from Solexa quality score encoding in fasta-to-compact
and compact-to-fasta. Thanks to Cock PJA et al NAR 2010 for the clear description of the
Solexa base quality scores.
- The sort mode now supports reading only a slice of an input alignment (see options
--start-position and --end-position).
- Refactored CompactAlignmentToAnnotationCountsMode to use IterateAlignments (provides
large speed ups when working with sorted/indexed alignments and selecting a subset of
reference sequences for DE).
- IterateAlignments now takes advantage of the skipTo method when the alignment is sorted
and indexed. This provides large performance improvements when one needs to access data
for only a few reference sequences in an alignment. All the modes that use
IterateAlignments benefit, including display-sequence-variations, and
sequence-variation-stats.
- Index alignments that are sorted upon writing. The skipTo method leverages the index
to provide fast semi-random access to entries by genomic location. This feature is used
by the IGV Goby plugin, which requires Goby 1.7+.
- Concatenate alignment now produces sorted alignments if all the input alignments
are sorted.
- Added a mode to sort alignment by reference sequence and then by position
on the reference sequence.
- Support to estimate read weights described in Hansen KD et al NAR 2010.
See http://campagnelab.org/software/goby/tutorials/estimate-heptamer-weights/
In contrast to the initial publication, Goby supports using the weights to
reweight annotation counts and transcript counts.
- Support to estimate GC content weights for reads and to reweight raw counts to
remove the dependence of counts on GC read content.
- Preliminary support for barcoded reads (barcodes in the sequence), see new
mode decode-barcodes (and tutorial online at
http://campagnelab.org/software/goby/tutorials/handling-barcoded-reads/).
- alignment-to-*-counts: New --eval argument allows specifying which statistics
to evaluate when comparing samples.
- alignment-to-*-counts: New eval option 'samples' will write a column per sample
for RPKM, log2(RPKM) and raw counts. RPKM and log2(RPKM) are written once per sample
and global normalization method.
- Reduce memory requirements when concatenating many alignments. A change
introduced in 1.6 caused more memory than needed to be allocated for each
split of an alignment (as much as the number of reads in the file that
was split). Each split now uses only as much memory as needed to keep
query lengths for the split.
- Dramatically improved performance for differential expression tests with millions of
differentially expressed elements (e.g., exon+gene+other). The code previously
incorrectly grew internal arrays from zero to the number of new DE elements described
in the annotation file.
Changes that impact the compact alignment format:
- The compact file format is extended to store sequence, sequence length and quality scores
for the paired run. This extension makes it possible to store both paired end runs in a
single compact file. This should help keep the data together.
- Moved query lengths from header to alignment entries. This scales much
better when processing large alignment files (generated from more than
a few hundred million reads).
- The optional 'sorted' attribute in header indicates if an alignment has been sorted.
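The index-backed skipTo access to sorted alignments introduced in this release can be pictured with a toy model. The sketch assumes chunks are lists of positions in increasing order and that the index records the first position of each chunk; the names are hypothetical and the real reader works on compressed protobuf chunks. Because a chunk holds several entries, the chunk just before the indexed one may still contain entries at the requested position, so it is scanned and filtered as well.

```python
from bisect import bisect_left


def skip_to(chunks, position):
    """Return all entries at or after `position` from position-sorted chunks."""
    # Toy index: the first position stored in each chunk.
    first_positions = [chunk[0] for chunk in chunks]
    # First chunk that starts at or after the requested position.
    indexed = bisect_left(first_positions, position)
    # Also scan the preceding chunk, which may end with matching entries.
    start = max(indexed - 1, 0)
    result = []
    for chunk in chunks[start:]:
        result.extend(p for p in chunk if p >= position)
    return result


chunks = [[1, 3, 5], [5, 8, 9], [12, 15]]
entries = skip_to(chunks, 5)
```

Starting at the indexed chunk alone would drop the entry at position 5 that closes the first chunk.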
1.6
- First draft of the Goby Python API and demonstration tools (see
directory python).
- Fix bug where compact file stats mode reported that a compact alignment
had query identifiers when it actually did not.
- Added within-group-variability mode. This mode estimates Fisher P-values
between pairs of samples taken from a group of homogeneous samples.
Summary statistics such as average p-value, or minimum p-value are
reported for each gene in each pair considered.
- Update JRI.jar to version 0.8-4 which now works properly with 64-bit
Windows.
- Update commons-lang to version 2.5.
- Optimized DE type storage.
- Fixed a race condition in CompactAlignmentToAnnotationCountsMode.java
when running in parallel by moving .reserve() out of the for loop.
- Renamed DifferentialExpression.ElementTypes enum to ElementType
- Fixed a bug in the DifferentialExpressionCalculator which reset
ElementType for a value from the actual value to OTHER (in occurred
in CompactAlignmentToAnnotationCountsMode). Now once ElementTypes
is set for a label it cannot be changed.
- CompactFileStatsMode now supports an optional -o to write the output
to a file. If not specified the output will be written to stdout.
- Reformat reads now preserves read indices from the input file.
This is necessary when using concat alignment with
--adjust-query-indices false
1.5
- Added a mode to calculate counts and perform differential expression
analysis for transcript runs (alignment-to-transcript-counts).
Transcript runs are performed against a cDNA library. They find matches
through exon-exon junctions represented in the input cDNA
library. They are a faster alternative to mapping the genome and
exon-exon boundaries separately. Disadvantage is that these searches
will only map to transcripts represented in the input library.
- Changes to fasta-to-compact mode:
- Add parallel processing in fasta-to-compact mode. Use the --parallel
flag to activate.
- Will now only regenerate compact-reads that do not
exist, or are older than the input file.
- Added a mode to write a read set to text format (set-to-text). The output
will show the multiplicity of each query index. ReadSets can be
efficiently created with tally-reads as before.
- Changes to CompactAlignmentToAnnotationCountsMode
- Added new option --write-annotation-counts boolean, defaults to
true. If set to false the annotation counts intermediate files
will not be written.
- Lines where "average count group *" values are ALL NaN or <= 0 will
not be written. This makes it so lines that don't add anything to
the output are just omitted.
- Added new option --omit-non-informative-columns, defaults to false.
If set to true, columns in which all of the data is non-informative
(values are ALL NaN or <= 0) will be omitted.
- Support for alternative global normalization methods. We currently
provide an implementation of the upper quartile normalization method
by Bullard et al (BUQ) and the normalization method provided in
Goby 1.4 (CAC, normalize by the number of alignment records in a sample).
See the --normalization-methods argument. New normalization methods
can be used with Goby by creating an implementation of the
NormalizationMethod interface,
and adding a jar on the classpath that defines a ServiceProvider
(see build.xml goby-jar target for an example of how this is done).
When several normalization methods are given as an argument
to --normalization-methods Goby will produce derived statistics
for each normalization method and append them as new columns in
the summary stats output. This makes it easy to compare alternative
normalization methods on the same dataset.
- Added support for sequence variations:
- Changed the compact alignment format to support recording sequence
variations.
- The new mode display-sequence-variations provides text output of
sequence variations in several formats.
- The new mode sequence-variation-stats will print statistics about
sequence variations found in a set of alignments.
- Added support for quality scores:
- Changed fasta-to-compact and compact-to-fasta to read and write with
the Sanger or Illumina quality encoding.
- Modified aligners to indicate which format they require (bwa needs
fastq format, lastag fasta format, lastal fastq format). This will
need extensive testing as some of these changes can affect gobyweb.
We use the FASTQ-SANGER encoding to communicate with lastal.
We don't yet support the Solexa quality score encoding (it is a bit
obsolete anyway).
Please note that the output format in compact-to-fasta now defaults to
Fasta format. This format has no quality scores, and consequently, we
now never write quality scores when Fasta is requested. The aligners
that need quality scores must request FASTQ format explicitly.
See also:
http://en.wikipedia.org/wiki/FASTQ_format
http://maq.sourceforge.net/fastq.shtml
http://last.cbrc.jp/last/doc/last-manual.txt (look for FASTQ-SANGER)
- Changes to the Compact format:
- Store target/reference sequence lengths in the alignment header. This
information is helpful when calculating statistics such as RPKMs
(transcript-level searches).
- Store constant query lengths as one integer. Goby 1.4.1 stored one
length for each read. This can become very memory consuming when the
number of reads is very large. This change saves memory and storage.
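Storing target lengths in the header matters for statistics such as RPKM, which divide a count by the target length. A minimal sketch of the standard RPKM formula (reads per kilobase of target per million aligned reads) follows; the function name and the example numbers are illustrative and are not taken from Goby.

```python
def rpkm(read_count, target_length, total_aligned_reads):
    """Reads per kilobase of target per million aligned reads."""
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling factors.
    return read_count * 1e9 / (target_length * total_aligned_reads)


# 500 reads on a 2,000 bp target in a sample with 10 million aligned reads.
value = rpkm(500, 2000, 10_000_000)
```

Without the target length available in the alignment header, the length would have to be looked up from the reference sequences separately.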
1.4.1
- Added a mode to write a read set to text format (set-to-text). The
output will show the multiplicity of each query index. ReadSets can
be efficiently created with tally-reads as before.
1.4
- Last aligner (http://last.cbrc.jp/) is now supported "out of the box".
Tested against version last-96. Support for the enhanced version
"lastag" still exists.
- Alignment-to-annotation-counts mode now computes a p-value using R
(if available on the host)
- Update to protobuf 2.3.0 (http://code.google.com/p/protobuf/)
- Default extension for files written in Wiggle Track Format is now ".wig"
for easier integration with the Integrative Genomics Viewer
(http://www.broadinstitute.org/igv/).
Similarly, the default extension for BedGraph Track Format files is
now ".bed".
1.3
- New "counts-to-bedgraph" mode which is similar to "counts-to-wiggle" but