CHANGES.txt

3.3.1
 - Fixed alignment concat where results could be truncated if several empty slices followed one another (e.g., when
 concatenating A, B, C where A and B are empty, goby ca could yield an empty alignment, completely omitting alignments in part C).
3.3.0
 - Substantially reduced memory utilization for discover-sequence-variant (all modes).
 - discover-sequence-variant could in some rare cases output the same base twice (when indels were extending prior to
   the beginning of the read after equivalent indel region calculation). This fix improved indel performance when
   training models with variationanalysis 1.3.3+.
 - Initial work to develop models for genomic segments (see .ssi format and concurrent work in variationanalysis).
   This is work in progress. Protobuf schema is in goby-io/protobuf/SegmentInformationRecords.proto
   Models are developed in parallel with Keras (in goby3/python/dl) and DL4J (in variationanalysis).
 - Updated genotyping model to state of the art (models/genotyping/1510204519948/, see evaluation results in the folder)
3.2.7
 - Somatic output format: report predicted somatic allele in VCF.
 - Variant.FromTo: defined SerializeID. This requires regenerating varmaps.
 - sbi output: Set position and reference base on list copy. Fix for reference base being '\0' in sbi files.
 - vcf-to-genotype-map: Fix VCF to varmap. Incorrect genotypes added prior to this commit (since refactoring
   to use VCF reader from HTSJDK in version 3.2.6).  Show better statistics when creating the map. Fix for
   indels not imported in varmap.
 - GenotypesOutputFormat: Complete rewrite. Fix VCF coding of het sites. Also, when using a model, now we check sampleCount,
   in case the model does not use the matchesRef feature, because such models may return a default non-reference
   base for sites with no coverage.
 - Add usage to goby wrapper. Do not attempt to configure R unless the variable GOBY_USE_RJAVA is configured.
3.2.6
 - Updated models for compatibility with latest code: genotyping model and somatic models are updated.
 - Tested that models produced with variationanalysis (genotype and somatic) load in Goby and can be used
   with the modes to generate VCF.
 - Various bug fixes to last-to-compact mode. Bugs were triggered by output from more recent versions of Last than
   tested previously.
 - Discover-sequence-variations mode: fix VCF output for indels. Genotypes format mostly rewritten.
   Was previously writing incorrect indels. Latest code produces VCF files tested for compatibility with RTG vcfeval.
 - Discover-sequence-variations mode: Add minimum-P and stringent-P options to Genotypes output format.
 - Rewrote VCFToGenotypeMapMode to use HTSJDK VCF parser. This should enable using BCF files as input as well.
 - Fix for count of indels. The first equivalent indel region did not increment the count.
   Counts on forward and reverse now match the number of supporting entries on each strand.
 - Add supporting entry the first time an indel is created in a SampleCountInfo. The supporting entry was not set
   on the first one.
 - Apply count fixer to remove bases matching ref from list, when the mandatory filter has determined the base should
   be removed. Previously was only removed from counts, but not from list of bases. One possible candidate for indel
   performance problems we have tried to fix for a while.
3.2.5
 - Fix issue with toProto that prevented using more than one sample for genotyping with goby.
 - alignment conversion to goby: ignore missing MD tags (it is possible only some reads are missing them and we
 still need to convert the other aligned reads).
 - Upgrade goby to DL4J 0.8.0.
 - fasta-to-compact: Do not use an assertion, but instead reset read index to zero and explain how to avoid the
 problem.
 - SBI format: add distance from start of read and end of read. Will be mapped to a density in next genotype mapper.
 Should help variationanalysis models detect cases where end of alignment is fully contained within homopolymer region.
3.2.4
 - Fix tally-reads mode.
 - Some fixes to realignment of SNPs around indels.
 - improvements to barcode remover (to trim bases from 5' end before removing barcode).
 - Goby version now reports the commit that produced the distribution.
 - Goby version, including commit, is now written to generated .sbi files.
 - Introduce CommitPropertyHelper to record the specific commit that produced the version of Goby being used.
3.2.3
 - Fix SNP bug in realignment around read insertion.
 - Add queryPosition field to SBI output.
 - Prevent the writing of sbi entries when AddTrueGenotypeHelper indicated the entry should not be added.
3.2.2
 - Fix frequency of bases when indels are also present. Now correctly removes bases that
 support the flanking sequence of the indel and does not double count them.
 - Many changes to how we store varmaps introduced to support indels (vcf-to-varmap).
 The serialization format is incompatible with previous versions, so make sure you regenerate
 varmaps from VCF.
 - Adjust VCF output for compatibility with REF/ALT conventions. This makes it possible to measure
 performance with standard tools such as RTG vcfeval (http://realtimegenomics.com/products/rtg-tools/).
 - Keep counts of indels separately for forward and reverse strand.
 - vcf-to-varmap mode: improved semantics of the --chromosome-prefix option, which now allows removing (e.g., -chr)
  or adding (e.g., +chr) a prefix to the chromosome name.
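
 Illustration for the --chromosome-prefix entry above: a minimal Python sketch of how a -chr/+chr
 specification could be applied to a chromosome name. The function name apply_chromosome_prefix and the
 choice to leave a name untouched when the prefix is already present (or already absent) are assumptions
 made for this example. This is not the Goby implementation.

```python
def apply_chromosome_prefix(chromosome: str, spec: str) -> str:
    """Add or remove a prefix on a chromosome name.

    spec is '+<prefix>' to add the prefix or '-<prefix>' to remove it
    (hypothetical helper, modelled on the +chr / -chr examples in the changelog).
    """
    if len(spec) < 2 or spec[0] not in "+-":
        raise ValueError("expected '+<prefix>' or '-<prefix>', got %r" % spec)
    operation, prefix = spec[0], spec[1:]
    if operation == "+":
        # Assumption: do not add the prefix twice.
        return chromosome if chromosome.startswith(prefix) else prefix + chromosome
    # Assumption: names that lack the prefix are returned unchanged.
    return chromosome[len(prefix):] if chromosome.startswith(prefix) else chromosome


with_prefix = apply_chromosome_prefix("1", "+chr")
without_prefix = apply_chromosome_prefix("chr1", "-chr")
```

 With these assumptions, a VCF that names chromosomes 1, 2, ... can be matched to a genome that uses
 chr1, chr2, ... (and the reverse).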
3.2.1
 - fasta-to-compact: fix a bug introduced on 10/6/2016 which created negative read entries.
 - catch a number of exceptions that can be thrown by HTSJDK when processing BAM files. Exceptions
   are caught so that an error on one alignment does not interrupt processing of an entire alignment.
   Errors are shown in log.
 - vcf-to-genotype-map mode now supports (b)gzipped vcf input.
 - vcf-to-genotype-map: fix bug that manifested itself when the vcf had a single genotype field.
 - vcf-to-genotype-map: add chromosome-prefix argument to help import VCF where the chr prefix is missing.
3.2
 - Remove memory leak when reading SAM/BAM files. This was the likely cause for running out of memory error in
 compression benchmarks (had nothing to do with compression but with the conversion of SAM/BAM to goby representation).
 - Disabled tests that could not succeed anymore (because of choices we made in Goby 3, such as lack of auto-upgrade
 for alignments produced with Goby 1 and 2.)
 - BAM/CRAM support. Added an option to bypass the header check on SO:COORDINATE. Use
   -x HTSJDKReaderImpl:force-sorted=true to force Goby to consider an alignment sorted.
 - SBI format: add ability to add true labels while writing the file. Add support for downsampling sites without
   variants.
 - Genotype format: reorganization to support calling with deep learning models trained with variation analysis.
3.1
 - Reorganize model prediction to facilitate installing new versions of the variationAnalysis jars.
 Goby 3.1 is now compatible with variationanalysis 1.1.1.
 - Replace models with versions trained with variationanalysis 1.1.1.
 - Add somatic mutation models trained with whole genome data (ICGC GoldSet).
3.0.0
 - Support reading BAM alignments directly with Goby APIs.
 - Support probabilistic models for calling somatic variations, trained with deep learning.
2.3.6
 - Improve performance of realignment around indels when processing RNA-Seq reads. Previous versions of Goby had
   scalability issues and kept data around from previous chromosomes. This was OK when processing DNA-Seq inside GobyWeb,
   which splits data into genomic slices, but not when trying to process one or more RNA-Seq alignment files.
   Performance has also been dramatically improved by fixing a bug on indel equality.
2.3.5
 - Add a mode to infer sex of samples from data (tested on exome data). Useful as quality control to check that the
   data you get checks out with respect to what is known about the samples. See --mode infer-sex. Works
   faster on sorted alignments where the index is used to jump quickly to the human sex chromosome.
 - Prevent AbstractAlignmentToCompactMode from printing more than 10 warnings if quality scores are not available in
   an alignment.
 - suggest-position-slices: fix a bug that caused some slices to overlap. Found with a job with hundreds of
   alignments, so not common.
2.3.4.1
  - Add an option to the fasta-to-compact mode that will convert a set of files and concatenate the result
    to a single compact-reads file (see new --concat option).
  - Add a mode to test that the connection from Goby to R is working (requires JRI and R built
    with shared library support). The mode is called test-r-connection (tcr).
  - Restore STRICT_SOMATIC filter.
  - Close files opened when loading Goby Alignment header and index files. This fixes a "too many open files" error
    that could occur when loading hundreds of alignments simultaneously.
  - Allow lenient import mode for TSV files. This makes it possible to convert TSV files to lucene.index when
    they have been created with Goby in the past with a \t character as last character of the column line.
  - Fix a bug that caused some slices to occur within annotations, despite the --annotation option being given
    on the command line. The problem was that the chromosome index was not obtained from the genome and was set
    to zero, always.
2.3.4
 - Optimize the speed of genotyping when some sites have very high coverage (>500M bases).
   Now sub-sampling to keep a random set of 10,000 bases for such sites. Expose the default
   sub-sample size with a dynamic option called sub-sample-size in IterateSortedAlignmentsListImpl.
   (-x IterateSortedAlignmentsListImpl:sub-sample-size <int>)
 - LastToCompact mode now supports the import of paired end alignments produced by Last's last-pair-probs.sh.
 - LastToCompact mode now supports the import of quality scores (lastal must be done with -Q1 since the
   import assumes Phred quality scores on the q lines).
 - Add two methods to AlignmentReader to determine the minimum and maximum genomic locations represented
   in the reader. This is useful when suggesting slices to split a set of alignments. This commit includes
   a fix for possible null start or end positions in slices generated with suggest-position-slices.
 - Fix a problem with run-in-parallel where some threads would never finish when they do not detect
   the keyword. Now indicate that the thread finished so that others can start when the processing
   completes.
 - reads-file-stats: remove any path from basename in the output.
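
 Illustration for the sub-sample-size entry above: a minimal Python sketch of keeping a random set of at
 most 10,000 bases at a very high coverage site. The function name sub_sample_bases and the rule "sub-sample
 whenever the site holds more bases than the sample size" are assumptions for this example; the changelog
 does not state the exact trigger used by IterateSortedAlignmentsListImpl.

```python
import random


def sub_sample_bases(bases, sub_sample_size=10_000, rng=None):
    """Return at most sub_sample_size bases, chosen uniformly without replacement."""
    rng = rng or random.Random()
    bases = list(bases)
    if len(bases) <= sub_sample_size:
        # Low coverage site: nothing to discard.
        return bases
    # random.sample draws distinct elements, so no base is counted twice.
    return rng.sample(bases, sub_sample_size)


# A hypothetical site with 50,000 observed bases (represented here by their indices).
site = list(range(50_000))
kept = sub_sample_bases(site, rng=random.Random(1))
```

 Capping the number of bases per site bounds the per-site cost of genotyping regardless of coverage.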

2.3.3
 - IterateSortedAlignmentsListImpl: Use a WarningCounter to limit warnings to 10 instances. This is needed to
   avoid writing Gb of log output when the threshold is met.
 - discover-sequence-variants somatic output: Make it possible to run a simple trio design by removing the
   requirement for a germline sample.
 - discover-sequence-variants somatic output: Earlier versions were reporting somatic variation candidates
   when two parents are homozygotes and the somatic sample was Het (the fisher p-value with each parent is
   very significant in this case, but does not indicate a somatic change). This also improves q-values because
   there are fewer results that need to be corrected.
 - discover-sequence-variants somatic output: Add an error message when a sample is mis-spelled in the covariates
   file.
 - Refactor code base to keep base counts for forward and reverse strands separately in SampleCountInfo.
 - Normalize somatic priority score by number of mapped reads, and number of parents and germline samples used in
   the calculation.
 - Add a StrandBiasFilter in somatic analyses. The filter rejects variations that are not represented on both
   strands when at least j reads support the variation. The value of j is set to 9 by default, so a variation with
   10 bases needs to have at least the two strands represented.
 - Remove candidate somatic variation that can occur when the germline samples have less coverage than the
   somatic sample. Now require at least twice the coverage in the somatic sample than the minimum coverage
   in the germline samples.
 - Add a STRICT_SOMATIC filter that flags genomic sites where some bases appear in support of the variation
   in the parents or germline samples. Please note the VCF spec semantic: PASS indicates that all filters passed.
   This means that lines with the STRICT_SOMATIC value in the FILTER column failed that test.
 - Fix a bug in FDR mode that would not handle vcf files with non default FILTER values.
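
 Illustration for the StrandBiasFilter entry above: a minimal Python sketch of the rule as described, where
 a variation supported by at least j reads must be seen on both strands. The function name
 passes_strand_bias_filter and its parameters are assumptions for this example; only the default j = 9 and
 the both-strands requirement come from the entry.

```python
def passes_strand_bias_filter(forward_count: int, reverse_count: int, j: int = 9) -> bool:
    """Return False when a well supported variation is seen on one strand only."""
    total_support = forward_count + reverse_count
    if total_support < j:
        # Too few supporting reads: the filter does not apply.
        return True
    # At least j supporting reads: both strands must be represented.
    return forward_count > 0 and reverse_count > 0


one_strand_only = passes_strand_bias_filter(forward_count=10, reverse_count=0)
both_strands = passes_strand_bias_filter(forward_count=6, reverse_count=4)
low_support = passes_strand_bias_filter(forward_count=5, reverse_count=0)
```

 This relies on the separate forward and reverse strand counts kept in SampleCountInfo (see the refactoring
 entry in this release).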
2.3.2
 - run-parallel-mode now supports paired input files.
 - fasta-to-compact: add --force-quality-encoding option to force the quality values within the specified
   encoding range.
 - suggest-position-slices: fix problem where first slice of genome was omitted from output (with new split
   by number of bytes option introduced in 2.3).
2.3.1
 - Fix for https://github.com/CampagneLaboratory/goby/issues/3
 - Upgrade commons-io and dsiutils to latest jar versions. Log messages when scanning reads file with cfs mode.
 - DistinctValueCounterBitSet: now grows to biggest size at construction time.
 - Fixed a performance problem. When reading large reads file (>10GB), performance of ReadsReader would degrade
   over time. This was due to caching of data in static protobuf methods of ReadCollection. We now create a
   builder instance that gets garbage collected when it is no longer used. This fixes a subtle performance
   problem. The same fix has been applied to alignment readers.
2.3
 - concatenate-alignments mode: add ability to restrict output to a genomic slice (see -s and -e options).
 - API change: AlignmentSliceHelper makes it easier to parse and process genomic slices for sets of alignments.
 - concatenate-alignments mode: now transfers read groups to output in the same way that non-sorted concat does.
 - concatenate-alignments mode: Add a mechanism to override/define read groups/read origin info on the fly when
   reading alignments that did not include them. Coupled with changes to compact-to-sam, this makes it possible
   to get BAM files with read groups directly from Goby alignments.
 - compact-to-sam mode: fixed output of read groups, which were not correctly written for platform, platform unit,
   and library.
 - suggest-position-slices: add --restrict-per-chromosome option.  When this switch is provided, slices will be
   restricted to start and end on the same chromosome. This is useful to produce intervals to give Mutect,
   for instance.
 - Trim mode: add --trim-left --trim-right parameters to control trimming of specific sequence extremities.
 - Trim mode: add --verbose flag.
2.2.1
 - FDR mode: add ability to read groups from VCF file and adjust columns/fields marked as p-value. Mark adjusted
   columns with group q-value.
 - Somatic variation output format: annotate somatic p-value column with 'p-value' group. Fix the type of the p-value
   column to be a number (was String in release 2.2).
 - Somatic variation output format: handle unrecognized sample-ids in the parents column.
 - discover-sequence-variants mode: add assertion to give hint to user that syntax is incorrect for the -s and -e options.
 - compact-file-stats mode: print progress when scanning reads files. Use a buffered reader to improve read file
   parsing performance.
 - discover-sequence-variants: adjust multiplier for left-over filter for somatic variations output format.
 - discover-sequence-variants: Add a new filter to remove indels at a site where a sample shows lots of distinct
   possible indels. Indels at these sites are very likely to be artefactual. We count the number of samples where
   three distinct indel genotypes are seen. If more than 1/4 of the samples have likely indel artifacts, we remove
   all indel candidates at the site. maxIndelPerSite:Maximum number of distinct indels at a given genomic site.:1
   Additional filter: fractionOfSamples: Maximum fraction of samples that can have an indel candidate for the indel
   to be considered (indel candidates that occur in many samples are more likely to be spurious).:0.25
   This filter is added to the somatic variations output format. See dynamic options for this filter with --x-help
2.2
 - Remove threshold effects when calling genotypes in several samples. Modified the filters to not remove bases in
   specific samples when the genotype survived filters in at least another sample (previous versions reported these
   threshold edge effects as differences, which could be confusing, this version simply shows the marginal raw base
   counts in samples where the genotype could have been filtered by a filter, which makes it easier to compare the
   strength of the genotype support across samples). This adjustment was done for both base genotype and indel genotypes.
 - LeftOverFilter: now uses minVariationSupport as minimum threshold.
 - Mode suggest-position-slices: add option number-of-bytes to suggest slices with a uniform number of compressed
   bytes. This option aims to provide more balanced slices in cases where the genome has very non-uniform coverage
   by position. With this option, the number of slices is determined to yield slices that need to decompress about
   the number of bytes indicated on the command line.
 - Framework API change: introduce class PositionToBasesMap<T> to use as type for positionToBases. The class provides
   methods to get the range of positions described in the map. This unfortunately requires changes to all clients/
   implementations of IterateSortedAlignments<T>.
 - Mode discover-sequence-variants: Fix various problems that prevented reporting genotypes for deletions (i.e., C/-).
 - Fix a potential NPE in GroupAssociations when samples are null.
 - Fix for issue #2, see https://github.com/CampagneLaboratory/goby/issues/2
 - Expose comparator in SortedAnnotations.
2.1.2
 - Upgrade xstream to version 1.4.3. This fixes the compatibility problem seen when running goby 2.1.1 with java 1.7+.
   Goby 2.1.2 should run with Java 1.7+, but more testing will be needed to rule out other migration problems. If you
   are running JDK 1.7+ please let us know any issues you encounter.
 - Fix VCFParser issue https://github.com/CampagneLaboratory/goby/issues/1. The issue could be triggered when the FORMAT
   column changed from line to line.
 - VCFWriter: improve support for VCF group associations. The Goby VCF parser makes it possible to associate columns
   to groups (these associations are written in a ##FieldGroupAssociations field).
 - Methylation rate VCF output: mark the context column with group 'indexed'.
 - Do not try to upgrade alignments when reading the header to concatenate permutations. This is not necessary and can
   open too many files when we are trying to concatenate alignments.
2.1.1
 - Add extract-splicing-events mode. This mode is used by GobyWeb 1.9 to extract splicing events from spliced
   Goby alignments (generated either by GSNAP or STAR at this time).
 - Trim mode: Fix bug that caused quality scores to be duplicated (the bug triggered the assertion that checks
   that sequence length equals quality length).
 - Trim mode: Some sequence must remain after trimming to append to the output.
 - Fix bug in alignment-to-annotation-counts when counts would be zero for samples whose name contained a
   period '.' The code was incorrectly stripping alignment extensions twice.
 - alignment-to-annotation-counts: add comparison description to t-test statistic column name (e.g. t-test[A/B] rather
   than t-test). This change makes it possible to retrieve the t-test p-values when more than one comparison is
   performed.
 - Fix a bug where RandomAccessAnnotations could return results on a different chromosome.
 - Add annotation loading test and fix for when annotation file is truncated. Goby now loads annotations up to
   the truncation and logs truncated lines.
 - Correct calculation for fold-change-magnitude column in goby diff exp mode. Previous calculation under-estimated
   magnitude when comparing low rpkms.
 - Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension
   (this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not
   return 404 errors for missing content).
2.1
 - Improve compression of hybrid-1 codec by about 8% on average at similar speed. You can enable this improvement with
   option -x AlignmentCollectionHandler:symbol-modeling=plus. This option will be made the default in a future release.
   It is not currently the default since Goby 2.1 has not been integrated into IGV and will need time to propagate from
   IGV dev to production builds.
 - Remove import of NH:i bam tags as read-origin-index, since the NH tag seems to contain different types of data
   depending on the aligner that produced the alignment.
 - compact-to-sam mode: fix bug where bam tags containing a colon character (:) would be truncated after the first
    colon. Thanks to Vadim Zalunin for reporting this problem.
 - compact-file-stats: Add a feature to scan only alignment headers.
 - VCFParser group associations: Make it possible to lookup an INFO column by either INFO/colname or colname.
 - NonAmbiguousAlignmentReader: fix an NPE when reading alignments where all entries have the ambiguity field.
 - Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension
   (this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not
   return 404 errors for missing content). Thanks to Jim Robinson and Helga Thorvaldsdottir for reporting this issue.
2.0.1
 - Release Goby C/C++ APIs under the LGPL license version 3 to make it possible for companies to incorporate support
   for Goby formats in their tools. Thanks to Collin Hercus for the suggestion. Please note that part of the Goby
   Java APIs are already licensed under the LGPL (anything packaged under the Goby-io.jar file).
 - C++ API: Support to set placed unmapped (i.e., mate that does not map is recorded with the read that mapped)
   and clipleft/clipright with quality scores.
 - Fix problem when using a genome backed by a samtools/picard faidx file. In some cases, read bases would be returned
   shifted by one position. Thanks to James Bonfield for reporting this problem.
 - SAM/BAM tags start at column 12, index 11. --preserve-all-tags could skip the first tag on some datasets (e.g.,
   dataset where the first tag was not a MD:Z or RG:Z). Thanks to James Bonfield for reporting this problem.
 - Introduce interface for ReadsWriter. Introduce mock implementation to write reads to text. This is useful to write
   more intelligible JUnit tests.
 - mode sam-to-compact now supports option --read-names-are-query-indices to indicate that the read names are integers
   (typically produced by compact-to-fasta from a chunk of a large file).
 - Fix a bug in reformat-compact-reads which did not trim quality scores for paired end reads correctly.
2.0
 - Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
 - Added a mode sam-comparison to compare a source SAM/BAM file with one that was generated after sam-to-compact then
   compact-to-sam.
 - Refactor AlignmentWriter to introduce an interface and make it easier to create facades that modify the behaviour
   of the default writer. For instance, such a facade is BufferedSortingAlignmentWriter, which keeps a number of entries
   in memory to re-sort these entries by genomic position. This feature is used when importing already sorted SAM/BAM
   files to create sorted Goby alignments and the files contain spliced alignments that would cause mis-ordering during
   conversion.
 - Make default chunk-size dependent on the type of chunk codec used. This is useful because hybrid compression does
   better with larger chunk sizes (default chunk size for hybrid is 30000, 20000 for bzip2 and 10000 for gzip). The
   default chunk size can be overridden with -x MessageChunksWriter:chunk-size=int
 - Add ability to preserve SAM/BAM read groups. Read groups are automatically preserved if present in the input BAM file.
   The concatenate mode automatically reassigns read_origin indices (see field read_origin_index) to prevent conflicts
   when Goby files from different origins are concatenated. The approach we use is to keep the most specific read origin
   information, and let the client decide what origins/groups are equivalent given the type of analysis at hand.
   Read groups are supported by the hybrid codec (and therefore stored very efficiently), are imported from BAM with
   sam-to-compact and are exported back to SAM/BAM with the compact-to-bam mode.
 - Add ability to preserve all BAM attributes during import and export. Use --preserve-all-tags in mode sam-to-compact
   to enable this.
 - Add ability to preserve all quality scores. Use --preserve-all-mapped-qualities in mode sam-to-compact.
 - Supports bzip2 compression in fasta-to-compact mode and sam-extract-reads (use the -x MessageChunksWriter:codec=bzip2
   dynamic option).
 - Renamed SortMode to Sort1Mode. Renamed SortLargeMode to SortMode.
 - Added SortLargeMode which can sort compact alignments of any size, multithreaded.
 - Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode
   by throwing various input BAM files at it, sorted or not, and fixed the bugs we found. For instance, the --sorted option
   would not work in some 1.9 versions of Goby after samtools/picard changed the semantics of the record comparator Goby
   relied upon to verify the input was indeed sorted by position. This made it impossible to convert already sorted BAM
   files as sorted Goby alignments.
 - Moved error messages produced when parsing the command line of a mode to after usage. This is a simple change that
   will make it easier to diagnose problems on a command line without having to scroll back up the console.
 - Prevent logging when the log4j system has not been configured. For some reason, LOG.isDebugEnabled can return true
   when the logging system is not initialized. For SamHelper, this means calling String.format millions of times to
   create debug output that is never shown. This change dramatically improves the performance of the sam-to-compact mode
   when logging is not properly configured.
 - Refactor dynamic options with a central registry, and make GobyDriver handle option parsing.
   This removes duplication of the parsing code for each mode that would need dynamic options.
 - methylation region can now estimate empirical p-values. Empirical P-values require biological replicates in at least
   one of the groups under analysis. Two passes over the data are required. In the first pass, the empirical null
   distribution is observed by comparing pairs of samples in the same group. In the second pass, this distribution is
   used to estimate the p-value of observing the between group differences. Such empirical p-values can control FWER
   in the strong sense.
 - Support empirical p-value for individual bases (VCF output). Write a DMR INFO field that stores how many significant
   sites were found in a moving window that ends at the site (significance is judged according to a configurable
   threshold on the empirical p-value).
 - New empirical-p mode to estimate p-values from data in text files. This makes it easier to derive p-values for
   simulated data or counts generated by other tools than Goby.
 - Make it possible to open Goby alignments through HTTP. Simply specify a URL as a basename as argument to the goby
   tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files
   currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
 - Fix issues with the barcode-decode mode. Add support for processing fasta/fastq files.
 - vcf methylation format: removed space in name of C and Cm group INFO fields.
 - Add a draft implementation of random access sequence interface that can read a fasta file indexed with faidx.
 - Introduce chunk codecs for protocol buffer encoded collection messages (supports both reads and alignments).
 - Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and
   to limit the number of alignment entries to output (-n).
 - The RandomAccessSequenceCache had problems with bases that weren't G/A/T/C/N. Such bases would be skipped silently,
   causing rare, but potentially significant, problems (such as on human chr 3 of the 1000g genome reference where a
   R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for bases immediately following
   the offending character. Now bases other than G/A/T/C are stored as N and maintain the position of the following
   bases. Please note that the problem was in a library used by RandomAccessSequenceCache, we updated the library in
   this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
 - last-to-compact: add option to substitute some bases with others in the aligned read.
 - Add test and fix for bug that went back to start of alignment file, even though iterate alignment was created for a
   slice of input. The problem only affected the IterateAlignments class because it was calling reposition(0,0) and the
   method did not enforce slice limits.
 - The code base was simplified by removing the now obsolete align mode.
 - Fix a problem where sample names with several dots were stripped of too many extensions. For instance, a.b.c.entries
   would be reduced to a, which could be non-unique across the remaining samples. Problem reported by Fang Fang in her
   data on GobyWeb.
 - DistinctIntValueCounterBitSet now uses LongArrayBitVector as its bit set implementation. The java BitSet implementation
   was found to throw java.lang.ArrayIndexOutOfBoundsException for indices that should fit easily in a bit array (e.g.,
   2,080,948 which can be stored with about 230 MB).
 - AlignmentEntry field insertSize is now stored in protobuf with sint32 rather than uint32 since negative values can be
   stored in this field.
 - The mode sample-quality-scores now supports .sam, .sam.gz, and .bam files to make a guess at the scale of
    the quality scores contained in the file.
 - Fixed a problem with concatenate-compact-reads that previously transferred only specific fields of a read to the
   output file. concatenate-compact-reads now transfers all fields (including pair sequence and quality score).
 - version mode now prints an official version number if the jar contains a VERSION.txt file.
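
 Illustration for the RandomAccessSequenceCache entry above: a minimal Python sketch contrasting the old
 behaviour (silently skipping bases outside G/A/T/C/N, which shifts every following position) with the new
 one (storing such bases as N, which preserves positions). The function names and the toy reference
 sequence are assumptions for this example, and upper-case input is assumed; this is not the library code.

```python
CANONICAL_BASES = frozenset("ACGT")


def store_with_placeholder(sequence: str) -> str:
    """New behaviour: every base other than A/C/G/T becomes N, positions are kept."""
    return "".join(base if base in CANONICAL_BASES else "N" for base in sequence)


def store_by_skipping(sequence: str) -> str:
    """Old, faulty behaviour: bases outside G/A/T/C/N are dropped silently."""
    return "".join(base for base in sequence if base in CANONICAL_BASES or base == "N")


reference = "ACRGT"  # 'R' is the IUPAC ambiguity code for A or G
fixed = store_with_placeholder(reference)
shifted = store_by_skipping(reference)

# Position 3 of the reference holds 'G'. It is still 'G' in the fixed version,
# but the skipping version returns the base that originally followed it.
base_after_fix = fixed[3]
base_after_skip = shifted[3]
```

 The one-position shift in the skipping version is the kind of error that affected bases following the
 R on human chr 3.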

1.9.8.3.1
 - Fix a bug related to writing paired end alignments in the Gsnap parser (C API)
1.9.8.3
 - Added a methylation_region format capable of averaging methylation rates for different cytosine contexts over
   arbitrarily defined regions.
 - Added a diploid genotype filter to use when calling genotypes in a diploid genome.
 - discover-sequence-variants format compare_groups: Write distinct fisher p-values for each comparison pair
 - Fix FDR mode output for TSV format. Make open --column-selection-filter work.
 - Fix bug that prevented methylation vcf output from writing any line.
1.9.8.2.1
 - Fix bug in GenotypesOutputFormat that caused GenotypesOutputFormat to throw an exception when processing some sites.
1.9.8.2
 - Make it possible to activate indel calling without recompilation. Mode discover-sequence-variants now accepts
   the boolean argument --call-indels true/false.
 - Preliminary support for calling indels with discover-sequence-variants.  Candidate indels are now written
   in the formats that use GenotypeOutputFormat (e.g., genotypes, compare_groups, allele_frequency).
   The method of Krawitz et al is used to determine the equivalent indel region for each possible candidate.
   After possible realignment, and filtering to remove possible errors, EIR are reported with their frequencies.
   Please be advised that the VCF spec(s) are rather vague and as a result often interpreted differently by different
   programmers. This is especially true of the parts of the specification(s) that describe how to report indels. As a
   result of this situation, you might run into problems when trying to load indel-containing VCF files generated
   with Goby into other tools.
 - vcf-subset: Add ability to exclude positions at which all samples match the reference.
 - Add a replacement for the VCF-tools VCF-subset program. The Goby tool is orders of magnitude faster.
 - Improve vcf-compare mode. Now has the ability to provide random samples of the positions that differ between the
   files being compared. Random samples are calculated for each kind of difference (missing from one file, missing
   one allele, two alleles, different genotypes)
 - vcf-compare now outputs Ti/Tv ratios for each sample in input file (in the output file only).
 - Fix scalability problem with local realignment code. Local realignment around indels would slow down as more entries
   were processed. This is now fixed so that speed is constant across large alignments.
 - Fixed index file writing. In some conditions, parts of the alignment past the 2GB mark were not accessible
   with skipTo when reading files larger than 2GB. Use the upgrade mode to fix old alignments at a specific time, or
   use Goby as usual to have alignments upgraded on the fly.
 - Add mechanism to upgrade/fix large alignments indices with Goby 1.9.8.2. The upgrade mechanism uses concatenate
   alignment to rewrite an alignment index file if the size of the entries file exceeds 2GB. This is rather slow as
   the process reads and writes large alignments, to produce the new index file. While slow, upgrading is still faster
   than aligning the reads again. The process also requires approximately double the alignment size as the new alignment
   files are written.  Alignments smaller than 2GB are quietly ignored since they were not affected by the bug.
 - Codecs: Add support to decode alignments with a codec in AlignmentReader.
 - Improved ReadsReader to find a suitable decoder when several codecs exist.
 - Prevents local realignment from running out of memory when processing positions where clonal reads create huge peaks.
 - Make filterIndels remove from sample count info object, not just from list of bases.
 - Fix VCF genotypes that could look like 0/0/1/1 to be 0/1 (seen with indels only).
 - Only write allele base count in VCF BC field when the count is not zero (useful with indels).
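   Illustration for the Ti/Tv entry above (a sketch using the standard definition; how vcf-compare computes the
   ratio internally is not described here). Transitions are purine-to-purine or pyrimidine-to-pyrimidine
   substitutions; all other single-base substitutions are transversions. Function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True when ref -> alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Return Ti/Tv for an iterable of (ref, alt) pairs, or None if there is no transversion.

    Pairs that are not single-base A/C/G/T substitutions (e.g. indels) are skipped.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```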
1.9.8.1
 - Discover-sequence-variants: add ability to describe zero, one or more group comparisons. Syntax is A/B,A/C to compare
   group A to B and group A to C. Additional pairs can be described, separated by commas.
 - Extend methyl-stats mode to estimate fraction of methylated cytosine observed in CpX contexts.
 - Discover-sequence-variants, genotype format: Fix a bug where alleleSet was cleared in each sample, rather than before
   any sample is processed. This made it possible for some positions to be ignored erroneously when samples were given
   in a specific order on the command line. Specifically, positions would be ignored if they were not typed (i.e., not
   enough good bases) in the last sample given on the command line.
 - Optimize merging of TMH when the files are large (>100M compressed).
 - Fixed a major bug where NonAmbiguousAlignmentReader would stop iterating after encountering an ambiguous alignment.
   Alignments with shorter reads were much more likely to be affected.
 - Fix sam-extract-reads for paired-end BAM files. Each BAM file contains both pairs. To convert to compact reads, the
   input BAM file must be sorted by read name, since this is the only way we can put the pairs back together in one
   Goby record.
 - Mode discover-sequence-variants now limits the maximum coverage per site in order to limit the impact on peak memory
   of a few very high coverage sites. The default setting is set to 500,000x and can be changed with
   option --max-coverage-per-site
 - Switched IndexedIdentifier to an AVLTreeMap to help scale when we have millions of elements to compare in diff exp.
 - Fixed a subtle bug in IterateSortedAlignment that would cause iteration to return partial results for some alignments
   when restricting results to a window. The problem would manifest more clearly for alignments against genomes where
   contigs have smaller indices than chromosomes and chromosome sequences are listed in non-increasing order (e.g., chr
   16 appearing before chr 10) and restricting to window from chr16 to MT (which should include chr 10 in that genome,
   but returned no result on chr 10).
 - Trim mode: Fix exception that could occur when trimming reads with no quality scores.
 - Change goby script to request the bash shell explicitly. This is needed on systems where /bin/sh is not a synonym for
   bash. Thanks to Martin Frith for catching this on Ubuntu.
 - Change how targetLengths are concatenated. It turns out that last-to-compact needs alignment entries matching
   the target to record the length in the alignment. We need to keep any length seen when we concat because the first
   chunk may just not have the length for the remaining parts.
 - Improved logic for --paired-end filename support in the fastaToCompactMode.
 - Fix an NPE in suggest-position-slices that could occur with very small alignment files.
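   Illustration for the group comparison syntax above (A/B,A/C). This is a sketch of how such a string can be
   split into pairs, not the parser used by discover-sequence-variants; the function name is hypothetical.

```python
def parse_group_comparisons(spec: str):
    """Split a string such as 'A/B,A/C' into [('A', 'B'), ('A', 'C')].

    An empty string yields no comparison, matching the "zero, one or more" wording.
    """
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split("/")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected GROUP1/GROUP2, got {item!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


print(parse_group_comparisons("A/B,A/C"))
```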
1.9.8
 - The BaseStats utility was transformed into a Goby mode (base-stats). The new mode has the ability to tally occurrence
   of CpX motifs in reads. Useful as a proxy to the amount of unconverted Cs in bisulfite converted reads.
 - The methyl-stats mode takes a VCF file produced by Goby methylation output and a genome and calculates various
   statistics about the distribution of fragment lengths between CpG interrogated by the assay.
 - FDR mode now accepts --column-selection-filter to select columns matching string.
 - Proof of principle that protocol buffer can seamlessly cohabit with data-specific compression schemes. The
   --codec option on fasta-to-compact is introduced to activate compression of reads when writing compact reads.
   The codec provided (called read-codec-1) achieves about 10-12% better compression of read files than pure
   protocol-buffer encoding. This read-codec-1 codec stores bases and quality scores with an arithmetic coder in
   a protocol buffer field called 'compressed_data'. Please note that we do not recommend using this option at
   this stage since the C/C++ APIs cannot load data encoded with this codec at this time.
 - Add ability to run alignment-to-annotation-counts on a specific genomic region (see --start-position and
   --end-position).
 - alignment-to-annotation mode has a new option (--remove-shared-segments). When active, this option will remove
   annotation segments when they partially overlap with more than one primary annotation id. When this option is
   selected and the primary id is a gene, and secondary id is an exon, the mode will remove exons that are associated
   with several genes. When the option is used with transcript id as primary and exon as secondary, exons are removed
   that are shared across different transcripts of the same gene.
 - mode base-stats now supports multiple input files.
 - VCFParser will now set column type when reading TSV files by using TabToColumnInfoMode to scan the actual values
   stored in the TSV file. The first time this is done for each file, a .colinfo file will be created and then
   used if the file is read again by VCFParser in the future.
 - Added the mode tab-to-column-info to read the data from TSV files to determine the column types
   (double/integer/string). Write a .colinfo file detailing the column names and types.
 - Upgraded to SAM JDK 1.52
 - Modes sam-to-compact and sam-extract-reads now set SILENT validation before reading file header. This is required
   because the SAM JDK validation rules are more stringent than required by the specification. This means that
   some valid SAM files (per the SAM spec) cannot be parsed without error when the strict validation is used.
 - Fixed a bug with ReadsQualityStatsMode when SampleFraction == 1.0d, such as for files with a small
   number of reads.
 - Mode sam-extract-reads now supports extracting reads from paired samples. See the new options --paired-end
   and --pair-indicator. These options work similarly to the fasta-to-compact options.
 - Fix problem with suggest-position-slices that could create empty slices.
 - Fix bug in discover-sequence-variants methylation format that wrote methylation rates only for up to two samples.
 - Fix bug in alignment-to-counts that caused problems with large alignments.
1.9.7.3
 - Fix allele frequency format to write genotype first in FORMAT per vcf spec.
 - Add new INFO fields in compare group vcf format to show allele counts in each group.
 - Add support for short versions of mode names; for example, "compact-file-stats" has the short mode
   name "cfs". There is a default short mode name generation implementation in
   AbstractCommandLineMode.getShortModeName() but each mode class can override this method in the case
   of short mode name collisions. In the case of collisions, the command line parser will not offer/accept
   ANY short mode names for the classes in question.
 - SamToCompact: Generate sorted goby alignments when a sorted BAM file is provided as input (use --sorted
   flag to activate this option). Thanks to Bradford Powell for the suggestion and draft implementation.
 - Fixed a bug in tally-reads that was triggered by reads of different lengths. Thanks to Adrian Platts for
   the bug report.
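   Illustration for the short mode name entry above. This sketch ASSUMES the default short name is built from
   the first letter of each hyphen-separated word, which matches the "compact-file-stats" -> "cfs" example but
   is not confirmed by these notes. Colliding short names are dropped for every mode involved, as described.
   Function names and the second mode name are hypothetical.

```python
from collections import Counter


def default_short_name(mode_name: str) -> str:
    """Assumed default: first letter of each hyphen-separated word."""
    return "".join(word[0] for word in mode_name.split("-") if word)


def short_name_table(mode_names):
    """Map short name -> mode name, omitting any short name claimed by more than one mode."""
    candidates = {mode: default_short_name(mode) for mode in mode_names}
    counts = Counter(candidates.values())
    return {short: mode for mode, short in candidates.items() if counts[short] == 1}


# "concat-fake-samples" is a made-up mode that collides with "compact-file-stats" on "cfs".
modes = ["compact-file-stats", "concat-fake-samples", "sort"]
print(short_name_table(modes))
```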
1.9.7.2
 - Fix realignment around indels bug that prevented reads from being realigned to the left in exome data.
   Now correctly updates the start position of the moving window.
 - Renamed AlignmentEntry.splicedAlignmentLink to AlignmentEntry.splicedForwardAlignmentLink and added
   AlignmentEntry.splicedBackwardAlignmentLink so splice links can be both bidirectional and more than
   two segments long. This change is included in the C/C++ APIs and makes it possible for GSNAP to write
   splice information to Goby alignment files.
 - FDR mode now supports reporting the top n hits irrespective of corrected q-value threshold (top n hits are
   defined by the ranking produced by ordering the hits by increasing p-value, for the last column adjusted).
 - Significantly reduced memory consumption when performing FDR BH adjustment on hundreds of millions of elements.
 - VCFWriter now writes missing value '.' in ID, ALT and FILTER fields, as required by VCF 4.1 documentation
   (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41)
   This change is required to read the files generated by Goby with the latest version of Tribble used in IGV EA.
 - AlignmentToTextMode will now display splice information.
1.9.7.1
 - alignment-to-counts now generates indexed base-level histogram files. Indexing makes it possible to jump quickly
   to a new genomic location in IGV. This is especially useful when viewing coverage for tens of tracks.
 - Filter out ambiguous reads from alignment-to-counts base level histogram output. Pre-1.9.7.1 behaviour can be
   obtained by setting the argument --filter-ambiguous-reads to false.
 - alignment-to-counts: also tried a new way to create base-level histograms from sorted alignment files.
   This turns out to be about 3 times slower than the current approach. We still keep the new approach because it
   should scale to any size alignment. Mode alignment-to-counts will use the new approach if an alignment is sorted
   and has more than 50 million aligned reads.
 - Filter out ambiguous reads from alignment-to-annotation-counts by default. Pre-1.9.7.1 behaviour can be obtained
   by setting the argument --filter-ambiguous-reads to false.
 - Add ability to switch off the recording of sampleIndex. This is useful when concat is just used to put pieces
   of a large alignment back together after splitting reads for parallel processing.
 - Do not print indices at the end of upgrade. This caused upgrade to fail on some alignments with an exception.
 - Extended IterateAlignments to create alignment reader with a configurable AlignmentReaderFactory.
 - Set the default normalization method for alignment-to-annotation-counts to bullard normalization only.
 - Fix a bug in VCFParser that affected parsing tab delimited files. Some files would be parsed with a tab in the
   value of the last column, separating the values of the last two actual columns.
1.9.7
 - Now using protobuf 2.4.1. Please upgrade your local version of protobuf if you are recompiling from sources.
 - AlignmentWriter now correctly records Goby version in header upon close(). This fixes a problem when alignments
   read from read-only files would fail upon trying a new upgrade.
 - Optimized the performance of VCFParser on files with large number of columns. The VCF format seems designed
   without performance in mind, so it is hard to come up with a reasonably fast implementation. The current
   implementation of the Goby VCF parser can only process about 8,000 lines of compressed VCF per second on
   a desktop machine.
 - AlignmentEntry schema change: a new field sample_index holds the index of the alignment from which the
   entry was read. This is useful when concatenating over multiple alignments and realigning reads that span
   indels, to reliably track the alignment origin of each entry.  The concatenation readers have been
   modified to set sample_index accordingly. Please note that the activeIndex field of the sorted reader
   is not a reliable way to identify the alignment of origin when realignment is active. Please use the
   new sample_index field instead.
 - We have added the capability to perform on the fly realignment around indels. This feature is available
   in mode discover-sequence-variants and in concatenate-alignments. The feature is activated with the new
   --processor realign_near_indels option. When the option is provided, a compressed reference genome must
   also be given on the command line (with the --genome option). This will trigger realignment of reads in
   regions where candidate indels are found by the aligner. The algorithm is very fast, in fact much faster
   than previously described approaches and consumes a reasonable amount of memory (function of maximum
   depth of coverage in the region where candidate indels are observed, but typically <2GB). Realignment
   correctly removes artefactual SNPs that can be introduced when an aligner fails to align the read ends
   properly through a read deletion. Please note that this version realigns read deletions. Realignment of
   read insertions has not been implemented.
 - Make it possible to open an alignment if the header file is present, but the entries file is missing.
   This allows reading the header only, for instance when we need to load counts and have access to targetIds.
 - Add mode to convert annotations to counts archive format.
 - Add new coverage mode to calculate coverage stats over annotation regions. When annotation regions are
   defined with capture regions, this mode outputs enrichment efficiency and depth of coverage for
   specific proportions of captured sites.
   The mode uses just .header and .count files and traverses count transitions. The algorithm used to iterate
   through count transitions is very efficient (for instance it takes about 20 seconds to estimate coverage
   stats for an alignment with ~20M aligned reads). Count files are produced with GobyWeb together with the
   alignment or with the alignment-to-counts mode.
 - Add CountBinningAdaptor, useful to bin counts on the fly at any resolution for display in IGV.
 - Added ability to record total number of bases and sites seen in count archive.
 - Added a new mode (file-to-attributes) to generate a sample attribute file suitable for loading in IGV.
   Useful when files are named with the convention attr1-attr2-attr3.counts
1.9.6.1
 - Patched VCF output for compatibility with VCF specification. Specifically, we now write . in the QUAL
   field and write genotype as the first field in the methylation output format. Additionally, we only
   write a VCF line if the site can be typed in at least one of the samples. These changes make Goby VCF
   output compatible with the IGV 2.0 VCFTrack.
 - Fix a bug in merge that could trigger an ArrayIndexOutOfBoundsException with some alignments.
   
1.9.6
 - AlignmentReaderImpl now supports full random access to an alignment. Use reposition(ref,pos) followed
   by skipTo(ref,pos) to obtain the first entry matching at location (ref,pos). Prior to 1.9.6, the
   reposition method would not reposition to a location already visited forcing clients to close the
   alignment reader and reopen it (this new behaviour should improve performance in IGV).
 - The indexing logic used in versions of Goby up to 1.9.5 (inclusive) had subtle flaws. This could cause
   the skipTo method to behave incorrectly for some alignments. For instance, if reads matched on target N
   at a position larger than the length of target N+1, these reads would not be returned by skipTo.
   Thanks to Alec Chapman for identifying these issues.
   We have corrected the problem and added additional unit tests to check the behavior of the implementation
   in various edge cases. A consequence of this change is that the new indexing logic requires recalculating
   the .index data structure for alignments sorted and indexed with a version of Goby prior to 1.9.6.
   We provide a new mode, goby upgrade, to perform these calculations and fix such alignments. To upgrade
   alignments off-line, simply do:
                                    goby 3g upgrade [files].
   This command will upgrade each alignment corresponding to the filenames provided. It skips those alignments
   produced by versions of Goby that do not require upgrading. The upgrade process creates a backup of the
   files that are affected: .index and .header are backed to .index.bak and .header.bak respectively.
   The upgrade process is relatively fast, in our tests we upgraded a 750Mb alignment file in 2'30".
 - Version 1.9.6 will try to upgrade alignments on the fly to the new version of the index data structures.
 - Detect when FastaToCompact is running in API mode versus command line. Do NOT do System.exit in API
   mode and instead throw exceptions. Also, API mode doesn't run conversions in parallel but instead runs
   them serially for easier exception catching.
 - VCFParser now splits headers by tab instead of whitespace so column names that contain spaces
   are read correctly.
1.9.5
 - Determine alignment sortedness and index state from the header and by checking that the index file exists.
   This makes it possible to recover alignments when the index file was deleted. In such cases, the alignment can
   be sorted again, which is preferable to losing the alignment data.
 - New mode simulate-reads will generate reads artificially against a reference sequence. We use this mode
   to create simulated datasets of bisulfite converted reads or mutated reads and to test that Goby produces
   the expected results.
 - Show phred scores in DisplaySequenceVariants (tab + base)
 - Add a QualityEncoding.PHRED in case one just wants to transfer quality scores without changing quality scale
 - Rewritten sam-to-compact mode that handles sequence variations better, handles bsmap sam files better,
   and handles quality score conversions more flexibly. The old mode is still around called
   sam-to-compact-old for comparison. The new mode has slightly different command line parameters.
 - Added a discover-sequence-variants mode format 'methylation' to estimate methylation rates for RRBS and
   Methyl-Seq alignments.
 - Dramatically improved TMH loading times for large alignments.
 - Completely removed support for queryLength in header. This usage was deprecated in Goby 1.7, complicates
   the code unnecessarily and is error prone (because we had two ways to store read length in the previous
   versions of Goby). Note that versions since 1.7 had a concat mode that transferred information from the
   header to the alignment entries transparently. Use this mode from a pre 1.9.4 release if you need to
   migrate a 1.6- alignment to work with Goby 1.9.5+.
 - Fixed a bug where merge-compact-alignments would throw an ArrayIndexOutOfBounds because a TMH
   query index was smaller than the first query index in the alignment.
 - Changed discover-sequence-variant mode to filter out alignment entries whose read mapped to multiple locations in the
   reference (as determined by the aligner argument (i.e., -n for gsnap)).
 - Made AlignmentReader an interface. The previous AlignmentReader class is now called AlignmentReaderImpl.
 - ConcatSortedAlignmentReader and ConcatAlignmentReader now support a configurable AlignmentReaderFactory.
   The factory makes it possible to plug in alignment readers that filter entries as they are read. The default
   factory returns all reads. However, if NonAmbiguousAlignmentReader factory is installed, the concatenate
   reader returns only entries for which the read did not match other locations in the genome. Other filtering
   behaviour can be implemented in a sub-class of AlignmentReader (see NonAmbiguousAlignmentReader for an example)
   and a factory created to return instances of this class. 
   This mechanism is used to filter out entries whose reads match several locations on the reference sequence.
 - Goby now includes a VCFParser class (see package edu.cornell.med.icb.goby.readers.vcf). VCF stands
   for Variant Call Format. The VCF format is described at http://www.1000genomes.org/node/101.
   The Goby VCFParser class implements a VCF 4.0+ parser. Importantly, this implementation also can be
   used to parse plain TSV files, or VCF that do not include the fixed VCF columns. It therefore supports
   an extended version of the VCF format that is as generic as a TSV file, but can also provide meta-information
   about the columns in the specific file. Another difference with VCF 4.0 is that we support the Group
   attribute on column fields. This makes it possible to indicate that fields are part of the same group.
   Such a feature can be used by user interfaces that would like to offer the ability to manipulate multiple
   column fields as a group (for instance to hide or show an entire group of fields).
 - FDR mode now supports VCF input files and outputs. See the option --vcf to activate processing of VCF formatted
   files.
 - Added a VCFWriter class to write files in the VCF4 format. This class is now used by discover-sequence-variants
   when writing in genotypes format. This should make it possible to use vcf-tools on the genotype files produced.
 - Fix logic for IterateSortedAlignments which, in turn, fixes sequence-variation-stats2. The issue primarily
   dealt with insertions, deletions, and left and/or right padding.
 - Fixed the logic for TAB_SINGLE_BASE in display-sequence-variation mode to report the correct
   read_index and ref_position.
1.9.4
 - The C API (used by BWA, GSNAP) has been updated to more accurately write sequence variations (this version
   fixes problems in reporting of the read index). We have created examples of how sequence variations are
   encoded in Goby alignment files. These examples are available at http://tinyurl.com/goby-sequence-variations
 - Mode concatenate-alignments now propagates names and versions of the aligners that contributed input alignments.
 - Mode sort now propagates the name and version of the aligner that produced the alignment.
 - Mode compact-file-stats now reports the name and version of the aligner that produced a Goby alignment file.
 - Mode discover-sequence-variants has been extended to support multiple types of outputs (see --format flag).
   One output format prints genotypes (--format genotypes), while another estimates the proportion of the
   reference allele in each sample (--format allele_frequencies).
 - Added a mechanism to support base filters in discover-sequence-variants. To activate these filters, you must provide
   the --eval option with the "filter" option. Two filters are currently active when --eval filter is used: one
   filters variant bases by quality score (keeping only bases with q-phred>=30) and another is a simple and efficient
   strategy to remove bases that do not quite agree across all the observations.
   Future versions will make it possible to customize the set of filters and their options.
 - sequence-variation-stats2 now runs in parallel up to the available number of threads when multiple alignments are
   given as input.
 - display-sequence-variations and sequence-variation-stats modes: Fix problems in the logic to calculate
   read-index for large insertions/deletions.
1.9.3
 - This release has a C API compatible with our development version of GSNAP. A version of GSNAP released
   after 2011-03-11 should compile with Goby 1.9.3.
 - Add new statistics for discover-sequence-variants. Notably, we now record the log odds ratio,
   the estimated standard error of the log odds ratio, as well as a Z-score for the log odds.
   Standard error and Z-score are only estimated if more than 10 counts exist in each cell of the contingency table.
   Also added the proportion of reference allele (refCount / (refCount+varCount)).
 - Fix reformat-compact-reads bug where quality scores were longer by 1 than the sequence.
 - Reduce the memory needed by compact-file-stats to determine the number of reads in a compact reads file.
 - Changed how the number of reads in an alignment file is determined by compact-file-stats. We now report the number
   stored in the alignment header.
 - Change how log2 fold change was estimated. We used to estimate as ((log2_rpkm_group_a+1)/ (log2_rpkm_group_b+1)).
   This can cause problems when log2 rpkm are negative in one group and positive in the other. We now add 1 to counts
   before calculating RPKMs and taking the log. Similar changes were done to the fold-change. RPKM columns now return
   RPKM of (count+1).
 - Mode reformat-compact-reads now takes an optional -f argument to filter reads. This option can be used to
   remove redundant reads from a compact-reads file (see tally-reads mode to produce the read filter). It is no longer
   necessary to do round-trips to fastq to remove redundant reads.
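   Illustration for the statistics entries above (a sketch, not the Goby implementation). It ASSUMES the usual
   2x2 contingency-table formulas: log odds ratio ln((refA*varB)/(varA*refB)), standard error
   sqrt(1/refA + 1/varA + 1/refB + 1/varB) and Z = logOR/SE, with SE and Z skipped unless every cell holds more
   than 10 counts. The orientation of the ratio, the RPKM formula and all function names are assumptions. The
   fold change follows the description above: add 1 to counts, compute RPKM, then take log2.

```python
import math


def log_odds_stats(ref_a, var_a, ref_b, var_b, min_count=10):
    """Return (log_odds_ratio, standard_error, z_score); unavailable values are None."""
    cells = (ref_a, var_a, ref_b, var_b)
    if any(c <= 0 for c in cells):
        return None, None, None  # the log odds ratio is undefined with an empty cell
    log_or = math.log((ref_a * var_b) / (var_a * ref_b))
    if any(c <= min_count for c in cells):
        return log_or, None, None  # too few counts to estimate SE and Z
    se = math.sqrt(sum(1.0 / c for c in cells))
    return log_or, se, log_or / se


def rpkm(count, length_bases, total_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1e9 / (length_bases * total_reads)


def log2_fold_change(count_a, count_b, length_bases, total_a, total_b):
    """log2 fold change of group A over group B, adding 1 to counts before computing RPKM."""
    return (math.log2(rpkm(count_a + 1, length_bases, total_a))
            - math.log2(rpkm(count_b + 1, length_bases, total_b)))


print(log_odds_stats(40, 20, 20, 40))
print(log2_fold_change(3, 1, 1000, 1_000_000, 1_000_000))
```

   Because 1 is added to the counts rather than to the log values, a zero count in one group still yields a
   finite fold change.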
1.9.2
 - Fixed a major bug in discover-sequence-variants that sometimes could cause confusion in the group of origin of a
   variation. This bug could affect between group p-values. A Junit test now checks for the error condition and
   is part of regression testing.   
 - sam-to-extract mode: append ".compact-reads" to output filename when the extension is missing.
 - Added a mode to display aligned reads for a region of the reference sequences. The reads are written
   in fasta format, suitable for viewing with a sequence alignment viewer such as JalView, CINEMA, etc.
   The mode is called alignment-to-pileup.
 - ConcatenateAlignmentReader would consume excessive amounts of memory when several large alignments
   (e.g., with >100 million reads) were concatenated. The reader was trying to allocate very large queryLength
   arrays, even though each underlying reader indicated that its entries carried the queryLength.
   The fix consists in detecting that all the concatenated readers support queryLength in entries, and
   not allocating these arrays at all. This is a major bug fix that makes it possible to run more
   instances of goby modes on the same server (i.e., differential expression and sequence variant discovery
   modes have significantly improved memory usage).
 - Mode sam-extract-reads now supports an optional --quality-encoding argument. Default is BAM encoding.
 - QualityEncoding now supports BAM encoding (no offset or adjustment, the value of the character in
   ascii is the Phred score).
 - Fixed sam-extract-reads. Was not extracting sequences from BAM files.
 - compact-to-fasta mode: now supports reading an arbitrary slice of input.
 - sam-to-compact mode: draft support for importing SAM files produced by BSMAP.
 - fixed a bug that prevented running sam-to-compact mode from command line. An assertion prevented the code
   from running from the command line. Clarified the text of the assertion error and read the required parameter
   from the command line argument so that the mode will run again on SAM files generated outside of Goby.
 - reformat-compact-reads must trim quality scores in the same way that it trims the sequence. Quality scores
   were not trimmed in previous versions. This is now fixed.
 - reformat-compact-reads now correctly processes sequence pairs. Sequence pairs and quality scores can now
   be trimmed in the same way as the primary sequence.
 - Expose sampleFraction via API and command line for read-quality-stats mode
 - Make fasta-to-compact mode more callable via API
 - reformat-compact-reads during 'mutate' will no longer complain when there is no sequence pair to mutate
   (mutation is neither attempted nor reported as a problem if sequence.length is zero).
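   Illustration for the BAM quality encoding entry above (a sketch, not the QualityEncoding class). With BAM
   encoding the stored value is the Phred score itself (offset 0), whereas Sanger FASTQ stores Phred+33 as a
   printable character. The constant and function names are hypothetical.

```python
BAM_OFFSET = 0      # value is the Phred score, no adjustment
SANGER_OFFSET = 33  # Sanger FASTQ: printable character = Phred + 33


def decode_quality(raw: bytes, offset: int):
    """Convert encoded quality values to Phred scores for the given offset."""
    scores = [value - offset for value in raw]
    if any(score < 0 for score in scores):
        raise ValueError("quality value below the encoding offset")
    return scores


def encode_quality(scores, offset: int) -> bytes:
    """Convert Phred scores to encoded quality values for the given offset."""
    return bytes(score + offset for score in scores)


phred = decode_quality(b"?I", SANGER_OFFSET)   # FASTQ characters '?' and 'I'
print(phred, list(encode_quality(phred, BAM_OFFSET)))
```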
1.9.1
 - fasta-to-compact mode: fix bug that prevented checking that quality scores are in the allowed range.
   Quality scores must now be converted within the correct score range before the compact-reads file can
   be written successfully.
 - Parallelize the estimation of statistics. This can speed up mode alignment-to-annotation-counts.
 - Introduced a field spliced_alignment_link and spliced_flags in AlignmentEntry to represent relation
   between parts of reads that span exon-exon junctions.
 - Introduced insert_size in Alignment entry to represent the size of the insert used when making
   the sequence library.
 - Introduced meta-data in compact-reads files. Meta-data provide a way to document how the sample
   was obtained. Suggested information to be recorded includes when the library was sequenced (useful
   to detect batch-effect, as suggested by a participant to the SEQC meeting at the NIH Bethesda campus),
   as well as sequencing instrument. Modes fasta-to-compact, compact-file-stats and reformat-compact-reads
   have been updated to define, transfer or display meta-data when appropriate.
 - Mode compact-alignment-stats now prints statistics about paired-end reads.
 - Removed spurious SAM header when writing alignments in plain text format.
1.9
 - New fdr mode provides a tool to combine tab delimited files where some columns contain P-values and
   adjust selected P-values for multiple testing with the Benjamini Hochberg method. The tool is efficient
   in that it only keeps P-values that need to be adjusted in memory, but otherwise keeps other columns on disk.
   This strategy is expected to scale to hundreds of millions of lines of information. 
 - Add a way to open only a slice of an indexed alignment file by position. This feature makes it possible
   to retrieve all alignment entries that start between specific position boundaries. See new constructor
   in AlignmentReader and ConcatSortedAlignmentReader.
 - The mode discover-sequence-variant has been updated to take advantage of the alignment position slicing
   feature introduced in Goby 1.9. See the new arguments --start-position and --end-position.    
 - Fix a bug in skipTo that caused some alignment entries to fail to be returned (skipTo previously ignored
   entries that occurred in the chunk just before where the index points). This behaviour is incorrect because
   the chunk just before where the index points may contain entries with positions equal to the skipTo requested
   position. The index contract is to return the chunk that starts with an entry with the requested location.
   Because chunks contain multiple entries with increasing positions, the chunk immediately before the indexed
   chunk must be scanned and filtered to remove entries with positions before the skipTo requested position.
   A new test was written to check for this issue (TestSkipTo.testFewSkips4).
 - Provide Building/Installation instructions for the Goby C++/C API.
 - Implemented a fast concatenation operation for read files. The new -q flag in ConcatenateCompactReadsMode
   activates the fast concatenation. Chunks of compressed data are appended without requiring decompression and
   compression of the entries. This results in much faster concatenation, bounded only by available IO.
 - Add mapping_quality field to AlignmentEntry protobuf schema.
 - Add aligner name and version in AlignmentHeader protobuf schema.
 - Added C/C++ api methods to set aligner name and version, and alignment entry mapping quality.
 - Updated the C API to be more generic, less oriented toward any one
   particular 3rd party tool. The read-API is now more generic, the write-API
   hasn't changed. The C API files, including the .h header files, have been renamed.
 - In C_Alignments.c/.h & C_CompactHelpers.h added CSamHelper and samHelper_* methods to assist
   with conversion of BWA to support CompactAlignments as the data stored in BWA just prior
   to writing alignments is effectively already in SAM format. These methods make it possible
   to reconstruct the aligned query and reference so data can be written in compact alignment.
 - Goby C/C++ API now requires the pcre (regex) >=8.10 library. See http://www.pcre.org/
 - Compact alignments now support paired-end alignments in Java / C++ / C APIs.
 - In alignment-to-text mode, added output support in PLAIN and SAM for paired-end alignments.
 - In the alignment .stats file, renamed the stat "number.aligned.reads" to the more accurate name
   of "number.alignment.entries" for both the Java API and the C++ API.
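
 Illustration of the skipTo fix listed above. This is a minimal Python sketch of the reasoning, not
 the Goby implementation: chunks are modeled as non-empty lists of positions in increasing order, the
 index is modeled as the first position of each chunk, and the name skip_to is hypothetical.

```python
from bisect import bisect_left


def skip_to(chunks, position):
    """Return every entry whose position is >= the requested position.

    Toy model (assumption): each chunk is a non-empty list of positions,
    and positions never decrease within or across chunks.
    """
    # Toy index: the first position stored in each chunk.
    starts = [chunk[0] for chunk in chunks]
    # The index points at the first chunk that starts at or after the position.
    indexed = bisect_left(starts, position)
    # The chunk just before the indexed one may still hold entries at or after
    # the requested position, so scanning starts one chunk earlier.
    first = max(indexed - 1, 0)
    result = []
    for chunk in chunks[first:]:
        # Filter out entries located before the requested position.
        result.extend(p for p in chunk if p >= position)
    return result


if __name__ == "__main__":
    toy_chunks = [[1, 3, 5], [5, 5, 8], [9, 12]]
    print(skip_to(toy_chunks, 5))
```

 In this toy data the first chunk ends with an entry at position 5. Starting the scan at the indexed
 chunk alone would miss that entry, which is the situation the fix addresses.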
1.8
 - C API introduced to support native Goby support in GSNAP.
 - We now distribute a subset of Goby as the Goby IO API. This subset is packaged in the goby-io.jar
   file and released under the LGPL3 license. This was done to make it possible to include Goby format
   input output code directly into other software licensed under the LGPL3.
 - Fixed a bug that prevented Goby from opening large alignment files (>3Gb).
 - Fixed a bug in AlignmentIterator triggered when reading alignment files with targetIndices starting at
   numbers larger than zero.
 - Removed dependency on colt (its license is not a pure LGPL license because it adds restrictions on
   military applications).
 - SGE helper scripts bz2compact.sh and keep-unique-reads.sh help process hundreds of lanes in
   parallel on an SGE grid. bz2compact extracts fastq files compressed with BZip2 and converts
   them to compact-reads format. keep-unique-reads.sh determines the set of reads that are unique
   in each input <file>.compact-reads and writes this information to a <file>.uniqset-keep.filter file.
 - Mode concatenate-compact-reads now supports read index filters. This makes it possible to
   concatenate and keep only reads that are unique within each file. 
 - Draft helper to iterate through individual reference positions of a sorted set of alignments
   (see IterateSortedAlignments).
 - Alternative implementation of sequence-variation-stats mode (called sequence-variation-stats2)
   that determines the number of reference bases matched at a given read index. This info is needed
   to call sequence variants, but slows down the stats. The initial implementation is preserved for
   compatibility.
 - New mode discover-sequence-variants will either (i) identify sequence variants within a group of samples
   or (ii) identify variants whose frequency is significantly enriched in one of two groups.
   This mode requires sorted/indexed alignments as input.
 - SamToCompact mode now populates the read quality scores for sequence variations (toQuality field).
 - Update picard/samtools to version 1.25.
 - In the mode "alignment-to-annotation-counts" the "--eval" option supports
   a new value "counts" which will output a format specifically designed
   for use with R's DESeq and notably for the R script geneDESeqAnalysis.R
   which is used with GobyWeb.
 - Fix bug in extract sequence variations for SAM format, where matches on the
   reverse strand got a read-index larger than one from the correct value.
 - By default, don't use "counts" in DiffExp as it is a specialized output for preparing for DESeq.
 - API interface for ReadsToWeightsMode.
 - LastToCompactMode wasn't writing target lengths. Fixed.
 - Read TMH in Python using Gzip.
 - Fixed Python utilities so -o actually writes to a file.
 - Added transcript-align.sh script to assist with aligning via transcripts.
 - In MessageChunksWriter, flush logic should occur on a COMPLETELY empty file, but otherwise it
   should only occur if entries have been added since the last flush(). In both C++ and Java.
 - DiffAlignmentMode can better compare differences when alignments were done by two different
   aligners and the Target Indexes are the same in label but not the same TargetIndex
   by building a master TargetIndex and translation maps for the two different alignments.
   Targets are now shown by label name instead of TargetIndex.
 - CompactFileStats --verbose on a compact alignment shows the targetIndex -> targetIdentifier
   map and also displays the targetLength for that targetIndex.
1.7
 - Extended fasta-to-compact and compact-to-fasta to handle paired end runs. See new command
   line arguments --paired-end and pair-indicator arguments in fasta-to-compact and
   --pair-output argument in compact-to-fasta.
 - Draft support for paired sequence runs. The compact file format is extended to store
   sequence, sequence length and quality scores for the paired run. This extension makes
   it possible to store both paired end runs in a single compact file. This should help
   keep the data together.
 - Implemented translation to and from Solexa quality score encoding in fasta-to-compact
   and compact-to-fasta. Thanks to Cock PJA et al NAR 2010 for the clear description of the
   Solexa base quality scores.
 - The sort mode now supports reading only a slice of an input alignment (see options
   --start-position and --end-position).
 - Refactored CompactAlignmentToAnnotationCountsMode to use IterateAlignments (provides
   large speed ups when working with sorted/indexed alignments and selecting a subset of
   reference sequences for DE).
 - IterateAlignments now takes advantage of the skipTo method when the alignment is sorted
   and indexed. This provides large performance improvements when one needs to access data
   for only a few reference sequences in an alignment. All the modes that use
   IterateAlignments benefit, including display-sequence-variations, and
   sequence-variation-stats.
 - Index alignments that are sorted upon writing. The skipTo method leverages the index
   to provide fast semi-random access to entries by genomic location. This feature is used
   by the IGV Goby plugin, which requires Goby 1.7+.
 - Concatenate alignment now produces sorted alignments if all the input alignments
   are sorted.
 - Added a mode to sort alignment by reference sequence and then by position
   on the reference sequence.
 - Support to estimate read weights described in Hansen KD et al NAR 2010.
   See http://campagnelab.org/software/goby/tutorials/estimate-heptamer-weights/
   In contrast to the initial publication, Goby supports using the weights to
   reweight annotation counts and transcript counts.
 - Support to estimate GC content weights for reads and to reweight raw counts to
   remove the dependence of counts on GC read content.
 - Preliminary support for barcoded reads (barcodes in the sequence), see new
   mode decode-barcodes (and tutorial online at
   http://campagnelab.org/software/goby/tutorials/handling-barcoded-reads/).
 - alignment-to-*-counts: New --eval argument allows specifying which statistics
   to evaluate when comparing samples.
 - alignment-to-*-counts: New eval option 'samples' will write a column per sample
   for RPKM, log2(RPKM) and raw counts. RPKM and log2(RPKM) are written once per sample
   and global normalization method.
 - Reduce memory requirements when concatenating many alignments. A change
   introduced in 1.6 caused more memory than needed to be allocated for each
   split of an alignment (as much as the number of reads in the file that
   was split). Each split now uses only as much memory as needed to keep
   query lengths for the split.
 - Dramatically improved performance for differential expression tests with millions of
   differentially expressed elements (e.g., exon+gene+other). The code previously
   incorrectly grew internal arrays from zero to the number of new DE elements described
   in the annotation file.

 Changes that impact the compact alignment format:
 - The compact file format is extended to store sequence, sequence length and quality scores
   for the paired run. This extension makes it possible to store both paired end runs in a
   single compact file. This should help keep the data together.
 - Moved query lengths from header to alignment entries. This scales much
   better when processing large alignment files (generated from more than
   a few hundred million reads).
 - The optional 'sorted' attribute in header indicates if an alignment has been sorted.
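
 Illustration of the Solexa quality score translation listed above. This is a minimal Python sketch
 based on the conversion formulas described in Cock PJA et al NAR 2010, not the code used by
 fasta-to-compact or compact-to-fasta. It works on numeric scores (not on ASCII-encoded characters),
 and the function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to a Phred quality score."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_solexa(q_phred):
    """Convert a Phred quality score to a Solexa quality score.

    Requires q_phred > 0: a Phred score of 0 has no finite Solexa
    equivalent, and math.log10 raises ValueError in that case.
    """
    return 10.0 * math.log10(10.0 ** (q_phred / 10.0) - 1.0)


if __name__ == "__main__":
    for q in (-5, 0, 10, 40):
        print(q, round(solexa_to_phred(q), 3))
```

 The two scales differ mostly for low scores and become nearly identical for high scores, so a
 translation that rounds to integers changes mainly the low-quality bases.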

1.6
 - First draft of the Goby Python API and demonstration tools (see
   directory python).
 - Fix bug where compact file stats mode reported that a compact alignment
   had query identifiers when it actually did not.
 - Added within-group-variability mode. This mode estimates Fisher P-values
   between pairs of samples taken from a group of homogeneous samples.
   Summary statistics such as average p-value, or minimum p-value are
   reported for each gene in each pair considered.
 - Update JRI.jar to version 0.8-4 which now works properly with 64-bit
   Windows.
 - Update commons-lang to version 2.5.
 - Optimized DE type storage.
 - Fixed a race condition in CompactAlignmentToAnnotationCountsMode.java
   when running in parallel by moving .reserve() out of the for loop.
 - Renamed DifferentialExpression.ElementTypes enum to ElementType
 - Fixed a bug in the DifferentialExpressionCalculator which reset
   ElementType for a value from the actual value to OTHER (it occurred
   in CompactAlignmentToAnnotationCountsMode). Now once ElementTypes
   is set for a label it cannot be changed.
 - CompactFileStatsMode now supports an optional -o to write the output
   to a file. If not specified the output will be written to stdout.
 - Reformat reads now preserves read indices from the input file.
   This is necessary when using concat alignment with
   --adjust-query-indices false

1.5
 - Added a mode to calculate counts and perform differential expression
   analysis for transcript runs (alignment-to-transcript-counts).
   Transcript runs are performed against a cDNA library. They find matches
   through exon-exon junctions represented in the input cDNA
   library. They are a faster alternative to mapping the genome and
   exon-exon boundaries separately. The disadvantage is that these searches
   will only map to transcripts represented in the input library.

 - Changes to fasta-to-compact mode:
   - Add parallel processing in fasta-to-compact mode. Use the --parallel
     flag to activate.
   - Will now only regenerate compact-reads that do not
     exist, or are older than the input file.

 - Added a mode to write a read set to text format (set-to-text). The output
   will show the multiplicity of each query index.  ReadSets can be
   efficiently created with tally-reads as before.

 - Changes to CompactAlignmentToAnnotationCountsMode
   - Added new option --write-annotation-counts boolean, defaults to
     true. If set to false the annotation counts intermediate files
     will not be written.
   - Lines where "average count group *" values are ALL NaN or <= 0 will
     not be written.  This makes it so lines that don't add anything to
     the output are just omitted.
   - Added new option --omit-non-informative-columns, defaults to false.
     If set to true, columns in which all of the data is non-informative
     (values are ALL NaN or <= 0)  will be omitted.
   - Support for alternative global normalization methods. We currently
     provide an implementation of the upper quartile normalization method
     by Bullard et al (BUQ) and the normalization method provided in
     Goby 1.4 (CAC, normalize by the number of alignment records in a sample).
     See the --normalization-methods argument. New normalization methods
     can be used with Goby by creating an implementation of the
     NormalizationMethod interface,
     and adding a jar on the classpath that defines a ServiceProvider
     (see build.xml goby-jar target for an example of how this is done).
     When several normalization methods are given as an argument
     to --normalization-methods Goby will produce derived statistics
     for each normalization method and append them as new columns in
     the summary stats output. This makes it easy to compare alternative
     normalization methods on the same dataset.

 - Added support for sequence variations:
   - Changed the compact alignment format to support recording sequence
     variations.
   - The new mode display-sequence-variations provides text output of
     sequence variations in several formats.
   - The new mode sequence-variation-stats will print statistics about
     sequence variations found in a set of alignments.

 - Added support for quality scores:
   - Changed fasta-to-compact and compact-to-fasta to read and write with
     the Sanger or Illumina quality encoding.
   - Modified aligners to indicate which format they require (bwa needs
     fastq format, lastag fasta format, lastal fastq format). This will
     need extensive testing as some of these changes can affect gobyweb.
     We use the FASTQ-SANGER encoding to communicate with lastal.
     We don't yet support the Solexa quality score encoding (it is a bit
     obsolete anyway).

   Please note that the output format in compact-to-fasta now defaults to
   Fasta format.  This format has no quality scores, and consequently, we
   now never write quality scores when Fasta is requested. The aligners
   that need quality scores must request FASTQ format explicitly.

   See also:
     http://en.wikipedia.org/wiki/FASTQ_format
     http://maq.sourceforge.net/fastq.shtml
     http://last.cbrc.jp/last/doc/last-manual.txt (look for FASTQ-SANGER)

 - Changes to the Compact format:
   - Store target/reference sequence lengths in the alignment header. This
     information is helpful when calculating statistics such as RPKMs
     (transcript-level searches).
   - Store constant query lengths as one integer. Goby 1.4.1 stored one
     length for each read. This can become very memory consuming when the
     number of reads is very large. This change saves memory and storage.
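
 Illustration of why target lengths are needed for RPKM. This is a minimal Python sketch of the usual
 RPKM definition (reads per kilobase of target per million alignment records), not Goby code. It
 assumes the sample total is the number of alignment records in the sample, and the function names
 are hypothetical.

```python
import math


def rpkm(count, target_length_bp, total_records):
    """Reads per kilobase of target per million alignment records."""
    if target_length_bp <= 0 or total_records <= 0:
        raise ValueError("target length and total record count must be positive")
    # count / (length / 1e3) / (total / 1e6) simplifies to the expression below.
    return count * 1e9 / (target_length_bp * total_records)


def log2_rpkm(count, target_length_bp, total_records):
    """log2 of RPKM; only defined when count is greater than zero."""
    return math.log2(rpkm(count, target_length_bp, total_records))


if __name__ == "__main__":
    # 100 records on a 2,000 bp transcript in a sample of 10 million records.
    print(rpkm(100, 2000, 10_000_000))
```

 Without the target length, only raw counts can be compared, which favors long transcripts.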

1.4.1
 - Added a mode to write a read set to text format (set-to-text). The
   output will show the multiplicity of each query index.  ReadSets can
   be efficiently created with tally-reads as before.

1.4
 - Last aligner (http://last.cbrc.jp/) is now supported "out of the box".
   Tested against version last-96.  Support for the enhanced version
   "lastag" still exists.
 - Alignment-to-annotation-counts mode now computes a p-value using R
   (if available on the host)
 - Update to protobuf 2.3.0 (http://code.google.com/p/protobuf/)
 - Default extension for files written in Wiggle Track Format is now ".wig"
   for easier integration with the Integrative Genomics Viewer
   (http://www.broadinstitute.org/igv/).
   Similarly, the default extension for BedGraph Track Format files is
   now ".bed".

1.3
 - New "counts-to-bedgraph" mode which is similar to "counts-to-wiggle" but
   writes the data in "bedgraph" format, which is another format the
   Genome Browser accepts.
 - New mode "version" to write the jar's version number to stdout

 - counts-to-wiggle mode:
   - Write at most one entry per resolution-sized window of data (averaging
     the data in that window)
   - Don't write data past the end of the size of the chromosome (which
     is possible with resolution > 1)

 - compact-alignment-to-annotation-counts mode:
   - Fixed problem with BH FDR adjustment caused by NaN p-values.
   - ChiSquare test p-values are now correctly reported.
   - Adjusted P-values (Bonferroni and BH) are set to 1.0 if they would be
     larger than 1.
   - Added magnitude of fold change to group comparison tsv output.
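
 Illustration of the P-value adjustments mentioned above. This is a minimal Python sketch of the
 Bonferroni and Benjamini-Hochberg (BH) adjustments, not the Goby implementation. It assumes that
 NaN p-values are left out of the number of tests and are returned unchanged; the function names
 are hypothetical.

```python
import math


def bonferroni(pvalues):
    """Bonferroni adjustment; NaN entries are ignored and returned as NaN."""
    m = sum(1 for p in pvalues if not math.isnan(p))
    # Adjusted values are capped at 1.0.
    return [p if math.isnan(p) else min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """BH FDR adjustment; NaN entries are ignored and returned as NaN."""
    # Keep (p-value, original index) pairs for the valid p-values only.
    valid = sorted((p, i) for i, p in enumerate(pvalues) if not math.isnan(p))
    m = len(valid)
    adjusted = [float("nan")] * len(pvalues)
    running_min = 1.0  # starting at 1.0 also caps adjusted values at 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # so that adjusted values never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        p, i = valid[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, float("nan"), 0.03, 0.5]
    print(bonferroni(raw))
    print(benjamini_hochberg(raw))
```

 Filtering NaN values before ranking keeps them from disturbing the sort order, which is one way
 NaN p-values can break an FDR adjustment.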

1.2
 - compact-alignment-to-annotation-counts mode:
   - Added chi square test statistic and associated FDR adjusted stat.
     Chi-square statistics support multi-group comparisons.
   - Added the --parallel option to speed up computations on multiple core
     machines.

1.1
 - compact-alignment-to-annotation-counts mode:
   - Make it possible to process multiple alignment files in one run of
     the mode.
   - Added support for group comparisons. Group statistics are now computed
     and written to a summary file (see --comparison --stats and --groups
     options). The following statistics have been implemented: T-Test and
     fold-change across RPKMs in the comparison groups, Benjamini-Hochberg
     FDR adjustment for t-test P-value and Bonferroni correction for t-test
     P-value. Average RPKM in each group.

 - Fix a bug where data matching chromosome "chr1" was excluded from wiggle
   tracks created from Goby count data.  (Mantis issue #1349)

1.0
 - First public release.