Errors in missing vs pad values in VCF
jeromekelleher committed Feb 15, 2024
1 parent a386541 commit 3279fe1
Showing 3 changed files with 250 additions and 74 deletions.
18 changes: 13 additions & 5 deletions docs/changelog.rst
@@ -36,8 +36,16 @@ Improvements
- Improve performance of :func:`variant_stats` and :func:`sample_stats` functions.
(:user:`timothymillar`, :pr:`1119`, :issue:`1116`)

-.. Bug fixes
-.. ~~~~~~~~~
+Bug fixes
+~~~~~~~~~
+
+- Fix error in missing data handling for VCF. Missing values for most
+fields were marked as the corresponding "fill" value. For example, missing
+string values were stored as the empty string (string fill value) rather
+than "." (string missing value). Similarly for integer fields, missing
+values were stored as -2 (int fill) rather than -1 (int missing)
+(:user:`jeromekelleher`, :pr:`1190`, :issue:`1192`).
+

.. Documentation
.. ~~~~~~~~~~~~~
@@ -106,7 +114,7 @@ Deprecations
parameter now expects a full sized kinship matrix in which non-founder values are
ignored.
(:user:`timothymillar`, :pr:`1075`, :issue:`1061`)

Improvements
~~~~~~~~~~~~

@@ -190,9 +198,9 @@ Breaking changes
(:user:`timothymillar`, :pr:`995`, :issue:`875`)
- The ``genotype_count`` variable has been removed in favour of
:data:`sgkit.variables.variant_genotype_count_spec` which follows VCF ordering
-(i.e., homozygous reference, heterozygous, homozygous alternate for biallelic,
+(i.e., homozygous reference, heterozygous, homozygous alternate for biallelic,
diploid genotypes).
-:func:`hardy_weinberg_test` now defaults to using
+:func:`hardy_weinberg_test` now defaults to using
:data:`sgkit.variables.variant_genotype_count_spec` for the ``genotype_count``
parameter. (:user:`timothymillar`, :issue:`911`, :pr:`1002`)

6 changes: 3 additions & 3 deletions sgkit/io/vcf/vcf_reader.py
@@ -240,7 +240,7 @@ def for_field(
dims.append(dimension)
chunksize += (size,)

-array = np.full(chunksize, fill_value, dtype=dtype)
+array = np.full(chunksize, missing_value, dtype=dtype)

return InfoAndFormatFieldHandler(
category,
@@ -304,7 +304,7 @@ def add_variant(self, i: int, variant: Any) -> None:
val if val is not None else self.missing_value
)
else:
-self.array[i] = self.fill_value
+self.array[i] = self.missing_value
elif self.category == "FORMAT":
val = variant.format(self.key)
if val is not None:
@@ -327,7 +327,7 @@ def add_variant(self, i: int, variant: Any) -> None:
a = a[..., : self.array.shape[-1]] # trim to fit
self.array[i, ..., : a.shape[-1]] = a
else:
-self.array[i] = self.fill_value
+self.array[i] = self.missing_value

def truncate_array(self, length: int) -> None:
self.array = self.array[:length]
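
The distinction this commit restores, between "missing" (the value is absent from the VCF) and "fill" (padding for rows shorter than the array width), can be illustrated with a small standalone sketch. It is not sgkit's implementation: `store_int_field` is a hypothetical helper, and the sentinel constants are defined locally using the integer values quoted in the changelog entry (-1 for missing, -2 for fill) rather than imported from the library.

```python
import numpy as np

# Sentinel values as quoted in the changelog entry; defined locally for this
# sketch rather than imported from sgkit.
INT_MISSING = -1  # the value is absent in the VCF record
INT_FILL = -2     # padding for rows shorter than the array width


def store_int_field(values, max_len):
    """Hypothetical helper: pack per-variant integer lists into a 2D array."""
    # Pre-fill with "missing" so variants that lack the field stay missing.
    array = np.full((len(values), max_len), INT_MISSING, dtype=np.int32)
    for i, val in enumerate(values):
        if val is None:
            continue  # field absent for this variant: the row stays missing
        row = [INT_MISSING if v is None else v for v in val[:max_len]]
        array[i, : len(row)] = row
        array[i, len(row):] = INT_FILL  # pad the unused tail with fill
    return array


result = store_int_field([[1, 2, 3], [4, None], None], max_len=3)
print(result)
```

Under these assumptions, the third variant (field absent) is stored as a row of -1, while -2 appears only as padding after the second variant's two entries.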