Implement several methods to efficiently compute various descriptive statistics given PixelSelector object #129

robomics · 2024-11-17T21:03:21Z

This is supposed to be similar to the describe method from scipy.stats.
The following statistics are supported:

nnz
sum
min
max
mean
variance
skewness
kurtosis

All statistics are computed using non-zero pixels.
If you need statistics including zeros, you are better off fetching
interactions as a sparse matrix with sel.to_csr() and computing the
required statistics using the sparse matrix.
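
As a rough illustration of the arithmetic involved, statistics that include zeros can be derived from the non-zero values plus the total number of matrix cells. The helper below is hypothetical and not part of hictkpy; it works on plain Python numbers rather than on the matrix returned by sel.to_csr(), and it assumes the sample (ddof=1) variance is wanted.

```python
from typing import Iterable, Tuple


def mean_var_with_zeros(
    nonzero_values: Iterable[float], total_cells: int
) -> Tuple[float, float]:
    """Mean and sample variance over all cells, counting absent cells as zeros.

    Hypothetical helper for illustration only; not part of hictkpy.
    """
    n_nonzero = 0
    total = 0.0
    total_sq = 0.0
    for v in nonzero_values:
        n_nonzero += 1
        total += v
        total_sq += v * v

    if total_cells < 2 or n_nonzero > total_cells:
        raise ValueError("total_cells must be >= 2 and >= the number of non-zero values")

    # Zeros add nothing to the sums; they only enlarge the cell count.
    mean = total / total_cells
    variance = (total_sq - total_cells * mean * mean) / (total_cells - 1)
    return mean, variance
```

The sum-of-squares formula is kept for brevity; it can lose precision on large inputs, so a numerically stable routine operating on the sparse matrix is preferable for real data.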

The main advantage of using hictkpy's describe() over other methods is that
all statistics are computed (or estimated) by traversing the data only once
(and without caching pixels).
All statistics except variance, skewness, and kurtosis are guaranteed to
be exact.
Variance, skewness, and kurtosis are not exact because they are estimated
using the accumulator library from Boost.
However, in practice, the estimates are usually very accurate (in my
tests, the relative error was always < 1.0e-4, and typically < 1.0e-6).
The estimates can be inaccurate when the sample size is very small.
For the time being, working around this issue is the user's responsibility.
Example:

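import hictkpy as hictk  # assumed alias for the hictk name used below
import scipy.stats as ss  # assumed alias for the ss name used below
f = hictk.File("test.cool")
stats = ss.describe(list(p.count for p in f.fetch()))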
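
To make the single-pass idea above concrete, here is a minimal sketch of an accumulator that updates every statistic as each value arrives, without storing the values. It is not the Boost accumulator code used by this PR; it uses the standard online update formulas for central moments instead. The class name is hypothetical, and the conventions (sample variance, biased skewness, excess kurtosis, i.e. the scipy.stats.describe defaults) are assumptions.

```python
import math


class RunningStats:
    """Single-pass accumulator (hypothetical; not the PR's implementation)."""

    def __init__(self) -> None:
        self.nnz = 0
        self.sum = 0.0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        # Running sums of powers of deviations from the mean.
        self._m2 = 0.0
        self._m3 = 0.0
        self._m4 = 0.0

    def update(self, x: float) -> None:
        n1 = self.nnz
        self.nnz += 1
        n = self.nnz
        self.sum += x
        self.min = min(self.min, x)
        self.max = max(self.max, x)

        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before the lower ones they depend on.
        self._m4 += (
            term1 * delta_n2 * (n * n - 3 * n + 3)
            + 6 * delta_n2 * self._m2
            - 4 * delta_n * self._m3
        )
        self._m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self._m2
        self._m2 += term1

    @property
    def variance(self) -> float:
        # Sample variance (ddof=1); requires at least two values.
        return self._m2 / (self.nnz - 1)

    @property
    def skewness(self) -> float:
        return math.sqrt(self.nnz) * self._m3 / self._m2 ** 1.5

    @property
    def kurtosis(self) -> float:
        # Excess (Fisher) kurtosis.
        return self.nnz * self._m4 / (self._m2 * self._m2) - 3.0
```

Feeding it the counts from a query, for example by calling update(p.count) for each pixel, yields all eight statistics after one traversal.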

Another important feature of hictkpy's describe() is its ability to
recognize scenarios where the required statistics can be computed
without traversing all pixels overlapping the given query.

For example, as soon as the first pixel with a NaN count is encountered,
we can stop updating the estimates for all statistics except nnz.
This means that if we do not have to compute nnz, then describe() can return
as soon as the first NaN pixel is found.
If we have to compute nnz, then we must traverse all pixels.
However, we can still reduce the amount of work performed by describe()
by taking advantage of the fact that we only need to count pixels.
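
The short-circuit logic described above can be sketched as follows, using sum and nnz as the only two statistics. The function name and the metrics argument are hypothetical, and the sketch assumes that NaN pixels are kept and still counted by nnz; it only illustrates the control flow, not the actual implementation.

```python
import math
from typing import Dict, Iterable, Set


def sum_and_nnz(counts: Iterable[float], metrics: Set[str]) -> Dict[str, float]:
    """Compute the requested subset of {"sum", "nnz"} in one pass.

    Hypothetical sketch: once a NaN is seen the sum is known to be NaN,
    so only the pixel counter (if requested) still needs updating.
    """
    total = 0.0
    nnz = 0
    nan_found = False

    for c in counts:
        nnz += 1
        if nan_found:
            # Cheap path: nothing left to do except counting pixels.
            continue
        if math.isnan(c):
            nan_found = True
            total = math.nan
            if "nnz" not in metrics:
                # No statistic can change any more: stop traversing.
                break
            continue
        total += c

    result: Dict[str, float] = {}
    if "sum" in metrics:
        result["sum"] = total
    if "nnz" in metrics:
        result["nnz"] = nnz
    return result
```

When nnz is not requested the loop exits at the first NaN; otherwise the remaining pixels are visited but only counted.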

We recommend using describe() when computing multiple statistics at the
same time.
When computing a single statistic we recommend using one of nnz(), sum(),
min(), max(), mean(), variance(), skewness(), or kurtosis().

All methods computing stats from a PixelSelector accept keep_nans and
keep_infs parameters, which can be used to customize how non-finite values
are handled.
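
One plausible reading of these two parameters is a filter applied to the counts before any statistic is updated. The sketch below illustrates that reading only: the function is hypothetical, and both the default values and the exact semantics are assumptions rather than something stated in this PR.

```python
import math
from typing import Iterable, Iterator


def filter_counts(
    counts: Iterable[float], keep_nans: bool = False, keep_infs: bool = False
) -> Iterator[float]:
    """Yield counts, dropping NaN and/or infinite values unless asked to keep them.

    Hypothetical helper; defaults are assumed, not taken from hictkpy.
    """
    for c in counts:
        if math.isnan(c) and not keep_nans:
            continue
        if math.isinf(c) and not keep_infs:
            continue
        yield c
```

Under this reading, sum(filter_counts(values)) ignores non-finite values, while passing keep_nans=True lets a single NaN propagate into the result.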


robomics · 2024-11-17T21:54:26Z

Before merging this, it would be good to do/investigate the following:

Detect small samples and re-compute estimated metrics that may be inaccurate
Add describe() to the fuzz suite

robomics added 4 commits November 17, 2024 21:34

Add tests

3a77a2b

Add script used to help generate the test cases for describe()

13a1f4b

Update docs

fe8aff3

robomics added the enhancement New feature or request label Nov 17, 2024

robomics linked an issue Nov 17, 2024 that may be closed by this pull request

[feature] Improve handling of NaNs when calling sum() and nnz() on PixelSelectors #106

Open

robomics added 2 commits November 17, 2024 22:30

Refactor

6e003df

Fix MSVC builds

977df75

robomics marked this pull request as draft November 17, 2024 21:52
