Implement several methods to efficiently compute various descriptive statistics given PixelSelector object #129

Draft
wants to merge 6 commits into
base: main
from

Commits on Nov 17, 2024

  1. Implement PixelSelector::describe()

    This method is modeled after describe() from scipy.stats.
    The following statistics are supported:
    - nnz
    - sum
    - min
    - max
    - mean
    - variance
    - skewness
    - kurtosis
    
    All statistics are computed using non-zero pixels.
    If you need statistics including zeros, you are better off fetching
    interactions as a sparse matrix with `sel.to_csr()` and computing the
    required statistics using the sparse matrix.
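As a rough illustration of the zero-inclusive route, the sketch below computes a mean and a population variance that count zeros. It assumes `sel.to_csr()` returns a SciPy CSR matrix; the small matrix here is a made-up stand-in, not real data.

```python
import numpy as np
from scipy import sparse

# Stand-in for the matrix returned by sel.to_csr(); the values are invented.
m = sparse.csr_matrix(np.array([[0, 2, 0], [4, 0, 0], [0, 0, 6]], dtype=float))

n = m.shape[0] * m.shape[1]    # number of cells, zeros included
nnz = m.count_nonzero()        # number of non-zero entries
total = m.sum()

mean_nonzero = total / nnz     # mean over non-zero pixels only
mean_with_zeros = total / n    # zeros contribute to the denominator

# Population variance including zeros, via E[x^2] - E[x]^2.
sum_sq = m.multiply(m).sum()
var_with_zeros = sum_sq / n - mean_with_zeros**2
```

Only the stored entries are touched; the zeros enter through the cell count `n`.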
    
    The main advantage of using hictkpy's describe() over other methods is
    that all statistics are computed (or estimated) by traversing the data
    only once (and without caching pixels).
    All statistics except variance, skewness, and kurtosis are guaranteed to
    be exact.
    Variance, skewness, and kurtosis are not exact because they are estimated
    using the accumulator library from Boost.
    However, in practice the estimates are usually very accurate (in my
    tests, the relative error was always < 1.0e-4, and typically < 1.0e-6).
    The estimates can be inaccurate when the sample size is very small.
    For the time being, working around this issue is the user's responsibility.
    Example using scipy.stats.describe() instead:
    
    ```python3
    import hictkpy
    import scipy.stats as ss
    f = hictkpy.File("test.cool")
    stats = ss.describe([p.count for p in f.fetch()])  # loads all counts in memory
    ```
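To make the single-pass idea concrete, here is a minimal sketch that accumulates all eight statistics in one traversal without storing the values. It does not reproduce the Boost accumulators used by the commit: it uses the well-known Welford/Pébay streaming updates for the central moments instead, and the function name `one_pass_moments` is hypothetical.

```python
import math

def one_pass_moments(values):
    """Single-pass statistics using streaming central-moment updates."""
    n = 0
    total = 0.0
    lo, hi = math.inf, -math.inf
    mean = m2 = m3 = m4 = 0.0
    for x in values:
        n1 = n
        n += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update order matters: m4 needs the old m2 and m3, m3 needs the old m2.
        m4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    has_spread = n > 1 and m2 > 0
    return {
        "nnz": n,
        "sum": total,
        "min": lo,
        "max": hi,
        "mean": mean if n > 0 else math.nan,
        "variance": m2 / (n - 1) if n > 1 else math.nan,           # sample variance
        "skewness": math.sqrt(n) * m3 / m2**1.5 if has_spread else math.nan,
        "kurtosis": n * m4 / (m2 * m2) - 3.0 if has_spread else math.nan,  # excess
    }

stats = one_pass_moments([1.0, 2.0, 4.0, 8.0, 16.0])
```

The loop keeps only a handful of scalars, so memory use is constant regardless of how many pixels the query overlaps.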
    
    Another important feature of hictkpy's describe() is its ability to
    recognize scenarios where the required statistics can be computed
    without traversing all pixels overlapping with the given query.
    
    For example, as soon as the first pixel with a NaN count is encountered,
    we can stop updating the estimates for all statistics except nnz.
    This means that if we do not have to compute nnz, then describe() can return
    as soon as the first NaN pixel is found.
    If we have to compute nnz, then we must traverse all pixels.
    However, we can still reduce the amount of work performed by describe()
    by taking advantage of the fact that we only need to count pixels.
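The early-exit logic described above could look roughly like the following. This is a hypothetical sketch, not hictkpy's implementation: the name `describe_sketch` is invented, only nnz, sum, and mean are handled, and it assumes NaN pixels are kept and counted towards nnz. The `visited` value is returned only to show how many pixels were read.

```python
import math

def describe_sketch(counts, metrics):
    """Hypothetical early-exit traversal; returns (stats, pixels_visited)."""
    need_nnz = "nnz" in metrics
    nnz = 0
    total = 0.0
    saw_nan = False
    visited = 0
    for c in counts:
        visited += 1
        nnz += 1
        if saw_nan:
            continue      # only counting pixels from here on
        if math.isnan(c):
            saw_nan = True
            total = math.nan
            if not need_nnz:
                break     # every requested statistic is already NaN
            continue
        total += c
    stats = {}
    if need_nnz:
        stats["nnz"] = nnz
    if "sum" in metrics:
        stats["sum"] = total
    if "mean" in metrics:
        stats["mean"] = total / nnz if nnz else math.nan
    return stats, visited

counts = [1.0, math.nan, 3.0, 4.0]
stats_no_nnz, visited_no_nnz = describe_sketch(counts, {"sum"})
stats_nnz, visited_nnz = describe_sketch(counts, {"nnz", "sum"})
```

Without nnz the loop stops at the second pixel; with nnz it reads all four but does no arithmetic after the NaN.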
    
    We recommend using describe() when computing multiple statistics at the
    same time.
    When computing a single statistic, we recommend using one of nnz(), sum(),
    min(), max(), mean(), variance(), skewness(), or kurtosis().
    
    All methods computing statistics from a PixelSelector accept the keep_nans
    and keep_infs parameters, which can be used to customize how non-finite
    values are handled.
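One plausible reading of these two parameters is a filter applied to the counts before any statistic is updated, as sketched below. The helper `filter_counts`, its defaults, and the drop-when-False semantics are assumptions made for illustration; the commit message does not spell them out.

```python
import math

def filter_counts(counts, keep_nans=False, keep_infs=False):
    """Hypothetical filter: drop non-finite counts unless asked to keep them."""
    for c in counts:
        if math.isnan(c) and not keep_nans:
            continue
        if math.isinf(c) and not keep_infs:
            continue
        yield c

raw = [1.0, math.nan, math.inf, 2.0]
finite_only = list(filter_counts(raw))
with_infs = list(filter_counts(raw, keep_infs=True))
with_nans = list(filter_counts(raw, keep_nans=True))
```

Under this reading, keeping NaNs is what makes the early-return path relevant, since a dropped NaN never reaches the accumulators.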
    robomics committed Nov 17, 2024
    11192a2
  2. Add tests

    robomics committed Nov 17, 2024
    3a77a2b
  3.
    13a1f4b
  4. Update docs

    robomics committed Nov 17, 2024
    fe8aff3
  5. Refactor

    robomics committed Nov 17, 2024
    6e003df
  6. Fix MSVC builds

    robomics committed Nov 17, 2024
    977df75