Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Histogram with IntCategory and StrCategory axes can be zeroed by out-of-bounds fill values. #960

Open
NJManganelli opened this issue Sep 14, 2024 · 3 comments

Comments

@NJManganelli
Copy link

NJManganelli commented Sep 14, 2024

Describe the bug

When an IntCategory axis with growth=False and overflow=False is filled with values that fall outside the defined bins (which should be a valid combination, and is useful in certain contexts to 'ignore' invalid values without explicit masking of them), the result can be 0 for the entire histogram, at least when a StrCategory axis and Weight storage are used. This directly appears with using named versions of these axes types in scikit-hep/hist, and produces a stranger error in dask_histogram.

Steps to reproduce

This also includes the reproducer for hist and dask_histogram, to demonstrate some of the discrepancies

import awkward as ak
import dask_awkward as dak
import numpy as np
import hist
from hist.hist import Hist as CHist
from hist.dask import Hist as DHist
import boost_histogram as bh
from boost_histogram import Histogram as BHist
import dask

include_out_of_bounds_bin = False

mpc = ak.from_iter([-1, 93, 185, 74, 48, 67, 231, 128, 170, 213, 120, 17]*1000)
ptc = ak.Array(np.random.normal(loc=200, scale=40, size=len(mpc)))


model_pointlazy = dak.from_awkward(mpc, npartitions=33)
ptlazy = dak.from_awkward(ptc, npartitions=33)
wgtlazy = ak.ones_like(model_pointlazy, dtype=float)

if include_out_of_bounds_bin:
    axvals = [-1, 93, 185, 74, 48, 67, 231, 128, 170, 213, 120, 17]
else:
    axvals =     [93, 185, 74, 48, 67, 231, 128, 170, 213, 120, 17]

common = [
    hist.axis.StrCategory(['nominal'], name='systematic', label='Systematic'),
    #hist.axis.Regular(100, 20, 520, name='pt', label=r'1st Jet $p_T$ [GeV]'),
    hist.storage.Weight(),
    ]

axesstor = [
    hist.axis.IntCategory(axvals,
                          name="model_point",
                          label="{'uncategorized': -1, 'TChiZH_200_74': 93, 'TChiZH_250_124': 185, 'TChiZH_300_150': 74, 'TChiZH_400_200': 48, 'TChiZH_450_300': 67, 'TChiZH_600_1': 231, 'TChiZH_600_400': 128, 'TChiZH_750_450': 170, 'TChiZH_800_200': 213, 'TChiZH_900_1': 120, 'TChiZH_1500_550': 17}::Gen Model",
                          growth=False,
                          flow=True),
] + common
axesstornoflow = [
    hist.axis.IntCategory(axvals,
                          name="model_point",
                          label="{'uncategorized': -1, 'TChiZH_200_74': 93, 'TChiZH_250_124': 185, 'TChiZH_300_150': 74, 'TChiZH_400_200': 48, 'TChiZH_450_300': 67, 'TChiZH_600_1': 231, 'TChiZH_600_400': 128, 'TChiZH_750_450': 170, 'TChiZH_800_200': 213, 'TChiZH_900_1': 120, 'TChiZH_1500_550': 17}::Gen Model",
                          growth=False,
                          flow=False),
] + common 

bint = bh.axis.IntCategory(axvals, growth=False, overflow=True)
bintnoflow = bh.axis.IntCategory(axvals, growth=False, overflow=False)
bstr = bh.axis.StrCategory(['nominal'])
bstor = bh.storage.Weight()


bhistunmasked, bhistmasked = BHist(bint, bstr, storage=bstor), BHist(bint, bstr, storage=bstor)
bhistunmaskednoflow, bhistmaskednoflow = BHist(bintnoflow, bstr, storage=bstor), BHist(bintnoflow, bstr, storage=bstor)
chistunmasked, chistmasked = CHist(*axesstor), CHist(*axesstor)
chistunmaskednoflow, chistmaskednoflow = CHist(*axesstornoflow), CHist(*axesstornoflow)
dhistunmasked, dhistmasked = DHist(*axesstor), DHist(*axesstor)
dhistunmaskednoflow, dhistmaskednoflow = DHist(*axesstornoflow), DHist(*axesstornoflow)


msklazy = (model_pointlazy > -1)
dhistmasked.fill(**{
    "model_point": model_pointlazy[msklazy],
    "systematic": "nominal",
    "weight": wgtlazy[msklazy],
})
dhistunmasked.fill(**{
    "model_point": model_pointlazy,
    "systematic": "nominal",
    "weight": wgtlazy,
})
dhistmaskednoflow.fill(**{
    "model_point": model_pointlazy[msklazy],
    "systematic": "nominal",
    "weight": wgtlazy[msklazy],
})
dhistunmaskednoflow.fill(**{
    "model_point": model_pointlazy,
    "systematic": "nominal",
    "weight": wgtlazy,
})

tocompute = {"dhistmasked": dhistmasked, "dhistunmasked": dhistunmasked,
             "dhistmaskednoflow": dhistmaskednoflow, "dhistunmaskednoflow": dhistunmaskednoflow,
             "pt": ptlazy, "model_point": model_pointlazy, "wgt": wgtlazy}
comped = dask.compute(tocompute)[0]

model_point, pt, wgt = comped["model_point"], comped["pt"], comped["wgt"]

msk = (model_point > -1)
chistmasked.fill(**{
    "model_point": model_point[msk],
    "systematic": "nominal",
    "weight": wgt[msk],
})
chistunmasked.fill(**{
    "model_point": model_point,
    "systematic": "nominal",
    "weight": wgt,
})
chistmaskednoflow.fill(**{
    "model_point": model_point[msk],
    "systematic": "nominal",
    "weight": wgt[msk],
})
chistunmaskednoflow.fill(**{
    "model_point": model_point,
    "systematic": "nominal",
    "weight": wgt,
})

bhistmasked.fill(
    model_point[msk],
    "nominal",
    weight=wgt[msk],
)
bhistunmasked.fill(
    model_point,
    "nominal",
    weight=wgt,
)
bhistmaskednoflow.fill(
    model_point[msk],
    "nominal",
    weight=wgt[msk],
)
bhistunmaskednoflow.fill(
    model_point,
    "nominal",
    weight=wgt,
)

print_flow = False

print("histogram | BoostHistogram | Hist | DaskHistogram")
for key in ["histmasked",  "histmaskednoflow", "histunmasked", "histunmaskednoflow"]:
    ckey = f"c{key}"
    dkey = f"d{key}"
    bkey = f"b{key}"
    b = eval(bkey)
    c = eval(ckey)
    d = comped[dkey]
    print(key, b.sum().value, c.sum(flow=print_flow).value, d.sum(flow=print_flow).value)

This produces the following output, where the unmasked fill into the histogram without an overflow on the IntCategory produces 0 sum for (boost-)hist; and a junk-like value in DaskHistogram (presumably the partitioned fill plays an additional role here)

histogram | BoostHistogram | Hist | DaskHistogram
histmasked 11000.0 11000.0 11000.0
histmaskednoflow 11000.0 11000.0 11000.0
histunmasked 11000.0 11000.0 11000.0
histunmaskednoflow 0.0 0.0 7337.0

This bug was reproduced in 1.4.1 of boost-histogram, on an M1 Max chip, using MacOS 14.4; it was originally discovered on AlmaLinux8 (on Fermilab's LPC cluster) using a coffea container. Other principle libraries used are a mix of development versions corresponding mostly to the latest git version from approximately a month ago, and can be provided if necessary, but given the reproduction across different environments, it seems to be a relatively 'stable' bug

@NJManganelli NJManganelli changed the title [BUG] [BUG] Histogram with IntCategory and StrCategory axes can be zeroed by out-of-bounds fill values. Sep 14, 2024
@NJManganelli
Copy link
Author

A reproducer upstream in boost::histogram is beyond my time availability, but I'd expect it's equally true there, so I'm posting this bug report here as much for user-visibility as technical lack of a strict reproducer there.

@matthewfeickert
Copy link
Member

Other principle libraries used are a mix of development versions corresponding mostly to the latest git version from approximately a month ago, and can be provided if necessary, but given the reproduction across different environments, it seems to be a relatively 'stable' bug

@NJManganelli Just for developer time savings, can you provide the minimal install list required to reproduce this? (So the high level requirements.txt that you would pip install from, not the output of pip list.)

@NJManganelli
Copy link
Author

The core installs are as follows:

hist
uproot5
coffea
mplhep
awkward
dask-awkward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants