Way to request stat_bin() to inherit breaks from the scale #6159

Open
arcresu opened this issue Oct 25, 2024 · 1 comment


arcresu commented Oct 25, 2024

I often need to produce histograms where the x axis uses a date scale, typically binned by day, week, or month. The only sensible end result is where the scale's breaks align with the bins, but the existing methods I'm aware of for getting there are a bit fragile:

library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

Use geom_bar() and a binned scale

ggplot(df, aes(date)) +
  geom_bar() +
  scale_x_binned(
    transform = scales::transform_date(),
    breaks = scales::breaks_width("1 week")
  )
#> Warning in scale_x_binned(transform = scales::transform_date(), breaks = scales::breaks_width("1 week")): Ignoring `n.breaks`. Use a breaks function that supports setting number of
#> breaks.

Nice because the binning is specified only once, but now the whole scale is binned, so I can't for example add a geom_vline() to mark a specific date on the axis, since the vertical line would then be snapped into a bin by the scale transform.

Use stat_bin()

ggplot(df, aes(date)) +
  geom_histogram(binwidth = 7, closed = "right") +
  scale_x_date(date_breaks = "1 week")

The naive approach leaves the scale breaks and the bins unaligned (offset by 0.5 days here). Of course, this can be improved by specifying a bin boundary or by manually passing breaks, but that gets fiddly and fragile.
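For reference, a minimal sketch of the boundary variant mentioned above. It assumes that weekly date breaks fall on Mondays (2024-01-01 is one) and that the boundary has to be given in the scale's numeric units, i.e. days since 1970-01-01; `week_start` is just an illustrative name.

```r
library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

# Assumption: weekly scale breaks land on Mondays, so anchor the bins on a
# known Monday, expressed as days since the epoch.
week_start <- as.numeric(as.Date("2024-01-01"))

p <- ggplot(df, aes(date)) +
  geom_histogram(binwidth = 7, boundary = week_start, closed = "right") +
  scale_x_date(date_breaks = "1 week")

# Bin edges as computed by the stat; each should sit a whole number of
# weeks away from week_start.
bin_edges <- layer_data(p)$xmin
```

The fragility is visible here: the anchor date and the 7-day width both have to be kept in sync with `date_breaks` by hand.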

Since #5963 there's a better workaround:

ggplot(df, aes(date)) +
  geom_histogram(breaks = function(x) { scales::breaks_width("1 week")(as.Date(range(x))) }) +
  scale_x_date(date_breaks = "1 week")

Created on 2024-10-25 with reprex v2.1.1

which is the result I want. However, there's duplication of the breaks and transforms between the scale and the stat. Ideally I'd like a way to request stat_bin() to just use the scale's breaks.
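One way to limit the duplication with the current API, sketched under the assumption that the data range is known before the plot is built, is to compute a fixed vector of breaks once and pass it to both the stat and the scale. `week_breaks` is an illustrative name, not part of ggplot2.

```r
library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

# Compute weekly breaks once, from the data range, as a Date vector.
week_breaks <- scales::breaks_width("1 week")(range(df$date))

# Reuse the same fixed vector for the bin edges and for the axis breaks.
p <- ggplot(df, aes(date)) +
  geom_histogram(breaks = week_breaks) +
  scale_x_date(breaks = week_breaks)

# The lower bin edges reported by the stat, in days since the epoch.
bin_edges <- layer_data(p)$xmin
```

This still spells out the breaks outside the scale, so it is a workaround rather than the requested feature, and it does not adapt if further layers extend the data range.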

It's technically possible, since StatBin$compute_group() (where the bins are computed) already has access to the scale object, but I'm not sure if it violates any sort of ggplot API encapsulation principles to have the scale directly affecting the stat's output in the way I'm proposing.

The same situation applies for stat_bin_2d() and stat_summary_bin(). I'd be happy to open a PR if there's agreement about the idea. I'm imagining either new value/s for breaks or a new param mutually exclusive with breaks that lets users choose to use the corresponding scale's major or minor breaks for the stat's binning breaks.

@teunbrand
Collaborator

On the one hand, I like the idea. On the other hand, I don't think it can be implemented cleanly.

The issue is that scales recompute their ranges, which form the basis for the breaks, in between when the stats are calculated and when the graphics are drawn. It means that another layer can invalidate the breaks that are used for binning, and the scale ends up displaying different breaks. However, this should not be an issue if fixed breaks are used.
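The range dependence can be seen in isolation with the scales package. This is only a sketch: it assumes that an unconfigured date scale falls back to "pretty" breaks, and both date ranges are made up for illustration.

```r
library(scales)

# Assumption: this stands in for the default break function of a date scale.
pretty_dates <- breaks_pretty()

# Hypothetical range known to the scale when the stat computes its bins.
range_at_stat_time <- as.Date(c("2023-12-20", "2024-01-14"))
# Hypothetical range at draw time, after another layer has widened the axis.
range_at_draw_time <- as.Date(c("2023-12-20", "2024-06-01"))

breaks_at_stat_time <- pretty_dates(range_at_stat_time)
breaks_at_draw_time <- pretty_dates(range_at_draw_time)

# The two sets of breaks differ, so bins built from the first set would not
# line up with the breaks that end up on the axis.
unname(breaks_at_stat_time)
unname(breaks_at_draw_time)
```

With a fixed-width break function such as `breaks_width("1 week")`, the extra range would only add breaks rather than move them, which is why fixed breaks avoid the problem.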

To demonstrate the principle, we can make a quick and dirty extension that takes breaks from the scale. We see that it doesn't really work well because the computed bins are less wide than the full data and the final breaks end up different than the intermediate breaks.

library(ggplot2)
set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

StatBin2 <- ggproto(
  "StatBin2", StatBin,
  compute_panel = function(self, data, scales, breaks = NULL, ...) {
    breaks <- breaks %||% scales$x$get_transformation()$inverse(scales$x$get_breaks())
    ggproto_parent(StatBin, self)$compute_panel(data, scales, breaks = breaks, ...)
  }
)

p <- ggplot(df, aes(date)) +
  geom_histogram(stat = StatBin2)
p
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

However, this works fine when the breaks are fixed.

p + scale_x_date(breaks = "5 days")
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2024-10-25 with reprex v2.1.1

> I'm not sure if it violates any sort of ggplot API encapsulation principles to have the scale directly affecting the stat's output in the way I'm proposing.

I think the principle ggplot2 tries to adhere to is that scales and layers only communicate through the data and not directly with one another. On a personal level, I think it is fine to read out scale settings at the Stat$compute_group() stage, but not fine to write scale settings. Pre-computing breaks and setting these as the scale's breaks should not happen.
