Add discrete_lognormal and discrete_normal #16
Conversation
…etter way to avoid doing so?
…n the same fashion as `greta`
…receive improvement once greta-dev/greta#532 is resolved; we may also want to consider generalising this distribution by allowing floor, round, and ceiling
When I naively copy …
I reckon it's because of greta.distributions/R/discrete_normal.R Line 54 in 77a76a2 and greta.distributions/R/discrete_normal.R Line 87 in 77a76a2, but I'm not sure how to proceed now.
Hi @njtierney, I have tested …
I can't see anything obvious that would explain discrete_lognormal() not recovering the parameters when used in a model. To debug that, I would recommend evaluating the density side by side with an R comparison, to make sure it's doing what you would expect.
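For instance, a minimal R-side check might look like this (the breaks and parameter values below are just placeholders, not your actual model):
# direct R computation of the binned lognormal density, to compare
# against what the greta model evaluates
breaks <- 0:5
mulog <- 0
sdlog <- 1
cdf_vals <- plnorm(breaks, meanlog = mulog, sdlog = sdlog)
bin_probs <- diff(cdf_vals)  # P(breaks[i] <= Y < breaks[i + 1])
log(bin_probs)               # compare with the model's per-bin log-density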
R/discrete_normal.R (outdated):
)
}

# handle gradient issue between sdlog and 0s
you may not need this at all - I think this was only in the lognormal distribution because of a discrepancy between how TF and R evaluate the lognormal density at 0 (or lower). In R, dlnorm(0) evaluates to 0; in TF I think it's either an error or NaN. That shouldn't be a problem for the normal, since it has support on all real numbers, not just positive (or non-zero) numbers.
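For reference, R's behaviour at and below zero:
dlnorm(0)
#> [1] 0
dlnorm(-1)
#> [1] 0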
oh hang on. Now that I think about it, leaving this in could be what is biasing your estimates of the parameters, since the model gains no knowledge of any values lower than ~0. So maybe deleting the next line will help
Thanks for taking a look. If I simply delete breaks <- pmax(breaks, .Machine$double.eps), I get:
Error in py_call_impl(callable, dots$args, dots$keywords) :
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,2739,0] = -2147483648 is not in [0, 50)
<... omitted ...>
Errors may have originated from an input operation.
Input Source operations connected to node GatherV2:
Const_4 (defined at /framework/constant_op.py:180)
Original stack trace for 'GatherV2':
File "/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/ops/array_ops.py", line 3475, in gather
return gen_array_ops.gather_v2(params, indices, axis, name=name)
File "/ops/gen_array_ops.py", line 4097, in gather_v2
batch_dims=batch_dims, name=name)
File "/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
See `reticulate::py_last_error()` for details
From the top lines it seems that we have negative values when we shouldn't. Sorry, I am not familiar with the TensorFlow code but will try my best to fix it. I'm also not sure if the CDF section could be causing problems, especially this:
x_safe <- tf$math$maximum(x, fl(.Machine$double.eps))
For completeness, with the latest commit, the code below gets posteriors of mulog and sdlog quite far from the defaults of 0 and 1 in rlnorm, respectively. (I am also unsure whether the unintended constraint makes the sampler get stuck sometimes.) From the code:
library(greta.distributions)
y <- rlnorm(10000)
y_r <- round(y)
mulog <- normal(0, 10)
sdlog <- exponential(10)
distribution(y_r) <- discrete_lognormal(mulog, sdlog, breaks = seq(0, ceiling(max(y)), 1)) # not sure if max(y) is legit
mod <- model(mulog, sdlog)
draws <- mcmc(mod, warmup = 2000)
plot(draws)
The posteriors look like this: [plots not shown]
I'm confused. Did you make these changes to the lognormal instead of the normal?
Sorry, yes, I made these changes to the lognormal, not the normal. I copied the snippets from the wrong script. Just to be clear, I deleted breaks <- pmax(breaks, .Machine$double.eps) in the lognormal case. They are corrected now in the previous comment.
Alright, thanks for clarifying again. I removed pmax(breaks, ...) from the normal case, then ran the following test for normal:
library(greta.distributions)
y <- rnorm(10000)
y_r <- round(y)
mu <- normal(0, 10)
sd <- exponential(1)
distribution(y_r) <- discrete_normal(mu, sd, breaks = seq(floor(min(y)), ceiling(max(y)), 1))
mod <- model(mu, sd)
draws <- mcmc(mod)
plot(draws)
but the posterior mean and sd are far from 0 and 1, respectively. Could this be due to the x_safe <- tf$math$maximum(x, fl(.Machine$double.eps)) part?
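As a plain-R illustration of why that clamp would matter for a distribution with support below zero (the break values here are made up):
breaks <- c(-2, -1, 0, 1, 2)
pnorm(breaks)                             # intended CDF values
pnorm(pmax(breaks, .Machine$double.eps))  # clamped: every negative break maps to ~0.5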
Also, if I instead shift the true normal mean to a more negative value and simulate data as y <- rnorm(10000, -10), I get a similar error to the lognormal case:
Error in py_call_impl(callable, dots$args, dots$keywords) :
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0,0] = -2147483648 is not in [0, 8)
[[node GatherV2 (defined at /util/dispatch.py:180) ]]
Errors may have originated from an input operation.
Input Source operations connected to node GatherV2:
Const_4 (defined at /framework/constant_op.py:180)
Original stack trace for 'GatherV2':
File "/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/ops/array_ops.py", line 3475, in gather
return gen_array_ops.gather_v2(params, indices, axis, name=name)
File "/ops/gen_array_ops.py", line 4097, in gather_v2
batch_dims=batch_dims, name=name)
File "/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_s
Ah yes, good catch. I see now that you had flagged both of these already! tf_safe_cdf is hard-coded to have (0,Inf) support. It will need to be modified to pass in the bounds of the distribution. So the fl(0) bit will need to be changed to fl(lower_bound), and the lower bound passed in via an argument. For completeness, the upper bound should be passed in too.
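A rough sketch of what a bounds-aware version might look like (the argument names and exact clamping here are assumptions, not the final implementation):
tf_safe_cdf <- function(x, d, lower_bound, upper_bound) {
  # flag evaluation points outside the support
  too_low <- x < lower_bound
  too_high <- x > upper_bound
  # clamp x into the support so the TF cdf never sees invalid inputs
  x_safe <- tf$math$minimum(tf$math$maximum(x, lower_bound), upper_bound)
  cdf <- d$cdf(x_safe)
  # the CDF is exactly 0 below the support and exactly 1 above it
  cdf <- tf$where(too_low, tf$zeros_like(cdf), cdf)
  tf$where(too_high, tf$ones_like(cdf), cdf)
}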
I tried to implement your suggestion, as well as removing x_safe <- tf$math$maximum(x, fl(.Machine$double.eps)), in 15cc772, though I'm not 100% sure what I'm doing 😅
Now sampling the model below
y <- rnorm(10000, 0, 1)
y_r <- round(y)
mu <- normal(0, 10)
sd <- exponential(1)
distribution(y_r) <- discrete_normal(mu, sd, breaks = seq(floor(min(y_r))-1, ceiling(max(y_r))+1, 1))
mod <- model(mu, sd)
yields the posteriors [plots not shown].
Now sd is much closer to 1 (though mildly positively biased in all examples below), but mu has a curious mode near 0.5. It gets curiouser if we shift the true mean to, say, 2 with y <- rnorm(10000, 2, 1) [plot not shown]. If we shift the true mean to a more negative value, y <- rnorm(10000, -10, 1) [plot not shown], it seems that the estimated mu tends to be positively biased by 0.5... Maybe I made some mistakes in tf_safe_cdf...
P.S. Not sure if this is related, but sometimes a large mean or sd magnitude (e.g., y <- rnorm(10000, 0, 10) or y <- rnorm(10000, 10, 1)) will result in
> draws <- mcmc(mod)
Error: Could not find reasonable starting values after 20 attempts.
Please specify initial values manually via the `initial_values` argument
but from the error message I wasn't able to debug further...
Sorry, I just want to quickly add this in case it is helpful: the 0.5 bias could be due to the way I specify the integer breaks, seq(from, to, by = 1), given that an unrounded variable z is related to a rounded variable y as z in (y - 0.5, y + 0.5)... not sure if this is relevant.
I think the culprit may be breaks and how it is used in tf_safe_cdf. Consider a continuous variable that is recorded/rounded as the integers c(-2, -1, 0, 1, 2). When we specify discrete_normal(..., breaks = c(-2, -1, 0, 1, 2)), tf_safe_cdf will evaluate the CDF at these breaks, but don't we want it to evaluate at breaks - 0.5 and breaks + 0.5? In other words, evaluating the CDF at breaks implies that the discretisation happens between, rather than at, the breaks?
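In plain R, the rounding model I have in mind would be:
# for data rounded to integers k, P(Y = k) = F(k + 0.5) - F(k - 0.5)
k <- -2:2
p <- pnorm(k + 0.5) - pnorm(k - 0.5)
sum(p)  # 0.988 here; approaches 1 as the breaks span more of the support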
I pushed an experimental discrete_normal (8676719) that also requires an additional argument, edges, which has one more element than breaks and gives the bounds of the breaks values. For the example above we would specify edges = c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5). This feels too hard-coded / manual / hard to explain at the moment, and my code can probably be refactored. But playing with it yields a posterior of the mean that is much closer to the empirical mean:
library(greta.distributions)
y <- rnorm(10000, 0, 1)
y_r <- round(y)
mu <- normal(0, 10)
sd <- exponential(1)
breaks_diff <- 1
breaks <- seq(floor(min(y_r)), ceiling(max(y_r)), breaks_diff)
edges_min <- min(y_r) - breaks_diff/2
edges_max <- breaks[length(breaks)] + breaks_diff/2
edges <- c(edges_min, breaks[-length(breaks)] + breaks_diff/2, edges_max)
distribution(y_r) <- discrete_normal(mu, sd, breaks = breaks, edges = edges)
mod <- model(mu, sd)
draws <- mcmc(mod, warmup = 2000)
plot(draws)
Compare the empirical mean to the posterior summary.
> mean(y)
[1] -0.006322464
> summary(draws)
Iterations = 1:1000
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu -0.008201 0.010377 0.0001641 0.0001913
sd 1.002026 0.007522 0.0001189 0.0001978
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu -0.02856 -0.01492 -0.008117 -0.001294 0.01179
sd 0.98783 0.99688 1.002065 1.007069 1.01686
R/discrete_normal.R (outdated):
)
continuous <- d$sample(seed = seed)
# tf$floor(continuous)
tf$round(continuous)
This assumes that the breaks are all the integers. I think it should instead be tf$gather()ed at the breaks. I think this is also a mistake in my implementation of the discrete lognormal (though I never used the sampling in that code).
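Something along these lines, perhaps (tf_breaks and n_breaks are hypothetical names for a tensor of the user's breaks and its length; shapes and the exact binning rule are glossed over):
continuous <- d$sample(seed = seed)
# index of the bin each continuous draw falls into (assumes sorted breaks)
idx <- tf$searchsorted(tf_breaks, continuous)
idx <- tf$clip_by_value(idx, 0L, n_breaks - 1L)
# snap each draw onto its break value instead of rounding to an integer
tf$gather(tf_breaks, idx)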
For rounded data, we assume that a user will specify breaks like seq(from, to, by = 1). But for general use we also want to allow something irregular like c(0, 1, 10, 100) to work. For the latter, I'm not sure how to gather a continuous variable at the breaks. For instance, do we gather 50.67 into the 10 or the 100 break point? Does this entail an additional argument for discrete_normal to gather values into the lower or upper break point? But I like the gather idea because it allows discretisation into non-integer breaks like c(-0.2, -0.1, 0, 0.1, 0.2), whereas floor or round discretises into integers only (without tweaks, as suggested here).
In 49f6d0e I tried my best to adapt the tf$gather() chunk for breaks to discretise the continuous samples into the user-defined breaks, rather than using round or floor. I thought this might solve the weird +0.5 estimation bias, but it didn't... But at least we know it isn't due to round vs. gather (if I coded it correctly).
… in discrete_normal to remove the positive constrain
R/zzz.R (outdated):

# create the node list object whenever the package is loaded
.onLoad <- function(libname, pkgname) { # nolint

  # unset reticulate python environment, for more details, see:
  # https://github.com/greta-dev/greta/issues/444
  Sys.unsetenv("RETICULATE_PYTHON")

  if (have_greta_conda_env()) {
    use_greta_conda_env()
  }

  # silence TF's CPU instructions message
  Sys.setenv(TF_CPP_MIN_LOG_LEVEL = 2)

  # silence messages about deprecation etc.
  # disable_tensorflow_logging()

  # warn if TF version is bad
  # check_tf_version("startup")

  # switch back to 0-based extraction in tensorflow, and don't warn about
  # indexing with tensors
  options(tensorflow.one_based_extract = FALSE)
  options(tensorflow.extract.warn_tensors_passed_asis = FALSE)

  # default float type
  options(greta_tf_float = "float64")
}
I'll need to test this, but I assume this isn't needed, since this is called in greta, and greta.distributions has a dependency on greta, so it effectively does library(greta) when it is loaded. I believe you will need the tfp part, however, as that is not exported from greta.
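If so, the hook might shrink to something like this sketch (following the usual reticulate delayed-import pattern; the object name tfp is an assumption):
# placeholder filled in at load time
tfp <- NULL

.onLoad <- function(libname, pkgname) { # nolint
  # tensorflow-probability is not exported from greta, so import it here
  tfp <<- reticulate::import("tensorflow_probability", delay_load = TRUE)
}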
discrete_lognormal(p = 1, psi = 1, breaks = 1)
)

# check dim to be a positive scalar integer
expect_snapshot_error(
  discrete_lognormal(p = 1, psi = 1, breaks = c(1, 2), dim = 0.1)
Shouldn't the parameters for discrete_lognormal be meanlog and sdlog? Currently this will capture an error message about the parameter names not being correct.
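i.e., presumably something like (hypothetical parameter values):
expect_snapshot_error(
  discrete_lognormal(meanlog = 1, sdlog = 1, breaks = c(1, 2), dim = 0.1)
)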
Oops, that was me copy-pasting mindlessly. But while I was at it, I noticed that a non-integer in the dim argument, discrete_lognormal(p = 1, psi = 1, breaks = c(1, 2), dim = 0.1), did not error. It seems that a non-integer dim will generate the dimension it is floored to (whatever the value before the decimal point is). For example
> normal(0, 1, dim = 0.9)
greta array (variable following a normal distribution)
[,1]
and
> normal(0, 1, dim = 2.9)
greta array (variable following a normal distribution)
[,1]
[1,] ?
[2,] ?
I think this mistake rarely happens day-to-day, so I'm not sure if this should be checked in greta itself.
Hmmm! We should check this, as dimension should strictly be an integer. One way I can see this coming up is where people programmatically change parameter values for generating arrays and perhaps the wrong parameter gets passed along. We could write a little checker for integer values - what do you think @goldingn? Perhaps something like:
0.1 %% 1
#> [1] 0.1
1.1 %% 1
#> [1] 0.1
1 %% 1
#> [1] 0
2 %% 1
#> [1] 0
check_is_integer <- function(x){
is_int <- (x %% 1) == 0
if (!is_int){
msg <- cli::format_error(
c(
"{.val x} must be an integer",
"However {.val x} is {.val {x}}"
)
)
rlang::abort(
msg
)
}
}
check_is_integer(1)
check_is_integer(1.1)
#> Error in `check_is_integer()`:
#> ! "x" must be an integer
#> However "x" is 1.1
Created on 2022-08-04 by the reprex package (v2.0.1)
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.0 (2022-04-22)
#> os macOS Monterey 12.3.1
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Perth
#> date 2022-08-04
#> pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.3.0.9000 2022-06-15 [1] Github (r-lib/cli@31a5db5)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0)
#> knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.2.0)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.2.0)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.2.0)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0)
#> rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.0)
#> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> tibble 3.1.7 2022-05-03 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Thanks so much for your work on this PR, @hrlai! Really well executed; it will be very useful for us to see how you have added these distributions, and it should provide a great guideline for how to submit a PR for new distributions. I've given some higher-level comments on some of the code. All very minor! Regarding the distribution being off by 0.5 - the only thing I can think of is that we need to double-check Python indexing or something strange like that? We've recently been burned by that in migrating over to TF2. But that's probably not it, just a thought!
…ontinuous variable into user-defined breaks, so we do not assume a support for all integers while generalising the discretisation to non-regular breaks
…n of CDFs; breaks will still be used for discretisation; this version of discrete_normal requires users to specify both breaks and edges; experimental and be prepared to roll back to b9a412f
# note-to-self: this looks like https://mc-stan.org/docs/2_29/stan-users-guide/bayesian-measurement-error-model.html#rounding
low <- tf_safe_cdf(tf_lower_vec, d, tf_lower_bound, tf_upper_bound)
up <- tf_safe_cdf(tf_upper_vec, d, tf_lower_bound, tf_upper_bound)
log_density <- log(up - low)
When the simulated data has high variance, e.g., rnorm(..., sd = 10), I often get this error:
> draws <- mcmc(mod)
Error: Could not find reasonable starting values after 20 attempts.
Please specify initial values manually via the `initial_values` argument
The problem may lie in log_density <- log(up - low). My gut feeling is that high variance generates more widely spaced data, so when breaks are regularly spaced integers (see tf_idx), some bins have zero elements, and then tf_safe_cdf has nothing to evaluate? Then we end up taking the log of zero... ???
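A plain-R illustration of that hunch (a made-up bin far out in the tail): the bin probability underflows to exactly zero in double precision, so the log-density is -Inf and the sampler cannot initialise.
up <- pnorm(40 + 0.5)   # 1 in double precision
low <- pnorm(40 - 0.5)  # also 1 in double precision
log(up - low)
#> [1] -Inf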
When there is a chance to revisit this, make sure to go over https://doi.org/10.1007/s13198-021-01443-x and see if …
Merge commit 'dee60f024c0c3df7c4102f783ca7a91b2c078da7'
# Conflicts:
#   DESCRIPTION
#   NAMESPACE
#   R/package.R
#   R/zzz.R
In order to finish merging this I'd like to make sure we've got a reference distribution/test for discrete lognormal and discrete normal. It looks like we can do this for discrete normal via extraDistr: https://search.r-project.org/CRAN/refmans/extraDistr/html/DiscreteNormal.html
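A sketch of what such a reference test could check (assuming extraDistr's convention, which I believe defines P(X = x) = F(x + 1) - F(x), so our breaks/edges would need to be chosen to match it):
library(extraDistr)
k <- -3:3
ref <- ddnorm(k, mean = 0, sd = 1)  # extraDistr reference pmf
ours <- pnorm(k + 1) - pnorm(k)     # same quantity straight from the normal CDF
all.equal(ref, ours)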
This is a draft pull request 🔩 addressing #7. I'm experimenting with this draft feature 😉
Quick summary:

- `discrete_lognormal`, adapted from `greta`, now seems to work, and `calculate` works on it.

To do soon:

- Check if `discrete_lognormal` works with `mcmc`: it isn't supposed to (?) because HMC cannot sample discrete space; I got the error as intended: Error: model contains a discrete random variable that doesn't have a fixed value, so inference cannot be carried out
- Check `discrete_lognormal` works when assigned to `distribution(y)`
- Check `discrete_lognormal` works when it has hyperparameters
- Implement a check to make sure `max(breaks)` cannot be less than the maximum of the data. To do later... currently, if `max(breaks)` is less than `max(y)`, users will get an error that isn't very helpful
- `tf$round` instead of `tf$floor`? See also "Make `greta::round` use tensorflow digits arg instead of warning" (greta#532)
- `discrete_normal`