Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use new k threshold #235

Merged
merged 48 commits into from
Feb 15, 2024
Merged

use new k threshold #235

merged 48 commits into from
Feb 15, 2024

Conversation

avehtari
Copy link
Collaborator

@avehtari avehtari commented Dec 28, 2023

Fixes #234

Ping @n-kall

@avehtari avehtari requested a review from jgabry December 28, 2023 12:31
@jgabry
Copy link
Member

jgabry commented Jan 16, 2024

I haven't had a chance to go through this in detail, but looks like it causes several tests to fail. Also I wonder if there will be any issue with reverse dependencies, but we can check that later.

@jgabry
Copy link
Member

jgabry commented Jan 22, 2024

@avehtari CRAN wants a new release of loo before February 20th to fix some minor documentation issue that doesn't affect anything and is only flagged in the newest version of R. We can include this if we fix the tests and check reverse dependencies. I can check the reverse dependencies once everything is passing.

Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good, except for a few issues I commented on.

R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
@avehtari
Copy link
Collaborator Author

Yesterday, I started working on fixing the PR, and I'm sorry for being so sloppy the first time and wasting your time. As there were still some checks failing, I didn't push the changes that I had made so far. Thanks for the additional comments. l'll keep working on this today.

@avehtari
Copy link
Collaborator Author

  • more updates on the code and docs, so a new review would be good
  • I used attribute to pass the threshold along the pareto k table for printing, is this ok?
  • moment matching test results have changed due to changed threshold, did not yet update
  • loo_subampling has some tests errors I haven't yet been able to solve, but I'll keep working on those, but I expect that the tests need to be fixed

Copy link
Contributor

@n-kall n-kall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the doc changes, and made a few minor suggestions

R/loo-glossary.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/loo-glossary.R Outdated Show resolved Hide resolved
R/loo-glossary.R Outdated Show resolved Hide resolved
R/loo-glossary.R Outdated Show resolved Hide resolved
R/loo-glossary.R Outdated Show resolved Hide resolved
Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. I made a bunch of tiny suggestions.

R/psis.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
R/diagnostics.R Outdated Show resolved Hide resolved
graphics::points(x = if (use_n_eff) n_eff[k < 0.5] else k[k < 0.5],
col = clrs[k < 0.5], pch = 3, cex = .6)
graphics::points(x = if (use_n_eff) n_eff[k < threshold] else k[k < threshold],
col = clrs[k < threshold], pch = 3, cex = .6)
Copy link
Member

@jgabry jgabry Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested all the changes to the plotting code, but they look ok

@avehtari
Copy link
Collaborator Author

Tests are passing on my computer. I still need to write more documentation.

@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2024

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (dab6e58) 93.00% compared to head (258adc8) 92.41%.

❗ Current head 258adc8 differs from pull request most recent head 372cb92. Consider uploading reports for the commit 372cb92 to get more accurate results

Files Patch % Lines
R/diagnostics.R 91.89% 3 Missing ⚠️
R/loo.R 71.42% 2 Missing ⚠️
R/psis.R 71.42% 2 Missing ⚠️
R/effective_sample_sizes.R 50.00% 1 Missing ⚠️
R/print.R 96.55% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
- Coverage   93.00%   92.41%   -0.59%     
==========================================
  Files          30       30              
  Lines        2788     2823      +35     
==========================================
+ Hits         2593     2609      +16     
- Misses        195      214      +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@avehtari
Copy link
Collaborator Author

Improved docs, fixed a few more things in code

@jgabry
Copy link
Member

jgabry commented Jan 29, 2024

Great, looks like tests are passing on GitHub too, it's just R cmd check is failing due to a documentation issue: https://github.com/stan-dev/loo/actions/runs/7686962192/job/20946444630?pr=235#step:6:167. It might actually be that the doc just needs to be regenerated to match the code.

@avehtari
Copy link
Collaborator Author

avehtari commented Feb 2, 2024

All the vignettes have been now updated. I couldn't run loo2-non-factorized vignette as installation of spdep package failed (only data used from that package), but there shouldn't be any issues with that vignette.

@avehtari
Copy link
Collaborator Author

avehtari commented Feb 2, 2024

Should be complete now

Copy link
Contributor

@n-kall n-kall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through the vignette changes. Looks good, only found one thing

vignettes/loo2-lfo.Rmd Outdated Show resolved Hide resolved
Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one problem with the changes to the mixis vignette that will cause an error (see review comment) but otherwise these vignette updates look ok to me

vignettes/loo2-mixis.Rmd Outdated Show resolved Hide resolved
@avehtari
Copy link
Collaborator Author

avehtari commented Feb 7, 2024

Here is a proposal for a longer news item

There are two major changes in the new loo release

  1. Use of new sample size specific diagnostic threshold for Pareto k. Pre-2022 PSIS paper recommended diagnostic thresholds as k<0.5 "good", 0.5 <= k <0.7 "ok", 0.7 <= k < 1 "bad", k>=1 "very bad". 2022 PSIS paper recommends k < min(1 - 1/log10(S), 0.7) "good", min(1 - 1/log10(S), 0.7) <= k < 1 "bad", k > 1 "very bad", where S is the sample size. There is one diagnostic threshold less, and the most important threshold now depends on the sample size S. With sample sizes 100, 320, 1000, 2200, 10000 the sample size specific part 1 - 1/log10(S) correspond to thresholds 0.5, 0.6, 0.67, 0.7, 0.75. Even if the sample size grows, the bias in PSIS estimate dominates if 0.7 <= k < 1, and thus the diagnostic threshold for good is capped at 0.7 (if k>1, mean does not exist and bias is not valid measure). The new recommended thresholds are based on more careful bias-variance analysis of PSIS based on truncated Pareto sums theory. For those who use the Stan default 4000 posterior draws the 0.7 threshold will be the same, but there will be less warnings as there will be no diagnostic message for 0.5 < k 0.7. Those who use small sample sizes, may see diagnostic messages with threshold less than 0.7, and they can simply increase the sample size to about 2200 to get the threshold 0.7.

  2. There are no more warnings if r_eff argument is not provided, and the default is now 1. The summary print output showing MCSE and ESSs shows diagnostic information on the range of r_eff. The change was made to reduce unnecessary warnings. Use of r_eff does not change the expected value of elpd_loo, p_loo, and Pareto k', and is needed only to estimate MCSE and ESSs. Thus it is better to show the diagnostic information about the used r_eff only when MCSE and ESS values are shown.

@jgabry
Copy link
Member

jgabry commented Feb 7, 2024

Here is a proposal for a longer news item

These look good. I made a few tiny edits and added them to the NEWS file. I will update the rest of the NEWS file to reflect some other changes we made in other PRs before releasing.

Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of anything else we need to do before merging this. @avehtari or @n-kall, do you have anything else or should I go ahead and merge? (I will still wait about a week to do the release.)

@n-kall
Copy link
Contributor

n-kall commented Feb 7, 2024

do you have anything else or should I go ahead and merge?

Nothing that I can think of from my side. I think the next related thing will be to update the documentation for posterior::pareto_khat (and others) to match these updates, but I don't see any necessary changes needed in loo for that

@jgabry
Copy link
Member

jgabry commented Feb 8, 2024

Nothing that I can think of from my side.

Ok cool. And thanks for helping to review this.

I think the next related thing will be to update the documentation for posterior::pareto_khat (and others) to match these updates, but I don't see any necessary changes needed in loo for that

Sounds good, thanks.

@jgabry
Copy link
Member

jgabry commented Feb 13, 2024

@avehtari If I submit a bayesplot release with Teemu's fix for his recent PR, is that sufficient to fix the issue with the pit plots in the loo vignette or does the vignettes need additional editing after that? I think I will go ahead and submit that patch release of bayesplot today or tomorrow (it only contains that one bug fix).

@avehtari
Copy link
Collaborator Author

Teemu's bayesplot PR and fix for discrete data PIT affects ppc_pit_ecdf*, but not ppc_loo_pit_qq. Teemu's rsantools PR stan-dev/rstantools#121 fixes loo_pit and thus fixes ppc_loo_pit_qq used in loo2-example.Rmd

@jgabry
Copy link
Member

jgabry commented Feb 14, 2024

Oh ok thanks, that makes sense.

@jgabry
Copy link
Member

jgabry commented Feb 15, 2024

Merging now!

@jgabry jgabry merged commit ce0f540 into master Feb 15, 2024
6 checks passed
@jgabry jgabry deleted the new-pareto-k-threshold branch February 15, 2024 16:39
@avehtari avehtari mentioned this pull request Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't warn about slightly large khats
4 participants