Estimate a baseline for RMSE (bins) #157
Can you describe it step-by-step, please? |
Also, regarding estimating the lowest value: for AUC and log loss, it can be done in the following way:
The problem is estimating alpha and beta for each user. Perhaps you can think of something. EDIT: to make it clearer, here's some code
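(The original snippet didn't survive in this copy of the thread. As a rough illustration only, with the Beta-oracle setup and all names being my assumptions, a simulation along these lines could estimate the floor for AUC and log loss once alpha and beta are known:)

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Sketch, not the original code: assume each review's true probability of recall
# is drawn from Beta(alpha, beta). An oracle that knows these exact per-review
# probabilities still can't score better than this on log loss / AUC.
rng = np.random.default_rng(42)
alpha, beta = 4.0, 1.0      # hypothetical per-user parameters
n_reviews = 100_000

true_p = rng.beta(alpha, beta, size=n_reviews)   # per-review recall probabilities
outcomes = rng.binomial(1, true_p)               # simulated review outcomes (0/1)

print("oracle log loss:", log_loss(outcomes, true_p))
print("oracle AUC:", roc_auc_score(outcomes, true_p))
```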
|
The idea is that for a bin with n samples and a proportion p of successes, suppose the oracle always spits out p for every sample in this bin. Then we keep resampling n values, each with probability p of success, and calculate RMSE (bins) against the real p. This would not be a strict lower bound, since a better oracle would know the exact probability of each individual review rather than predicting the same p for every review within the bin; when bootstrapping with such an oracle, we would also simulate each review's individual probability. Still, a baseline without exact per-review probabilities helps us understand how well our algorithms are doing and how much potential is left when optimizing for RMSE (bins). For AUC and log loss I don't know; they seem to rely even more on oracles that know per-review probabilities. RMSE (bins) only cares about averages within a bin, while log loss doesn't lose information from aggregating like that. |
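For concreteness, here is a minimal sketch of that resampling procedure; the bin proportions, counts, and the count-weighted form of RMSE (bins) used here are assumptions made for illustration:

```python
import numpy as np

# Sketch of the per-bin resampling idea: the "oracle" predicts the bin's own
# proportion p for every review in that bin, and we resample outcomes to see
# how far the resampled proportion drifts from p.
rng = np.random.default_rng(0)
bins = [(0.70, 40), (0.85, 120), (0.95, 300)]  # hypothetical (p, n) per bin
n_resamples = 10_000

total_n = sum(n for _, n in bins)
rmse_samples = []
for _ in range(n_resamples):
    sq_err = 0.0
    for p, n in bins:
        resampled_p = rng.binomial(n, p) / n   # proportion of successes in a resampled bin
        sq_err += n * (resampled_p - p) ** 2   # count-weighted squared error (assumption)
    rmse_samples.append(np.sqrt(sq_err / total_n))

print(f"estimated RMSE (bins) baseline: {np.mean(rmse_samples):.4f}")
```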
Regarding the point about a better oracle with access to per-card probabilities, we can also simulate how the baseline would change by simulating our own probabilities. If the baseline changes too much then we can scrap the simpler baseline estimation method. |
Previously, I estimated alpha and beta in a very crude way:
Well, plotting isn't necessary, but I'm just trying to illustrate it. This is from FSRS v4, btw.
This is some really old code from an old version, before we changed binning. The idea is to take FSRS predictions, eliminate systematic errors via a linear transformation of its outputs, assume that the result is close enough to the true distribution of probabilities of recall, and then get alpha and beta from that distribution. Btw, I got that the log loss of the Oracle would be 0.27 and the AUC of the Oracle would be 0.83. |
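The old code itself isn't shown above; the description maps onto something like this sketch, where the direction of the linear calibration and the moment-matched Beta fit are my reading of it, not the original implementation:

```python
import numpy as np

def fit_beta_from_predictions(pred_r, outcomes, eps=1e-4):
    """Sketch, not the original code: linearly calibrate FSRS predictions
    against the 0/1 review outcomes, then moment-match a Beta(alpha, beta)
    to the corrected predictions."""
    pred_r = np.asarray(pred_r, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)

    # 1. Remove systematic error with a least-squares line of outcomes on predicted R.
    slope, intercept = np.polyfit(pred_r, outcomes, 1)
    corrected = np.clip(slope * pred_r + intercept, eps, 1 - eps)

    # 2. Method-of-moments estimate of Beta(alpha, beta) from the corrected values.
    m, v = corrected.mean(), corrected.var()
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common  # alpha, beta
```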
Interesting. Do you have a similar graph for a recent FSRS version? |
You can plot one here: https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/v5.3.1/fsrs4anki_optimizer.ipynb It's plotted after optimization. I wonder if we can get alpha and beta from the bins themselves, without using FSRS predictions. It would probably be more accurate. I mean, if you look at this, it sure as hell looks like a beta distribution. So we can probably skip the part with algorithmic predictions and just use the bins. Here's a new one from the optimizer (not my collection, but whatever): Btw, here's a fun calculator: https://homepage.divms.uiowa.edu/~mbognar/applets/beta.html |
I feel like log loss is too sensitive to what an optimal oracle could actually do. An optimal oracle could perhaps be given a 24/7 feed of a user's life, know every emotion that user experiences, and have access to the neurons in their brain. Then it could predict 1.00 or 0.00 with near certainty for every single review. But this also affects my proposed RMSE (bins) baseline estimation; while the proposed method would predict a non-zero baseline, a perfect oracle would indeed achieve an RMSE (bins) of 0. I think we should just discard these ideas. It's too hard to separate systematic errors from things that are actually random for a reasonable, non-intrusive algorithm. By the way, this maybe shows why |
When I say "Oracle", I mean "an algorithm that knows the true probability of recall for a given card at a given time". I DO NOT mean "an algorithm that can peek into the human brain, check the status of each neuron, and output exactly 0.0 or 1.0 and nothing in between". And I don't want to discard this because it's fun. The problem is that the number (and choice) of bins will affect the estimates of alpha and beta, though. Anyway, can you take some random collection from the dataset and give me a list with the number of reviews in each bin, and a list of corresponding average retentions in each bin? Or a Python dictionary instead of two lists, whatever. I want to do some curve-fitting. |
I just fear that the baselines will be too inaccurate. As you suggest, there could be more systematic errors that aren't accounted for in a simple predicted R vs. actual R model. For example, we could use the bins from RMSE (bins) for a more fine-grained calibration. The possibilities are endless. In the end, the best baseline that we get from this will be the best model that we can come up with. E.g., have you checked whether FSRS v4 combined with the calibration produces a better model than FSRS v4 alone? That is, do the calibration on the train set and see if it improves performance on the test set. If so, we may come up with an FSRS v4.1 that incorporates this calibration and is simply a better model. If it is somehow a worse model, then we have shown that calibration is unreliable for reducing systematic errors. Well, I understand the point that it's fun; that's why I made this issue in the first place. That's why we're working on srs-benchmark. |
So you want something like {bin1: (0.90, 16), bin2: (0.95, 23), ...}? |
Yep. |
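As a sketch of how such a dictionary could be assembled, assuming reviews are bucketed by rounding the predicted probability of recall (srs-benchmark's actual binning may differ):

```python
import numpy as np

def bin_stats(pred_r, outcomes, decimals=2):
    """Sketch: bucket reviews by rounded predicted probability of recall and
    return {bin: (average_retention, n_reviews)} in the format discussed above."""
    stats = {}
    for p, y in zip(np.round(pred_r, decimals), outcomes):
        total, n = stats.get(p, (0.0, 0))
        stats[p] = (total + y, n + 1)
    return {p: (total / n, n) for p, (total, n) in stats.items()}
```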
I'm not sure if that would be usable in practice, though. Slope = 0.964, intercept = 0.030. Now let's transform the probability given by FSRS in the following way: What will be the minimum and maximum of corrected_p? This means that we can get negative probabilities and/or probabilities greater than 1. Btw, in my old code I had to do this: Ok, maybe this approach is crap after all. BUT, if we estimate the alpha and beta from the bins themselves, without algorithmic predictions, then we're good. |
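The snippet alluded to ("I had to do this") isn't shown here; a minimal sketch of that kind of fix, where the epsilon and the example values are assumptions:

```python
import numpy as np

# Sketch: a linear correction can push probabilities outside [0, 1],
# so clamp them back into (0, 1); the epsilon is an arbitrary choice.
eps = 1e-6
corrected_p = np.array([-0.02, 0.45, 0.87, 1.01])  # hypothetical out-of-range values
corrected_p = np.clip(corrected_p, eps, 1 - eps)
```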
@Expertium https://github.com/1DWalker/srs-benchmark/tree/bin-stats You can find the bin statistics for 20 users in the |
Alright, this looks promising. Kinda. Code:
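(The code referred to here isn't reproduced in this copy of the thread. A hedged sketch of a moment-matched fit from the bin statistics, producing the three quantities requested just below, might look like this; the input format and the count weighting are assumptions:)

```python
import numpy as np

def fit_beta_from_bins(bins):
    """Sketch, not the original code: moment-match a Beta(alpha, beta) to the
    per-bin average retentions, weighting each bin by its review count.
    `bins` is the {bin: (avg_retention, n_reviews)} structure from above."""
    retention = np.array([r for r, _ in bins.values()])
    counts = np.array([n for _, n in bins.values()])

    mean = np.average(retention, weights=counts)
    var = np.average((retention - mean) ** 2, weights=counts)
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common, int(counts.sum())  # alpha, beta, total_reviews

# Example with made-up bins:
example = {0.90: (0.90, 16), 0.95: (0.95, 23), 0.99: (0.97, 40)}
print(fit_beta_from_bins(example))
```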
Now, if you don't mind, I want you to use this code to calculate the last three things for each user (all 10k of them): alpha, beta, and total_reviews. Then make a .jsonl file where the output looks like this: |
Alright, I'll do it myself and then report the results. |
For algorithms that do not globally adapt, RMSE (bins) remains a good metric since it is human-interpretable and a perfect algorithm could reach an RMSE (bins) of near zero. But how low could it actually go? We can potentially estimate this value by bootstrapping on each user. For each bin, suppose a perfect model would predict the exact average of that bin. Then we can use bootstrapping on the bin and resample reviews to estimate a baseline error.