Estimate a baseline for RMSE (bins) #157

Open

1DWalker opened this issue Jan 18, 2025 · 16 comments

@1DWalker
Contributor

For algorithms that do not globally adapt, RMSE (bins) remains a good metric since it is human interpretable and a perfect algorithm could reach an RMSE (bins) of near zero. But how low could it actually go? We can potentially estimate this value by bootstrapping on each user. For each bin, suppose a perfect model would predict the exact average of that bin. Then we can use bootstrapping on the bin and resample reviews to estimate a baseline error.

@Expertium
Contributor

Can you describe it step-by-step, please?

@Expertium
Contributor

Expertium commented Jan 18, 2025

Also, regarding estimating the lowest value. For AUC and logloss, it can be done in the following way:

  1. For each user, we will need alpha and beta to model the distribution of the true probability of recall as a beta distribution.
  2. Then, we sample values of probability of recall from that distribution.
  3. Then, we convert them into ones and zeros.
  4. Finally, let's assume that an "Oracle" algorithm literally knows the true values (which we have, we sampled them) of the probability of recall. Then we can calculate logloss and AUC.

The problem is estimating alpha and beta for each user. Perhaps you can think of something.

EDIT: to make it more clear, here's some code

        import numpy as np
        from sklearn.metrics import log_loss, roc_curve, auc  # assuming sklearn for the metrics

        predictions = np.random.beta(alpha, beta, n)
        # alpha and beta must be estimated for each collection, n is the number of reviews in that collection
        random = np.random.rand(n)  # uniform distribution
        real = np.where(predictions > random, 1, 0)
        # for example, if a value in 'predictions' (of the Oracle algorithm as I call it) is 0.99
        # then there is a 99% chance that 'real' will be 1
        # and a 1% chance that it will be 0

        logloss_value = log_loss(real, predictions)  # sklearn's log_loss takes (y_true, y_pred)
        fpr, tpr, thresholds_roc = roc_curve(real, predictions)
        auc_value = auc(fpr, tpr)

@1DWalker
Contributor Author

The idea is that for a bin with n samples and a success proportion of p, suppose the oracle always spits out p for every sample in this bin. Then we keep resampling n values, each with success probability p, and calculate RMSE (bins) against the real p.
Example: if a bin has [0, 0, 1, 1], the oracle predicts p = 0.5 every time. But we resample a batch of [0, 1, 1, 1] with bootstrapping and get a squared error of (0.5 - 0.75)^2 = 0.0625, or something like that. Do this procedure many times to get an estimate for a baseline. For RMSE (bins) we actually don't need to sample when bootstrapping since we are just working with the binomial distribution.
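Roughly, in code (a sketch with made-up bins; the count weighting here might not match exactly how RMSE (bins) weights bins in the benchmark):

    import numpy as np

    rng = np.random.default_rng(0)

    # made-up bins for one user: (proportion of successes p, number of reviews n)
    bins = [(0.5, 4), (0.9, 50), (0.7, 20)]
    total = sum(n for _, n in bins)

    # Monte Carlo version of the bootstrap described above
    n_resamples = 10_000
    mse = []
    for _ in range(n_resamples):
        se = 0.0
        for p, n in bins:
            resampled_mean = rng.binomial(n, p) / n  # resample n reviews, each succeeding with probability p
            se += n * (p - resampled_mean) ** 2      # weight each bin by its review count
        mse.append(se / total)
    print("Monte Carlo baseline RMSE (bins):", np.sqrt(np.mean(mse)))

    # no sampling needed: the expected squared error of a binomial mean is p * (1 - p) / n per bin,
    # and weighting by n cancels the 1 / n
    analytic = np.sqrt(sum(p * (1 - p) for p, n in bins) / total)
    print("Analytic baseline RMSE (bins):  ", analytic)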

This would not be a strict lower bound, since a better oracle would know the exact probability of each individual review rather than predicting the same p for every review within the bin, and bootstrapping with such an oracle would simulate each review's own probability as well. But still, a baseline without exact per-review probabilities helps us get a better understanding of how well our algorithms are doing and how much potential is left when optimizing for RMSE (bins).

For AUC and log loss I don't know; they seem to rely even more on oracles that know per-review probabilities. RMSE (bins) only cares about averages within a bin, while log loss doesn't lose information by aggregating like that.

@1DWalker
Contributor Author

Regarding the point about a better oracle with access to per-card probabilities, we can also check how the baseline would change by simulating our own per-review probabilities. If the baseline changes too much, then we can scrap the simpler baseline estimation method.
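Something like this, as a sketch (the per-review probabilities are drawn from an arbitrary beta distribution purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # one bin with n reviews; give each review its own hypothetical "true" probability of recall
    n = 50
    true_p = rng.beta(5, 1.2, n)
    p_bar = true_p.mean()  # the bin average, which is what the oracle predicts at the bin level

    n_resamples = 20_000
    # baseline 1: simulate the bin as if every review shared the bin average p_bar
    shared = rng.binomial(n, p_bar, n_resamples) / n
    # baseline 2: simulate each review with its own probability; the bin-level
    # prediction is still p_bar, but the resampled bin average spreads out less
    per_review = (rng.random((n_resamples, n)) < true_p).mean(axis=1)

    print("baseline with a shared p:   ", np.sqrt(np.mean((shared - p_bar) ** 2)))
    print("baseline with per-review p: ", np.sqrt(np.mean((per_review - p_bar) ** 2)))
    # the second value should come out smaller, since sum(p_i * (1 - p_i)) <= n * p_bar * (1 - p_bar)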

@Expertium
Contributor

Expertium commented Jan 18, 2025

Previously, I estimated alpha and beta in a very crude way:

  1. Take probabilities given by FSRS
  2. Plot a calibration graph, like this:

Image

Well, plotting isn't necessary, but I'm just trying to illustrate it. This is from FSRS v4, btw.
3. Use linear regression to obtain a trendline (instead of the jagged blue line).
4. Using the slope and intercept, correct the FSRS predictions to eliminate systematic errors. The idea is that there are two types of errors: systematic and random. We can eliminate the systematic ones. For a perfect algorithm (the orange line), slope=1 and intercept=0. So if we can adjust the FSRS probabilities in such a way that slope=1 and intercept=0 on the calibration graph, there will be no systematic errors.

    import statsmodels.api as sm

    # predict() and load_brier() are helper functions from the old benchmark code
    p, y = predict(w_list, testsets, file=file)  # p is the probability of recall predicted by FSRS
    brier = load_brier(p, y, bins=20)
    bin_prediction_means = brier["detail"]["bin_prediction_means"]
    bin_correct_means = brier["detail"]["bin_correct_means"]
    bin_counts = brier["detail"]["bin_counts"]
    mask = bin_counts > 0
    # weighted least squares: regress actual retention on predicted retention,
    # weighting each bin by its review count
    fit_wls = sm.WLS(
        bin_correct_means[mask],
        sm.add_constant(bin_prediction_means[mask]),
        weights=bin_counts[mask],
    ).fit()

    # invert the linear fit (params[0] = intercept, params[1] = slope) to remove the systematic error
    corrected_p = [((x - fit_wls.params[0]) / fit_wls.params[1]) for x in p]

This is some really old code from an old version, before we changed binning.
5. Then we hand-wavily assume that this is good enough and that we have the true distribution of probabilities of recall.
6. Then we do beta_parameters = beta.fit(corrected_p, floc=0.0, fscale=1.0, method="MLE"). This gives us alpha and beta.
7. Then we do the thing from my previous comment: sampling from the beta distribution of each user.

The idea is to take FSRS predictions, eliminate systematic errors via a linear transformation of its outputs, assume that the result is close enough to the true distribution of probabilities of recall, and then get alpha and beta from that distribution.
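For steps 6) and 7), the remaining part looks roughly like this (a sketch; corrected_p is the output of the snippet above, clipped into (0, 1) so the fit is well defined):

    import numpy as np
    from scipy.stats import beta

    # corrected_p comes from the snippet above
    corrected_p = np.clip(corrected_p, 0.0001, 0.9999)

    # step 6: fit a beta distribution to the corrected probabilities
    a, b, loc, scale = beta.fit(corrected_p, floc=0.0, fscale=1.0, method="MLE")

    # step 7: sample "true" probabilities of recall for the Oracle and simulate outcomes,
    # then compute logloss and AUC as in my earlier comment
    n = len(corrected_p)
    predictions = np.random.beta(a, b, n)
    real = np.where(predictions > np.random.rand(n), 1, 0)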

Btw, I got that the logloss of the Oracle would be 0.27 and the AUC of the Oracle would be 0.83.

@1DWalker
Contributor Author

Interesting. Do you have a similar graph for a recent FSRS version?

@Expertium
Contributor

Expertium commented Jan 18, 2025

You can plot one here: https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/v5.3.1/fsrs4anki_optimizer.ipynb

It's plotted after optimization

I wonder if we can get alpha and beta from the bins themselves, without using FSRS predictions. It would probably be more accurate.

I mean, if you look at this, it sure as hell looks like a beta distribution. So we can probably somehow skip the part with using algorithmic predictions and just use the bins.

Image

Here's a new one from the optimizer (not my collection, but whatever):

Image

Btw, here's a fun calculator: https://homepage.divms.uiowa.edu/~mbognar/applets/beta.html

Image

@1DWalker
Contributor Author

I feel like log loss is too sensitive to what an optimal oracle could actually do. An optimal oracle could perhaps be given a 24/7 feed of a user's life, know every emotion that user experiences, and have access to the neurons in their brain. Then it could predict with near certainty 1.00 or 0.00 for every single review. But this also affects my proposed RMSE (bins) baseline estimation: while the proposed method would predict a non-zero baseline, a perfect oracle would indeed achieve an RMSE (bins) of 0. I think we should just discard these ideas. It's too hard to separate systematic errors from things that are actually random for a reasonable, non-intrusive algorithm.

By the way, this maybe shows why a baseline + 1 power curve (+ ceil) works well for a neural network model. A single power curve probably works poorly for low R if it has to keep up with higher R values. But FSRS v4's curve is different from FSRS v4.5 onward, so maybe that is not the case.

@Expertium
Contributor

Expertium commented Jan 18, 2025

When I say "Oracle", I mean "an algorithm that knows the true probability of recall for a given card at a given time". I DO NOT mean "an algorithm that can peek into the human brain, check the status of each neuron, and output exactly 0.0 or 1.0 and nothing in between".

And I don't want to discard this because it's fun.

The problem is that the number (and choice) of bins will affect the estimates of alpha and beta, though. Anyway, can you take some random collection from the dataset and give me a list with the number of reviews in each bin, and a list of corresponding average retentions in each bin? Or a python dictionary instead of two lists, whatever. I want to do some curve-fitting.

@1DWalker
Contributor Author

I just fear that the baselines will be too inaccurate. As you suggest, there could be more systematic errors that aren't accounted for in a simple predicted R vs. actual R model. For example, we could use the bins from RMSE (bins) for a more fine-grained calibration. The possibilities are endless. In the end, the best baseline we get from this will be the best model that we can come up with. E.g., have you checked whether FSRS v4 combined with the calibration produces a better model than FSRS v4 alone? That is, do this calibration on the train set and see if it improves performance on the test set. If so, we may come up with an FSRS v4.1 that incorporates this calibration and is simply a better model. If it is somehow a worse model, then we have shown that calibration is unreliable for reducing systematic errors.

Well, I understand the point that it's fun; that's why I made this issue in the first place, and that's why we're working on srs-benchmark.

@1DWalker
Contributor Author

So you want something like {bin1: (0.90, 16), bin2: (0.95, 23), ...}?

@Expertium
Contributor

So you want something like {bin1: (0.90, 16), bin2: (0.95, 23), ...}?

Yep.

@Expertium
Contributor

Expertium commented Jan 18, 2025

That is, do this calibration on the train set and see if it improves performance on the test set. If so, we may come up with an FSRS v4.1 that incorporates this calibration and is simply a better model.

I'm not sure if that would be usable in practice, though.
EDIT: ok, no, it definitely wouldn't.
Take a look at this:

Image

Slope = 0.964, intercept = 0.030.

Now let's transform the probability given by FSRS in the following way:
corrected_p = (p-intercept)/slope

What will be the minimum and maximum of corrected_p?
minimum = (0-intercept)/slope = (0-0.03)/0.964 = -0.0311
maximum = (1-intercept)/slope = (1-0.03)/0.964 = 1.006

This means that we can get negative probabilities and/or probabilities greater than 1.
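A quick sanity check of those numbers:

    import numpy as np

    slope, intercept = 0.964, 0.030
    p = np.array([0.0, 0.5, 1.0])
    corrected_p = (p - intercept) / slope
    print(corrected_p)  # roughly [-0.0311, 0.4876, 1.0062]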

Btw, in my old code I had to do this:
corrected_p = np.clip(corrected_p, 0.0001, 0.9999)

Ok, maybe this approach is crap after all.

BUT, if we estimate the alpha and beta from the bins themselves, without algorithmic predictions, then we're good.

@1DWalker
Contributor Author

@Expertium https://github.com/1DWalker/srs-benchmark/tree/bin-stats

You can find the bin statistics for 20 users in the bin-stats folder, or you can work with the dicts directly in get_bin_stats.py, though that will take some time to process.

@Expertium
Contributor

Expertium commented Jan 18, 2025

Image

Alright, this looks promising. Kinda.

Code:

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import gamma

def betafunc(x, a, b, scale):
    # scaled beta distribution pdf: its shape is fitted to the histogram of bin retentions
    return scale * gamma(a + b) * (x ** (a - 1)) * ((1 - x) ** (b - 1)) / (gamma(a) * gamma(b))

# dictionary of {bin: (average_retention, n_reviews)} for one user
bin_data = ...

retentions = []
n_reviews = []
for value in bin_data.values():
    retentions.append(np.clip(value[0], 0.0001, 0.9999))  # keep retentions strictly inside (0, 1)
    n_reviews.append(value[1])

retentions = np.asarray(retentions)
n_reviews = np.asarray(n_reviews)

params, _ = curve_fit(
    betafunc, retentions, n_reviews,
    p0=(5, 1.2, 100),
    bounds=((0.01, 0.01, 0.01), (100, 100, 100_000)),
)
alpha = params[0]
beta = params[1]
total_reviews = np.sum(n_reviews)

Now, if you don't mind, I want you to use this code to calculate the last three things for each user (all 10k of them): alpha, beta and total_reviews. Then make a .jsonl file where each line looks like this: {'alpha': alpha, 'beta': beta, 'total_reviews': total_reviews}. If there are NaNs or curve_fit fails or something, just exclude those users.
Then I'll use the data from that .jsonl file to run simulations to calculate the logloss and AUC of an Oracle.
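Something along these lines (a sketch; user_bin_data, how it is loaded, and the output file name are placeholders, since the exact layout of the bin-stats files is up to you; betafunc is the function defined above):

    import json

    import numpy as np
    from scipy.optimize import curve_fit

    # user_bin_data: dict mapping user id -> bin dictionary in the same
    # {bin: (average_retention, n_reviews)} format as above
    with open("oracle_params.jsonl", "w") as f:  # output file name is arbitrary
        for user_id, bin_data in user_bin_data.items():
            retentions, n_reviews = [], []
            for value in bin_data.values():
                retentions.append(np.clip(value[0], 0.0001, 0.9999))
                n_reviews.append(value[1])
            retentions = np.asarray(retentions)
            n_reviews = np.asarray(n_reviews)
            try:
                params, _ = curve_fit(
                    betafunc, retentions, n_reviews,
                    p0=(5, 1.2, 100),
                    bounds=((0.01, 0.01, 0.01), (100, 100, 100_000)),
                )
            except (RuntimeError, ValueError):
                continue  # exclude users where the fit fails
            if np.any(np.isnan(params)):
                continue  # exclude users with NaNs
            f.write(json.dumps({
                "alpha": float(params[0]),
                "beta": float(params[1]),
                "total_reviews": int(np.sum(n_reviews)),
            }) + "\n")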

@Expertium
Contributor

Alright, I'll do it myself and then report the results.
