Estimate a baseline for RMSE (bins) #157

Open

1DWalker opened this issue Jan 18, 2025 · 16 comments

@1DWalker
Contributor

For algorithms that do not globally adapt, RMSE (bins) remains a good metric since it is human interpretable and a perfect algorithm could reach an RMSE (bins) of near zero. But how low could it actually go? We can potentially estimate this value by bootstrapping on each user. For each bin, suppose a perfect model would predict the exact average of that bin. Then we can use bootstrapping on the bin and resample reviews to estimate a baseline error.

@Expertium
Contributor

Can you describe it step-by-step, please?

@Expertium
Contributor

Expertium commented Jan 18, 2025

Also, regarding estimating the lowest value. For AUC and logloss, it can be done in the following way:

  1. For each user, we will need alpha and beta to model the distribution of the true probability of recall as a beta distribution.
  2. Then, we sample values of probability of recall from that distribution.
  3. Then, we convert them into ones and zeros.
  4. Finally, let's assume that an "Oracle" algorithm literally knows the true values (which we have, we sampled them) of the probability of recall. Then we can calculate logloss and AUC.

The problem is estimating alpha and beta for each user. Perhaps you can think of something.

EDIT: to make it more clear, here's some code

        import numpy as np
        from sklearn.metrics import log_loss, roc_curve, auc  # assuming sklearn for the metrics

        predictions = np.random.beta(alpha, beta, n)
        # alpha and beta must be estimated for each collection, n is the number of reviews in that collection
        random = np.random.rand(n)  # uniform distribution
        real = np.where(predictions > random, 1, 0)
        # for example, if a value in 'predictions' (of the Oracle algorithm as I call it) is 0.99
        # then there is a 99% chance that 'real' will be 1
        # and a 1% chance that it will be 0

        logloss_value = log_loss(real, predictions)  # sklearn's log_loss takes (y_true, y_pred)
        fpr, tpr, thresholds_roc = roc_curve(real, predictions)
        auc_value = auc(fpr, tpr)

@1DWalker
Contributor Author

The idea is that for a bin with n samples and a success proportion of p, suppose the oracle always spits out p for every sample in this bin. Then we keep resampling n values, each with success probability p, and calculate RMSE (bins) against the real p.
Example: if a bin has [0, 0, 1, 1], the oracle predicts p = 0.5 every time. But we resample a batch of [0, 1, 1, 1] with bootstrapping and get a squared error of (0.5 - 0.75)^2 = 0.0625, or something like that. Do this procedure many times to get an estimate for a baseline. For RMSE (bins) we actually don't need to sample when bootstrapping since we are just working with the binomial distribution.
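Roughly, in code (a sketch with made-up bins; the count weighting here might not match exactly how RMSE (bins) weights bins in the benchmark):

    import numpy as np

    rng = np.random.default_rng(0)

    # made-up bins for one user: (proportion of successes p, number of reviews n)
    bins = [(0.5, 4), (0.9, 50), (0.7, 20)]
    total = sum(n for _, n in bins)

    # Monte Carlo version of the bootstrap described above
    n_resamples = 10_000
    mse = []
    for _ in range(n_resamples):
        se = 0.0
        for p, n in bins:
            resampled_mean = rng.binomial(n, p) / n  # resample n reviews, each succeeding with probability p
            se += n * (p - resampled_mean) ** 2      # weight each bin by its review count
        mse.append(se / total)
    print("Monte Carlo baseline RMSE (bins):", np.sqrt(np.mean(mse)))

    # no sampling needed: the expected squared error of a binomial mean is p * (1 - p) / n per bin,
    # and weighting by n cancels the 1 / n
    analytic = np.sqrt(sum(p * (1 - p) for p, n in bins) / total)
    print("Analytic baseline RMSE (bins):  ", analytic)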

This would not be a strict lower bound, since a better oracle would know the exact probability of each individual review rather than predicting the same p for every review within the bin, and bootstrapping with such an oracle would simulate each review's own probability as well. But still, a baseline without exact per-review probabilities helps us get a better understanding of how well our algorithms are doing and how much potential is left when optimizing for RMSE (bins).

For AUC and log loss I don't know; they seem to rely even more on oracles that know per-review probabilities. RMSE (bins) only cares about averages within a bin, while log loss doesn't lose information by aggregating like that.

@1DWalker
Contributor Author

Regarding the point about a better oracle with access to per-card probabilities, we can also check how the baseline would change by simulating our own per-review probabilities. If the baseline changes too much, then we can scrap the simpler baseline estimation method.
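Something like this, as a sketch (the per-review probabilities are drawn from an arbitrary beta distribution purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # one bin with n reviews; give each review its own hypothetical "true" probability of recall
    n = 50
    true_p = rng.beta(5, 1.2, n)
    p_bar = true_p.mean()  # the bin average, which is what the oracle predicts at the bin level

    n_resamples = 20_000
    # baseline 1: simulate the bin as if every review shared the bin average p_bar
    shared = rng.binomial(n, p_bar, n_resamples) / n
    # baseline 2: simulate each review with its own probability; the bin-level
    # prediction is still p_bar, but the resampled bin average spreads out less
    per_review = (rng.random((n_resamples, n)) < true_p).mean(axis=1)

    print("baseline with a shared p:   ", np.sqrt(np.mean((shared - p_bar) ** 2)))
    print("baseline with per-review p: ", np.sqrt(np.mean((per_review - p_bar) ** 2)))
    # the second value should come out smaller, since sum(p_i * (1 - p_i)) <= n * p_bar * (1 - p_bar)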

@Expertium
Contributor

Expertium commented Jan 18, 2025

Previously, I estimated alpha and beta in a very crude way:

  1. Take probabilities given by FSRS
  2. Plot a calibration graph, like this:

Image

Well, plotting isn't necessary, but I'm just trying to illustrate it. This is from FSRS v4, btw.
3. Use linear regression to obtain a trendline (instead of the jagged blue line).
4. Using the slope and intercept, correct the FSRS predictions to eliminate systematic errors. The idea is that there are two types of errors: systematic and random. We can eliminate the systematic ones. For a perfect algorithm (the orange line), slope=1 and intercept=0. So if we can adjust the FSRS probabilities in such a way that slope=1 and intercept=0 on the calibration graph, there will be no systematic errors.

    import statsmodels.api as sm

    # predict() and load_brier() are helper functions from the old benchmark code
    p, y = predict(w_list, testsets, file=file)  # p is the probability of recall predicted by FSRS
    brier = load_brier(p, y, bins=20)
    bin_prediction_means = brier["detail"]["bin_prediction_means"]
    bin_correct_means = brier["detail"]["bin_correct_means"]
    bin_counts = brier["detail"]["bin_counts"]
    mask = bin_counts > 0
    # weighted least squares: regress actual retention on predicted retention,
    # weighting each bin by its review count
    fit_wls = sm.WLS(
        bin_correct_means[mask],
        sm.add_constant(bin_prediction_means[mask]),
        weights=bin_counts[mask],
    ).fit()

    # invert the linear fit (params[0] = intercept, params[1] = slope) to remove the systematic error
    corrected_p = [((x - fit_wls.params[0]) / fit_wls.params[1]) for x in p]

This is some really old code from an old version, before we changed binning.
5. Then we hand-wavily assume that this is good enough and that we have the true distribution of probabilities of recall.
6. Then we do beta_parameters = beta.fit(corrected_p, floc=0.0, fscale=1.0, method="MLE"). This gives us alpha and beta.
7. Then we do the thing from my previous comment: sampling from the beta distribution of each user.

The idea is to take FSRS predictions, eliminate systematic errors via a linear transformation of its outputs, assume that the result is close enough to the true distribution of probabilities of recall, and then get alpha and beta from that distribution.
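For steps 6) and 7), the remaining part looks roughly like this (a sketch; corrected_p is the output of the snippet above, clipped into (0, 1) so the fit is well defined):

    import numpy as np
    from scipy.stats import beta

    # corrected_p comes from the snippet above
    corrected_p = np.clip(corrected_p, 0.0001, 0.9999)

    # step 6: fit a beta distribution to the corrected probabilities
    a, b, loc, scale = beta.fit(corrected_p, floc=0.0, fscale=1.0, method="MLE")

    # step 7: sample "true" probabilities of recall for the Oracle and simulate outcomes,
    # then compute logloss and AUC as in my earlier comment
    n = len(corrected_p)
    predictions = np.random.beta(a, b, n)
    real = np.where(predictions > np.random.rand(n), 1, 0)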

Btw, I got that the logloss of the Oracle would be 0.27 and the AUC of the Oracle would be 0.83.

@1DWalker
Contributor Author

Interesting. Do you have a similar graph for a recent FSRS version?

@Expertium
Contributor

Expertium commented Jan 18, 2025

You can plot one here: https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/v5.3.1/fsrs4anki_optimizer.ipynb

It's plotted after optimization

I wonder if we can get alpha and beta from the bins themselves, without using FSRS predictions. It would probably be more accurate.

I mean, if you look at this, it sure as hell looks like a beta distribution. So we can probably somehow skip the part with using algorithmic predictions and just use the bins.

Image

Here's a new one from the optimizer (not my collection, but whatever):

Image

Btw, here's a fun calculator: https://homepage.divms.uiowa.edu/~mbognar/applets/beta.html

Image

@1DWalker
Contributor Author

I feel like log loss is too sensitive to what an optimal oracle could actually do. An optimal oracle could perhaps be given a 24/7 feed of a user's life, know every emotion that user experiences, and have access to the neurons in their brain. Then it could predict with near certainty 1.00 or 0.00 for every single review. But this also affects my proposed RMSE (bins) baseline estimation: while the proposed method would predict a non-zero baseline, a perfect oracle would indeed achieve an RMSE (bins) of 0. I think we should just discard these ideas. It's too hard to separate systematic errors from things that are actually random for a reasonable, non-intrusive algorithm.

By the way, this maybe shows why a baseline + 1 power curve (+ ceil) works well for a neural network model. A single power curve probably works poorly for low R if it has to keep up with higher R values. But FSRS v4's curve is different from FSRS v4.5 onward, so maybe that is not the case.

@Expertium
Contributor

Expertium commented Jan 18, 2025

When I say "Oracle", I mean "an algorithm that knows the true probability of recall for a given card at a given time". I DO NOT mean "an algorithm that can peek into the human brain, check the status of each neuron, and output exactly 0.0 or 1.0 and nothing in between".

And I don't want to discard this because it's fun.

The problem is that the number (and choice) of bins will affect the estimates of alpha and beta, though. Anyway, can you take some random collection from the dataset and give me a list with the number of reviews in each bin, and a list of corresponding average retentions in each bin? Or a python dictionary instead of two lists, whatever. I want to do some curve-fitting.

@1DWalker
Contributor Author

I just fear that the baselines will be too inaccurate. As you suggest, there could be more systematic errors that aren't accounted for in a simple predicted R vs. actual R model. For example, we could use the bins from RMSE (bins) for a more fine-grained calibration. The possibilities are endless. In the end, the best baseline we get from this will be the best model that we can come up with. E.g., have you checked whether FSRS v4 combined with the calibration produces a better model than FSRS v4 alone? That is, do this calibration on the train set and see if it improves performance on the test set. If so, we may come up with an FSRS v4.1 that incorporates this calibration and is simply a better model. If it is somehow a worse model, then we have shown that calibration is unreliable for reducing systematic errors.

Well, I understand the point that it's fun; that's why I made this issue in the first place, and that's why we're working on srs-benchmark.

@1DWalker
Contributor Author

So you want something like {bin1: (0.90, 16), bin2: (0.95, 23), ...}?

@Expertium
Contributor

So you want something like {bin1: (0.90, 16), bin2: (0.95, 23), ...}?

Yep.

@Expertium
Contributor

Expertium commented Jan 18, 2025

That is, do this calibration on the train set and see if it improves performance on the test set. If so, we may come up with an FSRS v4.1 that incorporates this calibration and is simply a better model.

I'm not sure if that would be usable in practice, though.
EDIT: ok, no, it definitely wouldn't.
Take a look at this:

Image

Slope = 0.964, intercept = 0.030.

Now let's transform the probability given by FSRS in the following way:
corrected_p = (p-intercept)/slope

What will be the minimum and maximum of corrected_p?
minimum = (0-intercept)/slope = (0-0.03)/0.964 = -0.0311
maximum = (1-intercept)/slope = (1-0.03)/0.964 = 1.006

This means that we can get negative probabilities and/or probabilities greater than 1.
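A quick sanity check of those numbers:

    import numpy as np

    slope, intercept = 0.964, 0.030
    p = np.array([0.0, 0.5, 1.0])
    corrected_p = (p - intercept) / slope
    print(corrected_p)  # roughly [-0.0311, 0.4876, 1.0062]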

Btw, in my old code I had to do this:
corrected_p = np.clip(corrected_p, 0.0001, 0.9999)

Ok, maybe this approach is crap after all.

BUT, if we estimate the alpha and beta from the bins themselves, without algorithmic predictions, then we're good.

@1DWalker
Contributor Author

@Expertium https://github.com/1DWalker/srs-benchmark/tree/bin-stats

You can find the bin statistics for 20 users in the bin-stats folder, or you can work with the dicts directly in get_bin_stats.py, though that will take some time to process.

@Expertium
Contributor

Expertium commented Jan 18, 2025

Image

Alright, this looks promising. Kinda.

Code:

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import gamma

def betafunc(x, a, b, scale):
    # scaled beta distribution pdf: its shape is fitted to the histogram of bin retentions
    return scale * gamma(a + b) * (x ** (a - 1)) * ((1 - x) ** (b - 1)) / (gamma(a) * gamma(b))

# dictionary of {bin: (average_retention, n_reviews)} for one user
bin_data = ...

retentions = []
n_reviews = []
for value in bin_data.values():
    retentions.append(np.clip(value[0], 0.0001, 0.9999))  # keep retentions strictly inside (0, 1)
    n_reviews.append(value[1])

retentions = np.asarray(retentions)
n_reviews = np.asarray(n_reviews)

params, _ = curve_fit(
    betafunc, retentions, n_reviews,
    p0=(5, 1.2, 100),
    bounds=((0.01, 0.01, 0.01), (100, 100, 100_000)),
)
alpha = params[0]
beta = params[1]
total_reviews = np.sum(n_reviews)

Now, if you don't mind, I want you to use this code to calculate the last three things for each user (all 10k of them): alpha, beta and total_reviews. Then make a .jsonl file where each line looks like this: {'alpha': alpha, 'beta': beta, 'total_reviews': total_reviews}. If there are NaNs or curve_fit fails or something, just exclude those users.
Then I'll use the data from that .jsonl file to run simulations to calculate the logloss and AUC of an Oracle.
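Something along these lines (a sketch; user_bin_data, how it is loaded, and the output file name are placeholders, since the exact layout of the bin-stats files is up to you; betafunc is the function defined above):

    import json

    import numpy as np
    from scipy.optimize import curve_fit

    # user_bin_data: dict mapping user id -> bin dictionary in the same
    # {bin: (average_retention, n_reviews)} format as above
    with open("oracle_params.jsonl", "w") as f:  # output file name is arbitrary
        for user_id, bin_data in user_bin_data.items():
            retentions, n_reviews = [], []
            for value in bin_data.values():
                retentions.append(np.clip(value[0], 0.0001, 0.9999))
                n_reviews.append(value[1])
            retentions = np.asarray(retentions)
            n_reviews = np.asarray(n_reviews)
            try:
                params, _ = curve_fit(
                    betafunc, retentions, n_reviews,
                    p0=(5, 1.2, 100),
                    bounds=((0.01, 0.01, 0.01), (100, 100, 100_000)),
                )
            except (RuntimeError, ValueError):
                continue  # exclude users where the fit fails
            if np.any(np.isnan(params)):
                continue  # exclude users with NaNs
            f.write(json.dumps({
                "alpha": float(params[0]),
                "beta": float(params[1]),
                "total_reviews": int(np.sum(n_reviews)),
            }) + "\n")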

@Expertium
Contributor

Alright, I'll do it myself and then report the results.
