Add GRU-P-secs to the table #149

Open
1DWalker opened this issue Jan 7, 2025 · 23 comments

Comments

@1DWalker
Contributor

1DWalker commented Jan 7, 2025

It would be interesting to see exactly how much -secs can benefit a model like GRU-P. Combined with the -short results, this would tell us how much potential there is for FSRS to incorporate same-day reviews with proper second-resolution data.

-secs is not directly comparable with non -secs because of how the preprocessing is done, which results in a different number of reviews being considered. -secs also includes 1 more user than non -secs. A possible solution would be to evaluate -secs only on the reviews that would also be evaluated by a non -secs model.
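A rough sketch of that idea, assuming each review can be keyed by card_id plus a review index (the column names here, including review_th, are assumptions for illustration, not necessarily the benchmark's actual ones):

```python
import pandas as pd

def align_secs_test_set(secs_test: pd.DataFrame, non_secs_test: pd.DataFrame) -> pd.DataFrame:
    # Keep only the -secs test rows whose review key also survived the
    # non -secs preprocessing, so both variants are scored on identical reviews.
    keys = ["card_id", "review_th"]  # assumed unique identifier of a review
    shared = non_secs_test[keys].drop_duplicates()
    return secs_test.merge(shared, on=keys, how="inner")
```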

```
Model: GRU-P-short
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
GRU-P-short LogLoss (mean±std): 0.3195±0.1469
GRU-P-short RMSE(bins) (mean±std): 0.0421±0.0288
GRU-P-short AUC (mean±std): 0.7096±0.0815

Model: GRU-P-secs
Total number of users: 10000
Total number of reviews: 519296315
Weighted average by reviews:
GRU-P-secs LogLoss (mean±std): 0.3293±0.1430
GRU-P-secs RMSE(bins) (mean±std): 0.0495±0.0266
GRU-P-secs AUC (mean±std): 0.7409±0.0744
```

@Expertium
Contributor

It doesn't perform better, as you can tell from LogLoss and RMSE (AUC is a bit better, though). In fact, you can try this with any other model and you will see that -secs models have worse metrics.
Considering that even GRU-P-secs doesn't outperform its non-secs variant despite being very flexible, it seems like same-day reviews just suck.

@1DWalker
Contributor Author

1DWalker commented Jan 7, 2025

I don't think they are directly comparable, since the review counts don't even match up. In theory, GRU-P-short uses short-term data but GRU-P-secs does not have access to this short-term data. As you have suggested elsewhere, -secs is useless for review steps of >= 1 day, but it can be argued that when -secs and -short are combined, the benefit will be greater than with -short alone. From my current understanding, GRU-P-short is just given a tensor indicating short-term data with no time information, such as [0, 3, 3, 0, 2, 3], which is just what the user presses in Anki. No information whatsoever is provided about the timing of these reviews.
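To make that concrete, a made-up illustration (these tensors are invented, not taken from the benchmark): -short only sees the sequence of button presses, while second-resolution data would additionally carry fractional elapsed times.

```python
import torch

# What -short roughly sees: the rating sequence alone, with no timing
# attached to same-day reviews (values as in the example above).
ratings = torch.tensor([0, 3, 3, 0, 2, 3])

# What second-resolution data would add: fractional elapsed intervals in days
# (these numbers are invented for illustration).
elapsed_days = torch.tensor([0.0, 0.007, 0.012, 1.3, 0.005, 4.2])
```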

@Expertium
Contributor

@L-M-Sherlock how about a -shortsecs (or whatever you wanna call it) model?
-short:

  • Doesn't predict the probability of recall for same-day reviews
  • Doesn't have access to real interval lengths for same-day reviews

-secs:

  • Does predict the probability of recall for same-day reviews
  • Does have access to real interval lengths for same-day reviews

-shortsecs:

  • Doesn't predict the probability of recall for same-day reviews (like -short)
  • Does have access to real interval lengths for same-day reviews (like -secs)

If you do that, I'll work on the formulas. Well, provided I figure out how to run the damn thing (I hate the benchmark code so much it's unreal).
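A sketch of how -shortsecs preprocessing could look, under the assumption that same-day rows stay in the model input but are masked out of the loss and the metrics (all names and columns are illustrative):

```python
import pandas as pd

def prepare_shortsecs(df: pd.DataFrame) -> pd.DataFrame:
    # Assumes an elapsed_seconds column: seconds since the previous review of the card.
    df = df.copy()
    # Like -secs: real (fractional) interval lengths, expressed in days.
    df["interval"] = df["elapsed_seconds"] / 86400
    # Like -short: same-day reviews are fed to the model as input, but their
    # predicted R is excluded from both the training loss and the evaluation metrics.
    df["in_loss"] = df["interval"] >= 1.0
    return df
```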

@L-M-Sherlock
Member

@Expertium
Contributor

That's basically what I'm asking for, right? Don't measure the accuracy of predicted R for same-day reviews, while still using real interval lengths?

@Expertium
Contributor

@L-M-Sherlock am I right? (see above)

@L-M-Sherlock
Member

Yep.

@Expertium
Contributor

Expertium commented Jan 10, 2025

FSRS-5 --secs --no_test_same_day is much worse than just FSRS-5, and I don't know why.

I was expecting them to be roughly the same, since it's basically just FSRS-5: same formulas, nothing new, just fractional intervals. Also, the number of reviews isn't exactly the same.
@L-M-Sherlock any idea what might be going on?
Also, I'll run FSRS-5 --secs to see how much it differs from FSRS-5 --secs --no_test_same_day

@L-M-Sherlock
Member

any idea what might be going on?

Obviously, FSRS-5-secs improves its ability to predict short-term memory at the cost of its ability to predict long-term memory.

@Expertium
Contributor

But it's the same formulas. I didn't change the math. I don't understand why there is such a big difference.

@L-M-Sherlock
Member

Oops, I get it. You need to replace "delta_t" with "elapsed_days" here:

srs-benchmark/other.py

Lines 2498 to 2500 in e1a5ba4

```python
if SHORT_TERM:
    df = df[(df["delta_t"] != 0) | (df["i"] == 1)].copy()
    df["i"] = df.groupby("card_id").cumcount() + 1
```

Then FSRS-5 will not predict the probability of recall for same-day reviews during training.
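Applied to the quoted snippet, the suggested edit would look roughly like this (a sketch, not a tested patch):

```python
if SHORT_TERM:
    # filter on elapsed_days instead of delta_t, so that fractional --secs
    # intervals don't keep same-day reviews in the training set
    df = df[(df["elapsed_days"] != 0) | (df["i"] == 1)].copy()
    df["i"] = df.groupby("card_id").cumcount() + 1
```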

@1DWalker
Contributor Author

The issue still exists. With your suggested change, df will only contain values with elapsed_days > 0, which makes --no_test_same_day do nothing.

srs-benchmark/other.py

Lines 2591 to 2592 in e1a5ba4

```python
if NO_TEST_SAME_DAY:
    test_set = test_set[test_set["elapsed_days"] > 0].copy()
```

I think it may have to do with this filtering step, which I don't understand, since it reduces the size of the dataframe:

srs-benchmark/other.py

Lines 2501 to 2513 in e1a5ba4

```python
if not SECS_IVL:
    filtered_dataset = (
        df[df["i"] == 2]
        .groupby(by=["first_rating"], as_index=False, group_keys=False)[df.columns]
        .apply(remove_outliers)
    )
    if filtered_dataset.empty:
        return pd.DataFrame()
    df[df["i"] == 2] = filtered_dataset
    df.dropna(inplace=True)
    df = df.groupby("card_id", as_index=False, group_keys=False)[df.columns].apply(
        remove_non_continuous_rows
    )
```

@Expertium
Contributor

@L-M-Sherlock here's what I recommend:

  1. Make it so that --secs only affects intervals (integer vs float) and NOTHING else
  2. Rename no_test_same_day to predict_same_day_r and make it so that if it's enabled, predicted R for same-day reviews is used both during training and evaluation; if it's disabled, predicted R for same-day reviews is used neither for training nor for evaluation (see the sketch after this list).
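A minimal sketch of the proposed predict_same_day_r semantics (the flag and column names are illustrative, not existing benchmark options):

```python
import pandas as pd

def apply_same_day_policy(split: pd.DataFrame, predict_same_day_r: bool) -> pd.DataFrame:
    # Pass both the training and the test split through this function.
    # With the flag enabled, same-day predictions are kept everywhere;
    # with it disabled, same-day reviews are neither trained on nor evaluated.
    if predict_same_day_r:
        return split
    return split[split["elapsed_days"] > 0].copy()
```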

@L-M-Sherlock
Member

L-M-Sherlock commented Jan 13, 2025

  • Make it so that --secs only affects intervals (integer vs float) and NOTHING else

It's impossible to apply the current remove_outliers to the --secs case. So I don't know how to make it affect only intervals and NOTHING else, unless remove_outliers is removed from the whole benchmark.

@L-M-Sherlock
Member

any idea what might be going on?

Another reason: FSRS treats reviews with elapsed days < 1 as short-term reviews. When the intervals are floats, they are not rounded, so some long-term reviews (where the elapsed time is shorter than 1 day but crosses a sleep period) are treated as short-term reviews.
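A toy illustration of that effect (numbers invented): a review done the next morning can have a fractional gap below 1 day, so under --secs it is classified as short-term even though a night of sleep was crossed.

```python
elapsed_seconds = 16 * 3600                   # reviewed 16 hours later, after sleeping
elapsed_days_frac = elapsed_seconds / 86400   # ~0.67 when intervals stay fractional
elapsed_days_int = 1                          # day-resolution data counts this as 1 day

print(elapsed_days_frac < 1)  # True  -> treated as a short-term review under --secs
print(elapsed_days_int < 1)   # False -> treated as a long-term review otherwise
```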

@L-M-Sherlock
Member

After 09d5c4a, the gap between FSRS-5-secs and FSRS-5 is not that huge:

Model: FSRS-5-secs
Total number of users: 1990
Total number of reviews: 69187535
Weighted average by reviews:
FSRS-5-secs LogLoss (mean±std): 0.3537±0.1622
FSRS-5-secs RMSE(bins) (mean±std): 0.0686±0.0393
FSRS-5-secs AUC (mean±std): 0.6968±0.0785

Weighted average by log(reviews):
FSRS-5-secs LogLoss (mean±std): 0.3760±0.1781
FSRS-5-secs RMSE(bins) (mean±std): 0.0887±0.0523
FSRS-5-secs AUC (mean±std): 0.6940±0.0864

Weighted average by users:
FSRS-5-secs LogLoss (mean±std): 0.3790±0.1809
FSRS-5-secs RMSE(bins) (mean±std): 0.0918±0.0539
FSRS-5-secs AUC (mean±std): 0.6936±0.0885

parameters: [0.43215, 1.36565, 3.37595, 19.0261, 7.15305, 0.5207, 1.71375, 0.01115, 1.5362, 0.1723, 1.01335, 1.9256, 0.10845, 0.28325, 2.27665, 0.2308, 2.9898, 0.4275, 0.8216]

Model: FSRS-5
Total number of users: 1990
Total number of reviews: 66177965
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3415±0.1558
FSRS-5 RMSE(bins) (mean±std): 0.0546±0.0334
FSRS-5 AUC (mean±std): 0.7047±0.0813

Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3590±0.1681
FSRS-5 RMSE(bins) (mean±std): 0.0712±0.0445
FSRS-5 AUC (mean±std): 0.7011±0.0878

Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3614±0.1704
FSRS-5 RMSE(bins) (mean±std): 0.0737±0.0459
FSRS-5 AUC (mean±std): 0.7005±0.0898

parameters: [0.4208, 1.1358, 3.00825, 15.49955, 7.1794, 0.54405, 1.7145, 0.00635, 1.51585, 0.1256, 1.00245, 1.9359, 0.1069, 0.2932, 2.27565, 0.23055, 2.9898, 0.4591, 0.63255]

@Expertium
Contributor

Expertium commented Jan 13, 2025

I still think that it would be best to make the number of reviews exactly equal. But OK, I'll see if I can improve the short-term S formulas now. So do I need --secs --no_test_same_day?

@1DWalker
Contributor Author

The -secs and non -secs runs use a different number of reviews, so it's an apples-to-oranges comparison. I would rather go down the route of not removing outliers, so that the reviews being compared are actually the same and we properly benchmark algorithms that use all the available short-term information.

@Expertium
Contributor

@L-M-Sherlock I agree with 1DWalker. Let's remove the outlier filter if that's the case. I really want to try an FSRS variant that uses fractional interval lengths but doesn't predict R for same-day reviews, neither in training nor in evaluation.

@1DWalker
Contributor Author

Each review can already be uniquely identified. Maybe we can just mark which reviews would be removed by the filter and exclude them when testing, similar to what was attempted for --no_test_same_day?
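A hedged sketch of that marking idea (column names, including review_th, are assumptions): run the non -secs outlier filter once, record which review keys survive, and select only those at test time, mirroring how --no_test_same_day filters the test set.

```python
import pandas as pd

def mark_filter_survivors(df: pd.DataFrame, filtered: pd.DataFrame) -> pd.DataFrame:
    # `filtered` is the dataframe after the non -secs outlier filter.
    keys = ["card_id", "review_th"]  # assumed unique identifier of a review
    survivors = filtered[keys].drop_duplicates().assign(kept_by_filter=True)
    out = df.merge(survivors, on=keys, how="left")
    out["kept_by_filter"] = out["kept_by_filter"].fillna(False)
    return out

# at evaluation time, analogous to the --no_test_same_day filtering:
# test_set = test_set[test_set["kept_by_filter"]].copy()
```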

@L-M-Sherlock
Member

I prefer to remove the filter entirely. It's more convenient for me. If you have another solution, a PR is welcome.

@1DWalker
Contributor Author

Removing the filter would mean rebuilding the entire benchmark. I'll work on the issue.

@1DWalker
Contributor Author

There is still a bug with --secs where RMSE is not calculated properly, because the RMSE(bins) calculation depends on delta_t, which is modified by --secs.
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/main/src/fsrs_optimizer/fsrs_optimizer.py#L2156
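A simplified illustration of why binning on delta_t gives different RMSE(bins) for identical predictions once intervals become fractional (this is not the optimizer's actual RMSE code, just the general mechanism):

```python
import numpy as np
import pandas as pd

def rmse_bins(df: pd.DataFrame) -> float:
    # Bin reviews by delta_t, then compare mean prediction vs. mean outcome per bin.
    g = df.groupby("delta_t").agg(p=("p", "mean"), y=("y", "mean"), n=("y", "size"))
    return float(np.sqrt(np.average((g["p"] - g["y"]) ** 2, weights=g["n"])))

reviews = pd.DataFrame({"y": [1, 0, 1, 1], "p": [0.9] * 4})
print(rmse_bins(reviews.assign(delta_t=[1, 1, 2, 2])))          # integer day intervals
print(rmse_bins(reviews.assign(delta_t=[0.7, 1.3, 1.8, 2.4])))  # fractional --secs intervals
```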

As an example, I made a model that only predicts a constant 90% retention.

Normal:
{"metrics": {"RMSE": 0.487103, "LogLoss": 0.729563, "RMSE(bins)": 0.286898, "ICI": 0.1, "AUC": 0.5}, "user": 1, "size": 10620}
{"metrics": {"RMSE": 0.38849, "LogLoss": 0.492415, "RMSE(bins)": 0.125703, "ICI": 0.1, "AUC": 0.5}, "user": 2, "size": 35900}
{"metrics": {"RMSE": 0.269809, "LogLoss": 0.277834, "RMSE(bins)": 0.060215, "ICI": 0.1, "AUC": 0.5}, "user": 3, "size": 4255}
{"metrics": {"RMSE": 0.374166, "LogLoss": 0.462409, "RMSE(bins)": 0.131994, "ICI": 0.1, "AUC": 0.5}, "user": 4, "size": 7360}
{"metrics": {"RMSE": 0.423296, "LogLoss": 0.570017, "RMSE(bins)": 0.146756, "ICI": 0.1, "AUC": 0.5}, "user": 5, "size": 6275}

--secs --equalize_test_with_non_secs:
{"metrics": {"RMSE": 0.487103, "LogLoss": 0.729563, "RMSE(bins)": 0.286324, "ICI": 0.1, "AUC": 0.5}, "user": 1, "size": 10620}
{"metrics": {"RMSE": 0.38849, "LogLoss": 0.492415, "RMSE(bins)": 0.133124, "ICI": 0.1, "AUC": 0.5}, "user": 2, "size": 35900}
{"metrics": {"RMSE": 0.269809, "LogLoss": 0.277834, "RMSE(bins)": 0.067849, "ICI": 0.1, "AUC": 0.5}, "user": 3, "size": 4255}
{"metrics": {"RMSE": 0.374166, "LogLoss": 0.462409, "RMSE(bins)": 0.143521, "ICI": 0.1, "AUC": 0.5}, "user": 4, "size": 7360}
{"metrics": {"RMSE": 0.423296, "LogLoss": 0.570017, "RMSE(bins)": 0.17402, "ICI": 0.1, "AUC": 0.5}, "user": 5, "size": 6275}

Note how, while the LogLoss values are exactly the same, the RMSE(bins) values differ. I'll work on a fix.
