Add GRU-P-secs to the table #149
It doesn't perform better, as you can tell from the LogLoss and RMSE (AUC is a bit better, though). In fact, you can try this with any other model and you will see that -secs models have worse metrics.
I don't think that they are directly comparable, since the review counts don't even match up. In theory, GRU-P-short uses short-term data, but GRU-P-secs does not have access to this short-term data. As you have suggested elsewhere, -secs is useless for review steps of >= 1 day, but it can be argued that when -secs and -short are combined, the benefit will be greater than with -short alone. From my current understanding, GRU-P-short is just given a tensor indicating short-term data with no time information, such as [0, 3, 3, 0, 2, 3], which are just the buttons the user presses in Anki. No information whatsoever is provided about when those reviews happened.
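For illustration, here is a minimal sketch (not the benchmark's actual code) of the difference between a rating-only short-term input and a hypothetical -shortsecs input that also carries second-resolution timing; the tensor layout, the padding value 0, and all names are assumptions:

```python
import torch

# Hypothetical same-day review history for one card.
# Ratings are the Anki buttons (1=Again, 2=Hard, 3=Good, 4=Easy); 0 is assumed to be padding.
ratings = torch.tensor([0, 3, 3, 0, 2, 3], dtype=torch.float32)

# A "-short"-style input: ratings only, no timing information.
short_input = ratings.unsqueeze(-1)  # shape (seq_len, 1)

# A hypothetical "-shortsecs"-style input: ratings plus elapsed time in fractional days,
# so the model also sees *when* each press happened.
elapsed_secs = torch.tensor([0, 35, 610, 0, 7200, 86400], dtype=torch.float32)
shortsecs_input = torch.stack([ratings, elapsed_secs / 86400.0], dim=-1)  # shape (seq_len, 2)

print(short_input.shape, shortsecs_input.shape)
```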
@L-M-Sherlock how about a -shortsecs (or whatever you wanna call it) model?
-secs:
-shortsecs:
If you do that, I'll work on the formulas. Well, provided I figure out how to run the damn thing (I hate the benchmark code so much it's unreal).
I have added something to make them comparable:
That's basically what I'm asking for, right? Don't measure the accuracy of predicted R for same-day reviews while still using real interval lengths.
@L-M-Sherlock am I right? (see above)
Yep.
FSRS-5 --secs --no_test_same_day is much worse than just FSRS-5. I don't know why; I was expecting them to be roughly the same, since it's basically just FSRS-5: same formulas, nothing new, just fractional intervals. Also, the number of reviews isn't exactly the same.
Obviously, FSRS-5-secs improves its ability to predict short-term memory at the cost of its ability to predict long-term memory.
But it's the same formulas. I didn't change the math, so I don't understand why there is such a big difference.
Oops, I get it. You need to replace "delta_t" with "elapsed_days" here: Lines 2498 to 2500 in e1a5ba4
Then FSRS-5 will not predict the probability of recall for same-day reviews during training.
The issue still exists. With your suggested change, df will only contain values with elapsed_days > 0, which makes it so that --no_test_same_day does nothing. Lines 2591 to 2592 in e1a5ba4
I think it may have to do with this filtering step, which I don't understand, as it reduces the size of the dataframe: Lines 2501 to 2513 in e1a5ba4
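A toy pandas sketch of the interaction described above (column names and the flag's behavior are assumptions, not the benchmark's actual code): if the dataframe is filtered to elapsed_days > 0 before evaluation, a later "skip same-day reviews when testing" step has nothing left to skip.

```python
import pandas as pd

# Toy data: "elapsed_days" is the whole-day interval, "delta_t" the fractional one (assumed names).
df = pd.DataFrame({
    "elapsed_days": [0, 0, 1, 5],
    "delta_t":      [0.01, 0.4, 1.2, 5.3],
    "y":            [1, 1, 0, 1],
})

# The suggested change filters on elapsed_days instead of delta_t,
# so all same-day rows (elapsed_days == 0) are dropped up front.
df = df[df["elapsed_days"] > 0]

# A later "--no_test_same_day"-style step then has nothing to exclude:
test_df = df[df["elapsed_days"] > 0]
assert len(test_df) == len(df)  # the flag no longer changes anything
```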
@L-M-Sherlock here's what I recommend:
It's impossible to apply the current |
Another reason: FSRS treats reviews with elapsed days < 1 as short-term reviews. When the intervals are float, they are not rounded, so some long-term reviews (where the elapsed time is shorter than 1 day but crosses a sleep period) are treated as short-term reviews.
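A small illustration of that boundary effect (the "elapsed days < 1 counts as short-term" rule is taken from the comment above; how the non-secs pipeline derives whole-day intervals is an assumption, and the numbers are made up):

```python
SECONDS_PER_DAY = 86400

# Review at 23:00, next review at 08:00 the following morning (9 hours later, across a night).
elapsed_seconds = 9 * 3600

# Non-secs pipeline (assumed): intervals are whole days, so crossing the day boundary gives 1 day,
# and the review is handled as a long-term review.
elapsed_days_whole = 1

# -secs pipeline: the raw fractional interval is kept.
elapsed_days_fractional = elapsed_seconds / SECONDS_PER_DAY  # 0.375

SHORT_TERM_THRESHOLD = 1.0  # "elapsed days < 1 counts as short-term"
print(elapsed_days_whole < SHORT_TERM_THRESHOLD)       # False -> long-term
print(elapsed_days_fractional < SHORT_TERM_THRESHOLD)  # True  -> treated as short-term
```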
After 09d5c4a, the gap between FSRS-5-secs and FSRS-5 is not that huge:
I still think that it would be best to make the number of reviews exactly equal. But OK, I'll see if I can improve the short-term S formulas now. So do I need
The -secs and non-secs models use a different number of reviews, so it's an apples-to-oranges comparison. I would rather go down the route of not removing outliers, so that the reviews being compared are actually the same and so that we properly benchmark algorithms that use all the available short-term information.
@L-M-Sherlock I agree with 1DWalker. Let's remove the outlier filter if that's the case. I really want to try an FSRS that uses fractional interval lengths but doesn't predict R for same-day reviews, neither in training nor in evaluation.
Each review can already be uniquely identified. Maybe we can just mark which reviews would be removed by the filter and exclude them when testing, similar to what was attempted for
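A minimal sketch of that idea (column names and the filter rule are hypothetical, not the benchmark's): keep every review, flag the ones the outlier filter would have dropped, and score every model only on the unflagged rows so -secs and non-secs are evaluated on the same review set.

```python
import pandas as pd

def mark_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows the outlier filter would have removed instead of deleting them.
    The rule used here is a placeholder, not the benchmark's actual filter."""
    df = df.copy()
    df["is_outlier"] = df["delta_t"] > 365 * 10  # placeholder rule
    return df

reviews = pd.DataFrame({"delta_t": [0.5, 3.0, 5000.0], "y": [1, 0, 1]})
reviews = mark_outliers(reviews)

train_df = reviews                          # training can follow its own policy
eval_df = reviews[~reviews["is_outlier"]]   # every model is scored on the same shared subset
print(len(train_df), len(eval_df))
```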
I prefer to remove the filter entirely; it's more convenient for me. If you have another solution, a PR is welcome.
Removing the filter would mean rebuilding the entire benchmark. I'll work on the issue.
There is still a bug with --equalize_test_with_non_secs. As an example, I made a model that only predicts a constant 90% retention.
Normal:
--secs --equalize_test_with_non_secs:
See how, while the LogLoss values are exactly the same, the RMSE(bins) values are different. I'll work on a fix.
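A toy illustration of why the two metrics can diverge (the binning scheme below is deliberately simplified and is not the benchmark's actual RMSE(bins) definition): LogLoss is a per-review average, so it is unchanged as long as the predictions and labels match, while RMSE(bins) first groups reviews into bins, so the same predictions binned by fractional intervals versus whole-day intervals can yield different values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
p = np.full(n, 0.9)                        # constant 90% retention prediction
y = (rng.random(n) < 0.9).astype(float)    # simulated outcomes
delta_secs = rng.uniform(0.1, 30.0, n)     # fractional-day intervals
delta_days = np.ceil(delta_secs)           # whole-day intervals

# Per-review LogLoss: identical no matter how intervals are represented.
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def rmse_bins(pred, label, interval):
    # Simplified: bin by interval only; the real metric's binning is more involved.
    df = pd.DataFrame({"p": pred, "y": label, "bin": np.floor(np.log1p(interval))})
    g = df.groupby("bin").agg(p=("p", "mean"), y=("y", "mean"), n=("p", "size"))
    return np.sqrt(np.average((g["p"] - g["y"]) ** 2, weights=g["n"]))

print(log_loss)
print(rmse_bins(p, y, delta_secs))   # bins formed from fractional intervals
print(rmse_bins(p, y, delta_days))   # bins formed from whole-day intervals -> can differ
```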
It would be interesting to see how exactly -secs can benefit a model like GRU-P. With this information, combined with -short, we can infer how much potential there is for FSRS to incorporate same-day reviews with proper second-resolution data.
-secs is incomparable with non-secs due to how the preprocessing is done, resulting in a different number of reviews being considered. -secs also considers 1 more user than non-secs. A possible solution would be to evaluate -secs only on reviews that would be evaluated with a non-secs model.
```
Model: GRU-P-short
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
GRU-P-short LogLoss (mean±std): 0.3195±0.1469
GRU-P-short RMSE(bins) (mean±std): 0.0421±0.0288
GRU-P-short AUC (mean±std): 0.7096±0.0815

Model: GRU-P-secs
Total number of users: 10000
Total number of reviews: 519296315
Weighted average by reviews:
GRU-P-secs LogLoss (mean±std): 0.3293±0.1430
GRU-P-secs RMSE(bins) (mean±std): 0.0495±0.0266
GRU-P-secs AUC (mean±std): 0.7409±0.0744
```