
Add percentiles? #156

Open
1DWalker opened this issue Jan 18, 2025 · 6 comments

@1DWalker
Contributor

Well done on the regularization! FSRS-5-recency is now stronger than GRU-P on more collections.
Interestingly, GRU-P still does better when we take an average. Maybe GRU-P mostly wins on users who are not representative of the typical Anki user, but wins by a large margin on them, which pulls its average up.

To check, I added the 0th, 25th, 50th, 75th, and 100th percentiles to evaluate.py.
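The computation is just a quantile call over the per-user metric values. A minimal sketch, assuming the per-user losses are already collected in an array (names and data here are made up, not the actual evaluate.py variables):

```python
import numpy as np

# Hypothetical per-user metric values; in evaluate.py these would be the
# per-collection LogLoss / RMSE(bins) / AUC numbers.
rng = np.random.default_rng(0)
losses = rng.gamma(shape=2.0, scale=0.16, size=9999)

# 0th, 25th, 50th, 75th, 100th percentiles (min, quartiles, max).
quantiles = np.quantile(losses, [0.0, 0.25, 0.5, 0.75, 1.0])
print("LogLoss quantiles:", np.round(quantiles, 4).tolist())
```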

Model: GRU-P
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
GRU-P LogLoss (mean±std): 0.3251±0.1508
GRU-P LogLoss quantiles: [0.0001, 0.2202, 0.331, 0.4321, 1.2325]
GRU-P RMSE(bins) (mean±std): 0.0433±0.0288
GRU-P RMSE(bins) quantiles: [0.0001, 0.0252, 0.0366, 0.0533, 0.6223]
GRU-P AUC (mean±std): 0.6991±0.0812
GRU-P AUC quantiles: [0.0044, 0.6615, 0.7002, 0.7398, 0.9938]

Model: FSRS-5-recency
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency LogLoss (mean±std): 0.3256±0.1519
FSRS-5-recency LogLoss quantiles: [0.0007, 0.2192, 0.3305, 0.4325, 1.2199]
FSRS-5-recency RMSE(bins) (mean±std): 0.0493±0.0321
FSRS-5-recency RMSE(bins) quantiles: [0.0012, 0.0289, 0.0421, 0.0606, 0.4159]
FSRS-5-recency AUC (mean±std): 0.7056±0.0755
FSRS-5-recency AUC quantiles: [0.0021, 0.6664, 0.705, 0.7488, 0.9979]

Here's a version that uses 11 equally spaced percentiles instead, weighted by reviews:
GRU-P:
[0.0001, 0.1156, 0.1945, 0.247, 0.2881, 0.331, 0.3673, 0.4099, 0.4591, 0.5227, 1.2325]
FSRS-5-recency:
[0.0007, 0.1151, 0.1932, 0.247, 0.2892, 0.3305, 0.3677, 0.4108, 0.4569, 0.5268, 1.2199]
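That's the same computation as the sketch above, just with a finer, equally spaced grid (again with made-up data):

```python
import numpy as np

losses = np.random.default_rng(0).gamma(2.0, 0.16, size=9999)  # made-up data
# 0th, 10th, ..., 100th percentiles.
print(np.round(np.quantile(losses, np.linspace(0.0, 1.0, 11)), 4).tolist())
```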

Should we add some of this information to the table? At least the median?
And if we want this included in evaluate.py, let me know whether the output should be formatted differently.

@L-M-Sherlock
Member

L-M-Sherlock commented Jan 18, 2025

What about plotting the distribution of the metrics? Something like a box plot or a violin plot.
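For example, something along these lines with matplotlib (a sketch with made-up data, not the benchmark's actual plotting code):

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up per-user LogLoss values for the two models.
rng = np.random.default_rng(0)
gru_p = rng.gamma(2.0, 0.16, size=9999)
fsrs = rng.gamma(2.0, 0.163, size=9999)

fig, ax = plt.subplots()
ax.violinplot([gru_p, fsrs], showmedians=True)
ax.set_xticks([1, 2], labels=["GRU-P", "FSRS-5-recency"])
ax.set_ylabel("LogLoss")
plt.show()
```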

@1DWalker
Contributor Author

Sure, I'll work on that.

@Expertium
Contributor

Yeah, the table has too many numbers already. Distribution plots would be nicer.

@1DWalker
Contributor Author

GRU-P-short (blue) vs FSRS-5-recency on:

Log loss:
[violin plot image]

RMSE (bins):
[violin plot image]

I might look into plotting the per-user difference between the two models so that percentile trends are more apparent.
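Roughly this idea, assuming the metrics are paired by user (sketch with made-up data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up paired per-user LogLoss values.
rng = np.random.default_rng(0)
gru_p = rng.gamma(2.0, 0.16, size=9999)
fsrs = gru_p + rng.normal(0.0005, 0.02, size=9999)

# Sort the per-user differences and plot them against percentile rank,
# so it is easy to see where in the distribution each model wins.
diff = np.sort(fsrs - gru_p)
pct = np.linspace(0, 100, diff.size)
plt.plot(pct, diff)
plt.axhline(0, color="gray", linewidth=0.8)
plt.xlabel("percentile of users")
plt.ylabel("FSRS-5-recency LogLoss - GRU-P LogLoss")
plt.show()
```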

@Expertium
Contributor

Expertium commented Jan 18, 2025

Hell nah, not the violin plots. Just make histograms. Also, it's not clear which color represents which algorithm; you said it in the comment, but it should be labeled on the image itself.
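Something like overlaid, labeled histograms (sketch with made-up data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up per-user LogLoss values for the two models.
rng = np.random.default_rng(0)
gru_p = rng.gamma(2.0, 0.16, size=9999)
fsrs = rng.gamma(2.0, 0.163, size=9999)

bins = np.linspace(0.0, 1.0, 60)
plt.hist(gru_p, bins=bins, alpha=0.5, label="GRU-P")
plt.hist(fsrs, bins=bins, alpha=0.5, label="FSRS-5-recency")
plt.xlabel("LogLoss")
plt.ylabel("number of users")
plt.legend()  # label the algorithms on the image itself
plt.show()
```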

@1DWalker
Contributor Author

Yeah, don't worry about specifics like labels until a PR comes in.
