--reuse-media creates an unacceptably high chance of a collision #1231
Comments
An alternative suggestion to lengthening the hash would be to just put the attachment ID in the filename. I feel like at a certain point, a bunch of random base-32 characters doesn't really provide much utility over 19 digits (which would be guaranteed to never collide).
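A rough sketch of what that suggestion could look like. The helper name `id_file_name`, the IDs in the example URL, and the assumed URL layout (`/attachments/<channel_id>/<attachment_id>/<file name>`) are illustrative assumptions, not the exporter's actual code:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def id_file_name(url: str) -> str:
    """Hypothetical helper: build '<stem>-<attachment_id><ext>' from an attachment URL.

    Assumes the URL path is /attachments/<channel_id>/<attachment_id>/<file name>.
    """
    parts = PurePosixPath(urlsplit(url).path).parts
    # For a matching URL, parts is ('/', 'attachments', channel_id, attachment_id, file_name).
    if len(parts) != 5 or parts[1] != "attachments" or not parts[3].isdigit():
        raise ValueError(f"not an attachment URL: {url}")
    file_name = PurePosixPath(parts[4])
    return f"{file_name.stem}-{parts[3]}{file_name.suffix}"


# The IDs below are made up for the example.
example = id_file_name(
    "https://cdn.discordapp.com/attachments/111111111111111111/222222222222222222/unknown.png"
)
print(example)
```

Media that does not come from an attachment (for example embedded thumbnails) has no attachment ID, so a scheme like this would presumably still need a hash-based fallback for those URLs.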
Improve filenames for downloaded assets. Fixes Tyrrrz#1231
Pasted from #1232
Here's a set of images that end up with the same file names:
unknown-A4DB4.png:
unknown-F1191.png:
unknown-7DBD3.png:
Hmm, that's concerning, considering I use this on bigger Discord servers as well, where collisions would probably be likely.
Version
v2.42.8
Flavor
GUI (Graphical User Interface), CLI (Command-Line Interface)
Platform
Linux
Export format
HTML, TXT, JSON, CSV
Steps to reproduce
Export servers where large numbers of videos, images and especially screenshots are often posted. Observe that files named image.png or unknown.png are extremely common due to copy-paste of images directly into Discord, often leaving no file name available. Do the math to confirm that the birthday problem is at play and will be causing you problems.
Details
The algorithm for making file names for downloaded media only stores 20 bits (5 hexadecimal characters) of the hash of the URL. This is insufficient.
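For illustration, a minimal sketch of the kind of naming scheme being described, matching the `unknown-A4DB4.png` pattern seen in exported files. The choice of SHA-256, the helper name `hashed_file_name`, and the example URL are assumptions made for this sketch; the exporter's real code may differ:

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def hashed_file_name(url: str, hash_chars: int = 5) -> str:
    """Hypothetical helper: build '<stem>-<HASH><ext>' from a media URL.

    Assumes SHA-256 over the URL, truncated to `hash_chars` hex characters.
    5 hex characters = 20 bits = 16**5 = 1048576 possible values.
    """
    path = PurePosixPath(urlsplit(url).path)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest().upper()
    return f"{path.stem}-{digest[:hash_chars]}{path.suffix}"


# Placeholder URL, not a real asset.
name = hashed_file_name("https://cdn.example.com/attachments/1/2/unknown.png")
print(name)
```

Because the stem is kept as-is, every pasted `unknown.png` competes for the same 1048576 suffixes, which is why the distribution of file names matters.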
Despite there being 1048576 possible hashes, it only takes 1200 images for there to be a 50% chance of a collision and 500 images before there is a 10% chance of a collision due to the birthday problem. The original reasoning in the issue where this code was added (#395) is incorrect due to the commonness of unknown.png, image.png, and other filenames.
A list of all filenames I've seen that could reach that 10% chance of a collision in just my own exports of servers I'm in:
As can clearly be seen, file names are not evenly distributed. Notably, maxresdefault, sddefault and hqdefault are not being posted intentionally by users at all, and are instead the result of YouTube thumbnails. Similarly, tenor happens because the Tenor GIF button is pressed in Discord's own UI.
There is a near 100% chance that there are collisions here between unknown.png, image.png, and maxresdefault.jpg. You can check with this calculator: https://www.bdayprob.com/. Solve P(D,N) with D = 1048576 and N = 6637.
10 characters provides a better margin of safety: 15000 images with the same name before there is even a 0.01% chance, which would have successfully handled my use case.
10 characters of base-32 rather than base-16 would provide roughly 475000 images before there is the same chance of collision. 8 characters of base-32 would provide the same margin as the 10 characters of base-16.
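The birthday-problem figures in this issue can be rechecked without the online calculator. The sketch below uses the standard approximation P ≈ 1 − exp(−N(N−1)/(2D)) and its inverse; the function names are made up for this example, and the results are approximations rather than exact values:

```python
import math


def collision_probability(d: int, n: int) -> float:
    """Approximate chance of at least one collision among n items over d hash values."""
    # Birthday approximation: 1 - exp(-n(n-1) / (2d)); expm1 keeps precision for tiny values.
    return -math.expm1(-n * (n - 1) / (2 * d))


def items_for_probability(d: int, p: float) -> int:
    """Approximate number of items at which the collision chance reaches p."""
    # Inverse of the approximation above: n ~ sqrt(2d * ln(1 / (1 - p))).
    return math.ceil(math.sqrt(2 * d * -math.log1p(-p)))


# P(D, N) for the case discussed above: D = 1048576, N = 6637.
print(collision_probability(16**5, 6637))

# Items needed for a 50% and a 0.01% collision chance at each suffix length.
for label, d in [
    ("5 hex chars", 16**5),
    ("10 hex chars", 16**10),
    ("8 base-32 chars", 32**8),
    ("10 base-32 chars", 32**10),
]:
    print(label, items_for_probability(d, 0.5), items_for_probability(d, 0.0001))
```

Note that 32**8 and 16**10 are both 2**40, which is why 8 base-32 characters give the same margin as 10 base-16 characters.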