Benchmark / Performance Testing #2
@rickynils provided some promising numbers at https://discourse.nixos.org/t/nix-casync-a-more-efficient-way-to-store-and-substitute-nix-store-paths/16539/3. #35 brings configurable chunking size (and some more concurrency), so maybe we can re-run this with some slightly different chunk sizes?
@flokli I can re-run the nix-casync ingestion on the same data set I used before to see if any storage gains can be made by using different chunk sizes. It probably takes a couple of days running it, so I can't test with a large number of chunk sizes. What sizes would you like me to try out?
@flokli Btw, you asked for the distribution of chunk sizes. I've compiled a csv file in this format:
The first column is the compressed chunk size (in bytes), the second column the uncompressed size. There might be minor rounding errors in the byte counts, since these numbers were derived from floating-point kilobyte numbers outputted by

In my post on discourse I stated that the sum of the compressed chunks was 1223264 MB, but if you sum up the sizes from my csv file you actually get 1044006 MB (15% less). This is because the number in the discourse post includes disk/fs overhead (block alignment etc.), but the numbers in the csv file are the "raw" chunk sizes.

The complete csv file is 1.4 GB zstd-compressed. I haven't done any analysis on it, but I've pushed it to my Cachix cache named
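For reference, here is a minimal sketch of summing up such a file, assuming the layout described above (two comma-separated integer columns per chunk: compressed bytes, then uncompressed bytes) and that the decompressed CSV is piped in on stdin; the filename `chunk-sizes.csv.zst` is a hypothetical placeholder:

```python
#!/usr/bin/env python3
# sum_chunks.py -- sum up the chunk-size CSV described above.
# Assumed layout: one chunk per line, "compressed_bytes,uncompressed_bytes".
# Read the decompressed CSV from stdin, e.g.:
#   zstdcat chunk-sizes.csv.zst | python3 sum_chunks.py
import csv
import sys

chunks = 0
compressed_total = 0
uncompressed_total = 0

for row in csv.reader(sys.stdin):
    if not row:
        continue
    chunks += 1
    compressed_total += int(row[0])
    uncompressed_total += int(row[1])

# "MB" here means 10^6 bytes, which may differ from how other tools report sizes.
print(f"chunks:             {chunks}")
print(f"compressed total:   {compressed_total / 1e6:.0f} MB")
print(f"uncompressed total: {uncompressed_total / 1e6:.0f} MB")
```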
Thanks! Let's do some small analysis before trying other chunk sizes. Some things that'd be good to know:
If someone beats me to producing this report (ideally as a script that can easily be run against other similar experiments too), I wouldn't mind ;-)
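Not claiming this is that report, but a sketch of what such a script could look like, under the same assumptions as above (two comma-separated integer columns, decompressed CSV on stdin): it prints totals, the overall compression ratio, and a few nearest-rank percentiles of the chunk-size distribution. Note it keeps all sizes in memory, which may be a lot for a CSV of this scale:

```python
#!/usr/bin/env python3
# chunk_report.py -- small distribution report over the chunk-size CSV.
# Usage (hypothetical filename): zstdcat chunk-sizes.csv.zst | python3 chunk_report.py
# Assumed layout: one chunk per line, "compressed_bytes,uncompressed_bytes".
import csv
import sys

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list."""
    idx = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[idx]

compressed = []
uncompressed = []
for row in csv.reader(sys.stdin):
    if not row:
        continue
    compressed.append(int(row[0]))
    uncompressed.append(int(row[1]))

compressed.sort()
uncompressed.sort()
c_total = sum(compressed)
u_total = sum(uncompressed)

print(f"chunks:             {len(compressed)}")
print(f"uncompressed total: {u_total / 1e6:.0f} MB")
print(f"compressed total:   {c_total / 1e6:.0f} MB")
print(f"compression ratio:  {u_total / c_total:.2f}")

for label, values in (("compressed", compressed), ("uncompressed", uncompressed)):
    points = ", ".join(
        f"p{p}={percentile(values, p) / 1024:.1f} KiB" for p in (10, 50, 90, 99)
    )
    print(f"{label:12} min={values[0] / 1024:.1f} KiB, {points}, "
          f"max={values[-1] / 1024:.1f} KiB")
```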
Looking at the total sum of uncompressed vs. compressed (raw bytes) chunk sizes, the compression ratio lands at
The chunks in the csv file were produced by
A friend of mine, not very active on GitHub, made a quick analysis:
We should play with casync chunk size parameters, and see how well it deduplicates.
We should also play with CPU utilization and parallelism.
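For the chunk-size part, a rough sweep could look like the sketch below. It assumes `casync` is on the PATH and that `casync make` accepts `--store=` and `--chunk-size=` as documented in its man page (verify the exact syntax for your version); the input directory and the store/index paths are hypothetical placeholders, and it does not cover the parallelism/CPU-utilization side:

```python
#!/usr/bin/env python3
# sweep_chunk_sizes.py -- build one casync chunk store per average chunk size
# and compare how much space each store ends up using on disk.
# Assumed invocation (check `man casync` for your version):
#   casync make --store=STORE --chunk-size=AVG INDEX DIRECTORY
# INPUT_DIR and the store/index paths are hypothetical placeholders.
import subprocess
from pathlib import Path

INPUT_DIR = Path("/path/to/ingested/tree")   # hypothetical input tree
CHUNK_SIZES = [16384, 65536, 262144]         # average chunk sizes to try, in bytes

def store_size(path: Path) -> int:
    """Sum of the sizes of all chunk files below a .castr store, in bytes."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

for avg in CHUNK_SIZES:
    store = Path(f"store-{avg}.castr")
    index = Path(f"index-{avg}.caidx")
    subprocess.run(
        [
            "casync", "make",
            f"--store={store}",
            f"--chunk-size={avg}",
            str(index), str(INPUT_DIR),
        ],
        check=True,
    )
    print(f"avg chunk size {avg}: store uses {store_size(store) / 1e9:.2f} GB")
```

Deduplication would show up as the store growing more slowly than the sum of the ingested paths; CPU utilization and parallelism would need separate measurement around the same invocations.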