This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

Benchmark / Performance Testing #2

Open
flokli opened this issue Dec 10, 2021 · 7 comments

Comments

@flokli
Owner

flokli commented Dec 10, 2021

We should play with casync chunk size parameters, and see how well it deduplicates.

We should also play with CPU utilization and parallelism.

@flokli
Owner Author

flokli commented Jan 13, 2022

@rickynils provided some promising numbers at https://discourse.nixos.org/t/nix-casync-a-more-efficient-way-to-store-and-substitute-nix-store-paths/16539/3.

#35 brings configurable chunking size (and some more concurrency), so maybe we can re-run this with some slightly different chunk sizes?

@rickynils

@flokli I can re-run the nix-casync ingestion on the same data set I used before, to see if any storage gains can be made by using different chunk sizes. A single run probably takes a couple of days, so I can't test a large number of chunk sizes. Which sizes would you like me to try?

@rickynils

@flokli Btw, you asked for the distribution of chunk sizes. I've compiled a csv file in this format:

13875,40468,/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk
36751,79176,/rpool/casync-test/castr/0004/0004203db6082841ca81798bef756a01e65220b8a2d4b62bc8b7f648603d6af7.cacnk
9943,99748,/rpool/casync-test/castr/0004/000412d9ce0935f8cf074dcf676550512548f4874e4039bf9ec8649b7a6365d2.cacnk
18596,160993,/rpool/casync-test/castr/0004/000410c681b2016db11262973734d3981bea5ecabdafbd947d9fbafccdfca73c.cacnk
7240,25549,/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk

The first column is the compressed chunk size (in bytes), the second column is the uncompressed size. There might be minor rounding errors in the byte counts, since these numbers were derived from the floating-point kilobyte numbers output by zstd --list.

In my post on discourse I stated that the sum of the compressed chunks was 1223264 MB, but if you sum up the sizes from my csv file you actually get 1044006 MB (15% less). This is because the number in the discourse post includes disk/fs overhead (block alignment etc), but the numbers in the csv file are the "raw" chunk sizes.
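
The "15% less" figure can be checked with a few lines of Python. This is only a sanity check on the two totals quoted above; the variable and function names are made up for illustration.

```python
# Totals quoted in this comment, in MB.
on_disk_mb = 1223264  # sum from the Discourse post, includes disk/fs overhead
raw_mb = 1044006      # sum of the compressed sizes in the csv file


def relative_saving(larger, smaller):
    """Return how much smaller `smaller` is, as a fraction of `larger`."""
    return (larger - smaller) / larger


overhead_fraction = relative_saving(on_disk_mb, raw_mb)
print(f"{overhead_fraction:.1%}")  # prints 14.7%
```

So roughly 15% of the on-disk total is attributed to block alignment and other filesystem overhead, not to chunk data.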

The complete csv file is 1.4 GB when zstd-compressed. I haven't done any analysis on it, but I've pushed it to my Cachix cache named rickynils. You can fetch it by doing:

nix-store -r /nix/store/2pr5achd242cna6qfk086qy0ffxgsyv2-cacnk-sizes.csv.zstd

@flokli
Owner Author

flokli commented Jan 14, 2022

Thanks! Let's do some small analysis before trying other chunk sizes.

Some things that'd be good to know:

  • A histogram of the distribution of the (uncompressed) chunk sizes. Do we often end up at the maximum or minimum chunk size? That would give some insight into how we want to tune those numbers. It might be interesting to compare that distribution with other chunk sizes (but let's wait until we have seen that data).
  • How effective is the compression (compressed vs. uncompressed chunk size)? We might want to compress chunks more aggressively.
  • An analysis of the "bucketing" folder structure. Right now, every chunk that shares the same first 4 characters gets put in the same folder. How many elements ended up in each of those directories, and what would it look like if we used only the first two, for example?

If someone beats me to producing this report (ideally as a script that can easily be run against other similar experiments too), I wouldn't mind ;-)
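
As a starting point, a minimal Python sketch of such a report could look like the following. It assumes the csv has already been decompressed (for example with zstd -d) and follows the three-column format shown earlier in this thread. The function names, the 16 KiB bin width and the embedded sample are illustrative choices, not anything nix-casync provides; the five sample rows are the ones quoted above and are far too few to draw conclusions from.

```python
import csv
import io
import os
from collections import Counter

# The five sample rows quoted earlier in this thread (not representative).
SAMPLE = """\
13875,40468,/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk
36751,79176,/rpool/casync-test/castr/0004/0004203db6082841ca81798bef756a01e65220b8a2d4b62bc8b7f648603d6af7.cacnk
9943,99748,/rpool/casync-test/castr/0004/000412d9ce0935f8cf074dcf676550512548f4874e4039bf9ec8649b7a6365d2.cacnk
18596,160993,/rpool/casync-test/castr/0004/000410c681b2016db11262973734d3981bea5ecabdafbd947d9fbafccdfca73c.cacnk
7240,25549,/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk
"""


def read_rows(fileobj):
    """Yield (compressed_bytes, uncompressed_bytes, chunk_id) for each csv row."""
    for compressed, uncompressed, path in csv.reader(fileobj):
        # The chunk id is the file name without the .cacnk extension.
        chunk_id = os.path.basename(path).split(".")[0]
        yield int(compressed), int(uncompressed), chunk_id


def size_histogram(sizes, bin_width=16 * 1024):
    """Count sizes per fixed-width bin, keyed by the lower bound of the bin."""
    return Counter((size // bin_width) * bin_width for size in sizes)


def directory_fanout(chunk_ids, prefix_len):
    """Count chunks per directory if directories were named by the first prefix_len characters."""
    return Counter(chunk_id[:prefix_len] for chunk_id in chunk_ids)


rows = list(read_rows(io.StringIO(SAMPLE)))
hist = size_histogram(uncompressed for _, uncompressed, _ in rows)
fanout4 = directory_fanout((chunk_id for _, _, chunk_id in rows), 4)
fanout2 = directory_fanout((chunk_id for _, _, chunk_id in rows), 2)

for lower_bound in sorted(hist):
    print(f"{lower_bound:>8} B and up: {hist[lower_bound]} chunk(s)")
print("4-char prefixes:", dict(fanout4))
print("2-char prefixes:", dict(fanout2))
```

For the full data set, read_rows can be given an open file handle instead of the in-memory sample, and comparing the most common bins against the configured minimum and maximum chunk sizes would answer the first question in the list.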

@rickynils

Looking at the total sum of uncompressed vs. compressed (raw bytes) chunk sizes, the compression ratio comes out at 2.38. This is roughly in line with the ZFS zstd compression ratio of 2.69 on the same data set. I don't know whether the default zstd settings used by nix-casync and by ZFS differ.
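
For reference, a ratio of this kind is total uncompressed bytes divided by total compressed bytes, which is not the same as averaging per-chunk ratios. A small sketch, using only the five sample rows from the csv excerpt earlier in the thread (so the result is nowhere near the 2.38 measured on the full data set); the names are illustrative:

```python
# (compressed_bytes, uncompressed_bytes) for the five sample rows quoted earlier.
sample_chunks = [
    (13875, 40468),
    (36751, 79176),
    (9943, 99748),
    (18596, 160993),
    (7240, 25549),
]

total_compressed = sum(compressed for compressed, _ in sample_chunks)
total_uncompressed = sum(uncompressed for _, uncompressed in sample_chunks)

# Ratio of the sums: what the storage backend actually saves overall.
total_ratio = total_uncompressed / total_compressed

# Mean of the per-chunk ratios: weights every chunk equally, regardless of size.
mean_chunk_ratio = sum(u / c for c, u in sample_chunks) / len(sample_chunks)

print(f"ratio of totals:       {total_ratio:.2f}")
print(f"mean per-chunk ratio:  {mean_chunk_ratio:.2f}")
```

The ratio of totals is the figure that is comparable to what ZFS reports, since large chunks dominate the storage used.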

@rickynils

The chunks in the csv file were produced by nix-casync revision 25eb0e5

@flokli
Owner Author

flokli commented Jan 15, 2022

A friend of mine who is not very active on GitHub made a quick analysis:

nix-casync chunk size analysis.pdf
