Benchmark / Performance Testing #2

flokli · 2021-12-10T13:53:54Z

We should play with casync chunk size parameters, and see how well it deduplicates.

We should also play with CPU utilization and parallelism.

flokli · 2022-01-13T23:57:28Z

@rickynils provided some promising numbers at https://discourse.nixos.org/t/nix-casync-a-more-efficient-way-to-store-and-substitute-nix-store-paths/16539/3.

#35 brings configurable chunking size (and some more concurrency), so maybe we can re-run this with some slightly different chunk sizes?

rickynils · 2022-01-14T08:24:48Z

@flokli I can re-run the nix-casync ingestion on the same data set I used before to see if any storage gains can be made by using different chunk sizes. A run will probably take a couple of days, so I can't test a large number of chunk sizes. What sizes would you like me to try out?

rickynils · 2022-01-14T09:38:31Z

@flokli Btw, you asked for the distribution of chunk sizes. I've compiled a csv file in this format:

13875,40468,/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk
36751,79176,/rpool/casync-test/castr/0004/0004203db6082841ca81798bef756a01e65220b8a2d4b62bc8b7f648603d6af7.cacnk
9943,99748,/rpool/casync-test/castr/0004/000412d9ce0935f8cf074dcf676550512548f4874e4039bf9ec8649b7a6365d2.cacnk
18596,160993,/rpool/casync-test/castr/0004/000410c681b2016db11262973734d3981bea5ecabdafbd947d9fbafccdfca73c.cacnk
7240,25549,/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk

The first column is the compressed chunk size (in bytes), the second column the uncompressed size. There might be minor rounding errors in the byte counts, since these numbers were derived from the floating-point kilobyte numbers output by zstd --list.
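For anyone who wants to script against this file, here is a minimal parsing sketch. It assumes the three-column layout described above (compressed bytes, uncompressed bytes, chunk path) with no header row; `ChunkRow` and `parse_rows` are names made up for illustration and are not part of nix-casync.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class ChunkRow(NamedTuple):
    compressed: int    # compressed chunk size in bytes (column 1)
    uncompressed: int  # uncompressed chunk size in bytes (column 2)
    path: str          # path of the .cacnk file (column 3)


def parse_rows(lines: Iterable[str]) -> Iterator[ChunkRow]:
    """Yield one ChunkRow per non-empty csv line."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        compressed, uncompressed, path = row
        yield ChunkRow(int(compressed), int(uncompressed), path)


# Two of the sample rows quoted above.
sample = [
    "13875,40468,/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk",
    "7240,25549,/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk",
]
rows = list(parse_rows(sample))
```

Since `parse_rows` takes any iterable of lines, it should also work on an open file handle of the decompressed csv, so the whole file never has to be held in memory.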

In my post on discourse I stated that the sum of the compressed chunks was 1223264 MB, but if you sum up the sizes from my csv file you actually get 1044006 MB (15% less). This is because the number in the discourse post includes disk/fs overhead (block alignment etc), but the numbers in the csv file are the "raw" chunk sizes.
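The "15% less" figure can be checked from the two totals quoted above. The variable names are only for illustration:

```python
on_disk_mb = 1223264  # sum from the discourse post, includes disk/fs overhead
raw_mb = 1044006      # sum of the raw chunk sizes in the csv file

# Fraction of the on-disk number that is overhead rather than chunk data.
overhead_fraction = 1 - raw_mb / on_disk_mb
print(f"{overhead_fraction:.1%}")  # about 14.7%, i.e. roughly 15%
```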

The complete csv file is 1.4 GB zstd-compressed. I haven't done any analysis on it, but I've pushed it to my Cachix cache named rickynils. You can fetch it by running:

nix-store -r /nix/store/2pr5achd242cna6qfk086qy0ffxgsyv2-cacnk-sizes.csv.zstd

flokli · 2022-01-14T12:20:26Z

Thanks! Let's do some small analysis before trying other chunk sizes.

Some things that'd be good to know:

A histogram of the distribution of the (uncompressed) chunk sizes. Do we often end up at the maximum or minimum chunk size? That would give some insight into how we want to tune those numbers. It might be interesting to compare that distribution with other chunking sizes (but let's wait until we've seen that data).
How effective is the compression (compressed vs. uncompressed chunk size)? We might want to compress chunks more aggressively.
An analysis of the "bucketing" folder structure. Right now, every chunk that shares the same first 4 characters gets put in the same folder. How many elements ended up in each of those directories, and how would it look if we used only the first two, for example?

If someone beats me with producing this report (ideally in a script that can be easily run against other similar experiments too), I wouldn't mind ;-)
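As a starting point for such a script, here is a rough sketch of the histogram and bucketing questions. It assumes the csv layout described earlier in this thread and that the bucket directory is named after the leading characters of the chunk file name (as in the sample paths). The power-of-two binning and the function names `size_histogram` and `bucket_counts` are arbitrary choices for illustration, not anything nix-casync provides.

```python
from collections import Counter
from pathlib import PurePosixPath


def size_histogram(sizes):
    """Count sizes per power-of-two bin, keyed by the bin's lower bound in bytes."""
    hist = Counter()
    for size in sizes:
        # Largest power of two that is <= size (0 for empty chunks).
        lower = 1 << (size.bit_length() - 1) if size > 0 else 0
        hist[lower] += 1
    return dict(sorted(hist.items()))


def bucket_counts(paths, prefix_len):
    """Count chunks per bucket if buckets used the first prefix_len characters
    of the chunk file name."""
    return Counter(PurePosixPath(p).name[:prefix_len] for p in paths)


# (uncompressed size, path) taken from the sample rows quoted in this thread.
sample = [
    (40468, "/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk"),
    (79176, "/rpool/casync-test/castr/0004/0004203db6082841ca81798bef756a01e65220b8a2d4b62bc8b7f648603d6af7.cacnk"),
    (99748, "/rpool/casync-test/castr/0004/000412d9ce0935f8cf074dcf676550512548f4874e4039bf9ec8649b7a6365d2.cacnk"),
    (160993, "/rpool/casync-test/castr/0004/000410c681b2016db11262973734d3981bea5ecabdafbd947d9fbafccdfca73c.cacnk"),
    (25549, "/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk"),
]

hist = size_histogram(size for size, _ in sample)
by_four = bucket_counts((path for _, path in sample), 4)
by_two = bucket_counts((path for _, path in sample), 2)
```

Spikes in the histogram at the configured minimum or maximum chunk size would answer the first question. Comparing `len(by_four)` and `max(by_four.values())` with the same numbers for `by_two` gives a feel for how full the directories get with shorter prefixes.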

rickynils · 2022-01-14T12:38:28Z

Looking at the total sum of uncompressed vs compressed (raw bytes) chunk sizes, the compression ratio lands on 2.38. This is roughly in line with the ZFS zstd compression ratio of 2.69 on the same data set. I don't know if the default zstd settings used by nix-casync and ZFS differ.
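The ratio here is total uncompressed bytes divided by total compressed bytes, not an average of per-chunk ratios. A minimal sketch of that calculation on the five sample rows posted earlier (the function name `compression_ratio` is made up for illustration):

```python
def compression_ratio(rows):
    """Ratio of total uncompressed bytes to total compressed bytes.

    rows is an iterable of (compressed_bytes, uncompressed_bytes) pairs.
    """
    total_compressed = 0
    total_uncompressed = 0
    for compressed, uncompressed in rows:
        total_compressed += compressed
        total_uncompressed += uncompressed
    return total_uncompressed / total_compressed


# (compressed, uncompressed) from the five sample csv rows.
sample = [
    (13875, 40468),
    (36751, 79176),
    (9943, 99748),
    (18596, 160993),
    (7240, 25549),
]
ratio = compression_ratio(sample)
```

Five rows are of course not representative of the whole data set, so the value for this sample differs from the 2.38 measured over all chunks.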

rickynils · 2022-01-14T12:44:52Z

The chunks in the csv file were produced by nix-casync revision 25eb0e5.

flokli · 2022-01-15T21:31:15Z

A friend of mine, who is not very active on GitHub, made a quick analysis:

nix-casync chunk size analysis.pdf

flokli mentioned this issue Jan 19, 2022

Strip nix store references from chunks and store them in metadata instead #37

Open
