Deduplication Strategies #5

blaggacao · 2023-01-02T01:43:18Z

As proposed there by @wamserma and also recently by @wmertens in discourse.

Edit: further references of this idea in next comment down below.

Before casync or any generic dedup strategy

... we can likely exploit the (presumed & yet unmetered) property that a lot of re -builds may not actually change a lot of bytes, but the store path references to other dependencies, and thus creating an almost identical new store entry.

Possible Solution

When zeroing (/nix/store/0000000...) these store references creates identical artifacts, we have a potent initial and cheap deduplication exclusive to store based systems.

The true references would need storing via some metadata schema and cheap substitution on-the-fly when serving a given concrete store entry.

Quantification Necessary

The purpose of this issue is to anchor this strategy in the context of attic, since the roadmap states potential future use of deduplication strategies, specifically mentioning casync.

Unfortunately, I can't provide any estimates or quantification of the benefits other than the qualitative argument that this can be an interesting and cheap approach to explore before more advanced dedup strategies may be considered.

Still, I think it is worth mentioning in this context.

The text was updated successfully, but these errors were encountered:

blaggacao · 2023-01-02T01:55:44Z

Further references

The spongix nix cache implementation already uses desync (which has an interesting readme!), a casync implementation in go.

For interested parties, here is the introduction blog post for casync, which explains the algorithm: https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

And the really deep dive is here: https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/

Stripping of store references has also been discussed in the context of nix-casync.

And the introductory blog post is also a good entrypoint from a Nix perspective.

zhaofengli · 2023-01-15T22:17:53Z

FastCDC-based chunking has been added in e8f9f3c ¹. In this new model, NARs are backed by a sequence of content-addressed chunks in the Global Chunk Store. Newly-uploaded NARs will be split into chunks with FastCDC and only new chunks will be uploaded to the storage backend. NARs that have existed prior to chunking will be converted to have a single chunk.

For example, this works reasonably well for huge unfree paths that are rebuilt without version change (e.g., Zoom and VSCode), even with a very large chunk size. I have some simple numbers on here and will post more later. It also works even if there are some differences (e.g., llvm-13.0.0-lib -> llvm-13.0.1-lib). The default uses 128 KiB as the average chunk size, and I will add atticadm test-chunking so chunk sizes can be easily fine-tuned.

I'm leaving this issue open so other approaches can be explored.

Some relevant FAQs:

Why chunk NARs instead of individual files?

In the current design, chunking is applied to the entire uncompressed NAR file instead of individual constituent files in the NAR. Big NARs that benefit the most from chunk-based deduplication (e.g., VSCode, Zoom) often have hundreds or thousands of small files. During NAR reassembly, it's often uneconomical or impractical to fetch thousands of files to reconstruct the NAR in a scalable way. By chunking the entire NAR, it's possible to configure the average chunk size to a larger value, ignoring file boundaries and lumping small files together. This is also the approach casync has taken.

You may have heard that the Tvix store protocol chunks individual files instead of the NAR. The design of Attic is driven by the desire to effectively utilize existing platforms with practical limitations², while looking forward to the future.

Why not just erase store path references?

At first glance, erasing store path references and storing them separately seems easy but it's actually difficult to do in practice. It makes NAR assembly difficult (not a simple concatenation of chunks anymore) and the immediate benefits from doing so alone are minimal. Many files have other differences, like a minor version upgrade or .note.gnu.build-id. The files that this approach works with (mostly small paths like configurations) don't have much overhead to begin with. Therefore I opted for the approach that does provide considerable gains and is simple to implement to start with.

What happens if a chunk is corrupt/missing?

When a chunk is deleted from the database, all dependent .narinfo and .nar will become unavailable (503). However, this can be recovered from automatically when any NAR containing the chunk is uploaded.

At the moment, Attic cannot automatically detect when a chunk is corrupt or missing. Correctly distinguishing between transient and persistent failures is difficult. The atticadm utility will have the functionality to kill/delete bad chunks.

Apologies for the big code dump commit - the commit history is pretty chaotic in the private WIP branch. It basically remodelled the entire storage model. ↩
In more concrete terms, I want to use Cloudflare Workers to reassemble NARs for the sweet, sweet free egress 😃 ↩

shimunn · 2023-12-17T13:10:44Z

Wouldn't be simpler to put each store reference into it's own chunk? Doing so would only add complexity to the chunker but not to the assembler.

zhaofengli mentioned this issue Jan 2, 2023

Native Nix store implementation / standardized REST API #7

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deduplication Strategies #5

Deduplication Strategies #5

blaggacao commented Jan 2, 2023 •

edited

Loading

blaggacao commented Jan 2, 2023 •

edited

Loading

zhaofengli commented Jan 15, 2023

shimunn commented Dec 17, 2023

Deduplication Strategies #5

Deduplication Strategies #5

Comments

blaggacao commented Jan 2, 2023 • edited Loading

blaggacao commented Jan 2, 2023 • edited Loading

zhaofengli commented Jan 15, 2023

Why chunk NARs instead of individual files?

Why not just erase store path references?

What happens if a chunk is corrupt/missing?

Footnotes

shimunn commented Dec 17, 2023

blaggacao commented Jan 2, 2023 •

edited

Loading

blaggacao commented Jan 2, 2023 •

edited

Loading