Deduplication Strategies #5
Further references: for interested parties, there is an introduction blog post, and the really deep dive is here: https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/. Stripping of store references has also been discussed in other contexts, and the introductory blog post is also a good entrypoint from a Nix perspective.
FastCDC-based chunking has been added in e8f9f3c [1]. In this new model, NARs are backed by a sequence of content-addressed chunks in the Global Chunk Store. Newly-uploaded NARs are split into chunks with FastCDC, and only new chunks are uploaded to the storage backend. NARs that existed prior to chunking are converted to have a single chunk.

This works reasonably well for huge unfree paths that are rebuilt without a version change (e.g., Zoom and VSCode), even with a very large chunk size. I have some simple numbers here and will post more later. It also works even if there are some differences between builds. I'm leaving this issue open so other approaches can be explored.

Some relevant FAQs:

**Why chunk NARs instead of individual files?**

In the current design, chunking is applied to the entire uncompressed NAR file instead of the individual constituent files in the NAR. Big NARs that benefit the most from chunk-based deduplication (e.g., VSCode, Zoom) often have hundreds or thousands of small files. During NAR reassembly, it's often uneconomical or impractical to fetch thousands of files to reconstruct the NAR in a scalable way. By chunking the entire NAR, it's possible to configure the average chunk size to a larger value, ignoring file boundaries and lumping small files together.

You may have heard that the Tvix store protocol chunks individual files instead of the NAR. The design of Attic is driven by the desire to effectively utilize existing platforms with practical limitations [2], while looking forward to the future.

**Why not just erase store path references?**

At first glance, erasing store path references and storing them separately seems easy, but it's actually difficult to do in practice. It makes NAR assembly harder (no longer a simple concatenation of chunks), and the immediate benefits from doing so alone are minimal.
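The chunking model described above can be sketched as follows. This is a minimal content-defined chunker using a gear-style rolling hash (the core idea behind FastCDC), not Attic's actual implementation; the size parameters are purely illustrative.

```python
import random

# Gear table of 256 pseudo-random 64-bit values (fixed seed => deterministic).
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]

MIN_SIZE, AVG_MASK, MAX_SIZE = 2048, (1 << 13) - 1, 65536  # illustrative sizes

def chunk(data: bytes) -> list:
    """Split data at content-defined boundaries with a gear rolling hash."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # The left-shift pushes old bytes out of the hash, so it depends
        # only on the last ~64 bytes; boundaries therefore resynchronize
        # shortly after a local edit -- the key deduplication property.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i + 1 - start
        if (length >= MIN_SIZE and (h & AVG_MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Reassembly is a plain concatenation of the chunks, and identical regions in two NARs produce identical chunks that only need to be stored once.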
Many files have other differences, like a minor version upgrade.

**What happens if a chunk is corrupt/missing?**

When a chunk is deleted from the database, all dependent NARs become unavailable. At the moment, Attic cannot automatically detect when a chunk is corrupt or missing. Correctly distinguishing between transient and persistent failures is difficult.
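To illustrate the "only new chunks are uploaded" bookkeeping described in the comment above, here is a toy content-addressed chunk store. The names (`ChunkStore`, `upload_nar`, `assemble_nar`) are hypothetical; this is a sketch of the deduplication idea, not Attic's schema.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed chunk store: upload a chunk only if unseen."""
    def __init__(self):
        self.chunks = {}   # sha256 hex digest -> chunk bytes
        self.uploads = 0   # counts actual uploads, i.e. unique chunks

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.chunks:   # duplicate chunks cost nothing
            self.chunks[digest] = chunk
            self.uploads += 1
        return digest

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]

def upload_nar(store: ChunkStore, chunks: list) -> list:
    """Store each chunk, returning the digest list needed for reassembly."""
    return [store.put(c) for c in chunks]

def assemble_nar(store: ChunkStore, digests: list) -> bytes:
    """NAR reassembly is a plain concatenation of the referenced chunks."""
    return b"".join(store.get(d) for d in digests)
```

Two NARs that share a chunk only pay for it once; deleting a chunk from the store breaks reassembly of every NAR whose digest list references it, which is why chunk loss affects all dependent NARs.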
Wouldn't it be simpler to put each store reference into its own chunk? Doing so would only add complexity to the chunker, but not to the assembler.
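That suggestion can be sketched like this: a chunker that cuts the input so each store reference lands in its own chunk, while the assembler stays a plain concatenation. The helper is hypothetical, assuming references look like `/nix/store/` plus a 32-character digest in the Nix base32 alphabet.

```python
import re

# 32-char digest in the Nix base32 alphabet (which omits e, o, u, t).
STORE_RE = re.compile(rb"/nix/store/[0-9a-df-np-sv-z]{32}")

def split_at_refs(blob: bytes) -> list:
    """Cut blob so each store reference becomes its own chunk."""
    chunks, last = [], 0
    for m in STORE_RE.finditer(blob):
        if m.start() > last:
            chunks.append(blob[last:m.start()])  # bytes between references
        chunks.append(m.group(0))                # the reference itself
        last = m.end()
    if last < len(blob):
        chunks.append(blob[last:])
    return chunks
```

Reference chunks would then deduplicate across all store entries that mention the same dependency, and reassembly remains `b"".join(chunks)`.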
Reference: NixOS/nixpkgs#89380
As proposed there by @wamserma, and also recently by @wmertens on Discourse.
Edit: further references for this idea are in the next comment below.
Before
casync
or any generic dedup strategy... we can likely exploit the (presumed and as-yet unmeasured) property that many rebuilds may not actually change many bytes, only the store path references to other dependencies, thus creating an almost identical new store entry.
Possible Solution
When zeroing out (
/nix/store/0000000...
) these store references creates identical artifacts, we have a potent, initial, and cheap deduplication step exclusive to store-based systems. The true references would need to be stored via some metadata schema and cheaply substituted on the fly when serving a given concrete store entry.
Quantification Necessary
The purpose of this issue is to anchor this strategy in the context of attic, since the roadmap states potential future use of deduplication strategies, specifically mentioning
casync
. Unfortunately, I can't provide any estimates or quantification of the benefits beyond the qualitative argument that this can be an interesting and cheap approach to explore before more advanced dedup strategies are considered.
Still, I think it is worth mentioning in this context.