This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

Strip nix store references from chunks and store them in metadata instead #37

Open
Atemu opened this issue Jan 18, 2022 · 5 comments

Comments


Atemu commented Jan 18, 2022

This is an optimisation idea I had while reading your blog post:

Strip all Nix store references from the actual chunks and put them into per-path metadata instead.

This way, two paths made from the same derivation that are distinct because of an inconsequential stdenv rebuild but (except for the references) hold exactly the same data, would still dedup against each other.

What this could look like:

/nix/store/aa...-test:

This is a file that references /nix/store/oo...-foo and /nix/store/OO...-bar

becomes chunk C:

This is a file that references /nix/store/00... and /nix/store/00...

and aa....meta now contains:

Chunks:
C

References:
/nix/store/oo...-foo
/nix/store/OO...-bar

which nix-casync can then use to recreate the store path, substituting the references into the placeholders in sequential order.
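The strip-and-restore round trip described above could be sketched roughly as follows. This is only an illustration, not how nix-casync works: the function names are made up, the hash pattern is a simplified stand-in for real store path hashes, and only the 32-character hash part is replaced (the `-name` suffix stays in place), which is an assumption on top of the example above.

```python
import re

STORE_DIR = b"/nix/store/"
HASH_LEN = 32
# Assumption: placeholder is an all-zero hash of the same length,
# so stripping never changes the size of the data.
PLACEHOLDER = b"0" * HASH_LEN

# Simplified pattern: any 32 lowercase alphanumeric characters after the store dir.
REF_RE = re.compile(re.escape(STORE_DIR) + rb"([0-9a-z]{32})")
PLACEHOLDER_RE = re.compile(re.escape(STORE_DIR + PLACEHOLDER))


def strip_references(data: bytes) -> tuple[bytes, list[bytes]]:
    """Replace every store path hash with the placeholder.

    Returns the stripped data and the original hashes in order of appearance.
    """
    refs: list[bytes] = []

    def replace(match: re.Match) -> bytes:
        refs.append(match.group(1))
        return STORE_DIR + PLACEHOLDER

    return REF_RE.sub(replace, data), refs


def restore_references(stripped: bytes, refs: list[bytes]) -> bytes:
    """Substitute the recorded hashes back into the placeholders, in order."""
    remaining = iter(refs)
    return PLACEHOLDER_RE.sub(lambda m: STORE_DIR + next(remaining), stripped)


# Hypothetical hashes, for illustration only.
original = (
    b"This is a file that references /nix/store/" + b"a" * 32 + b"-foo"
    b" and /nix/store/" + b"b" * 32 + b"-bar"
)
stripped, refs = strip_references(original)
rebuilt = restore_references(stripped, refs)
```

Keeping the placeholder the same length as the hash means offsets inside the file do not shift, which matters for binaries that embed store paths.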

If there were an inconsequential stdenv rebuild (a comment in a string somewhere, for example), the same derivation might evaluate to /nix/store/bb...-test:

This is a file that references /nix/store/dd...-foo and /nix/store/DD...-bar

After stripping, its contents reduce to the same chunk C, so the metadata would look like this:

Chunks:
C

References:
/nix/store/dd...-foo
/nix/store/DD...-bar

As you can see, chunk C is reused.
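To make the reuse concrete, here is a small sketch of a content-addressed chunk store with per-path metadata. Everything in it is hypothetical: the `ingest` helper, the in-memory dictionaries, the path labels and the hashes are made up, SHA-256 stands in for whatever chunk ID a real store would use, and each path is treated as a single chunk instead of being run through a real chunker.

```python
import hashlib
import re

# Simplified pattern for a store path hash (assumption, not the exact Nix alphabet).
REF_RE = re.compile(rb"/nix/store/([0-9a-z]{32})")
PLACEHOLDER = b"/nix/store/" + b"0" * 32

chunk_store: dict[str, bytes] = {}   # chunk digest -> chunk data
metadata: dict[str, dict] = {}       # path label -> chunk list and references


def ingest(path_label: str, contents: bytes) -> None:
    """Strip references, store the result as one chunk, record metadata."""
    refs = [m.group(1).decode() for m in REF_RE.finditer(contents)]
    stripped = REF_RE.sub(lambda m: PLACEHOLDER, contents)
    digest = hashlib.sha256(stripped).hexdigest()
    chunk_store[digest] = stripped  # identical data lands on the same key
    metadata[path_label] = {"chunks": [digest], "references": refs}


def make_file(foo_hash: bytes, bar_hash: bytes) -> bytes:
    """Build the example file contents with the given (hypothetical) hashes."""
    return (
        b"This is a file that references /nix/store/" + foo_hash + b"-foo"
        b" and /nix/store/" + bar_hash + b"-bar"
    )


ingest("path-a", make_file(b"a" * 32, b"b" * 32))
ingest("path-b", make_file(b"c" * 32, b"d" * 32))
```

After both calls, `chunk_store` holds a single entry while `metadata` holds two, each pointing at the same chunk with its own reference list.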

Contributor

ajs124 commented Jan 18, 2022

iirc this is exactly what people are trying to implement in nix itself with CA derivations

Author

Atemu commented Jan 18, 2022

I know much of this is similar, but I don't think they do any chunking like nix-casync does, do they?

@ajs124
Copy link
Contributor

ajs124 commented Jan 18, 2022

Right, yes. The "This way, two paths made from the same derivation that are distinct because of an inconsequential stdenv rebuild but (except for the references) hold exactly the same data, would still dedup against each other." is exactly the same.

So it would still make sense to upload CA store paths into nix-casync because of the chunking, but the whole reference rewriting is probably better implemented in Nix and is supposed to be part of Nix 4.0.

Author

Atemu commented Jan 18, 2022

Oh yeah, absolutely, Nix is the proper place to implement all of this. I'm pretty sure the purpose of this repo is to show off a PoC, rather than something that's actually usable.

Owner

flokli commented Jan 19, 2022

This was also something that was discussed.

There are already plans to add a reference scanner that runs while ingesting .nar files. Replacing the references it finds with placeholders before feeding the data to the chunker could provide more deduplication benefits. With this, we'd be able to deduplicate blocks that contain a lot of strings and differ only in their store paths.
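A reference scanner of the kind mentioned above might look roughly like this sketch. It is not the planned nix-casync code: `scan_references` and its parameters are hypothetical, and it assumes the set of candidate hashes (the possible references of the path being ingested) is known up front. The data arrives in blocks, so a short tail of each block is carried over to catch hashes that straddle a block boundary.

```python
from typing import Iterable

HASH_LEN = 32  # length of the hash part of a store path


def scan_references(blocks: Iterable[bytes], candidates: set[bytes]) -> set[bytes]:
    """Return the candidate hashes that occur anywhere in the streamed data."""
    found: set[bytes] = set()
    tail = b""
    for block in blocks:
        # Prepend the end of the previous block so matches spanning
        # two blocks are still seen.
        window = tail + block
        for candidate in candidates - found:
            if candidate in window:
                found.add(candidate)
        # A hash split across blocks has at most HASH_LEN - 1 bytes in this one.
        tail = window[-(HASH_LEN - 1):]
    return found


# Hypothetical hashes, for illustration only.
known = {b"a" * 32, b"b" * 32}
stream = [b"prefix /nix/store/" + b"a" * 10, b"a" * 22 + b"-foo suffix"]
present = scan_references(stream, known)
```

The hashes found this way would then be swapped for placeholders before the data reaches the chunker, and recorded in the path's metadata.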

This is something I'd also like to test out with a large dataset (#2).


3 participants