Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty files and directories should not resolve to a task in serialize_v1 #197

Open
mihaimaruseac opened this issue Jun 5, 2024 · 1 comment
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@mihaimaruseac
Copy link
Collaborator

Empty files and directories (any directory) have a simple digest: just the hash of the header. Hence, it does not make sense to waste one thread to compute the digest, when it can be computed directly from the start.

This is especially relevant for models that have multiple subdirectories.

@mihaimaruseac mihaimaruseac added the enhancement New feature or request label Jun 5, 2024
@mihaimaruseac mihaimaruseac added this to the V1 release milestone Jun 5, 2024
@mihaimaruseac mihaimaruseac self-assigned this Jun 5, 2024
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 17, 2024
There is a duplication in the directory traversal between
`DFSSerializer` and `FilesSerializer`. Since the later supports parallel
hashing, let's prepare to use only that.

We make `FilesSerializer` be an abstract parent class that performs
the directory traversal. We introduce `ManifestSerializer` for the old
`FilesSerializer` class that was creating a manifest out of the model.
We rename `DFSSerializer` to `DigestSerializer` for consistency.

Since `FilesSerializer` (the directory traversal) only considers files
now, we need to add a `_FileDigestTree` class to transform the list of
files and their hashes (the `FileManifestItem`) to a directory traversal
tree, so we can build the digest for `DigestSerializer` in a bottom-up
fashion, like before. We could have just included only the files,
instead of the directory, but that would require changing a lot of
expected constants in the tests. So, we add this transformation now, we
plan to migrate tests to goldens and then maybe change the hashing to
only include the files when rolling up to a single digest.

We still had to update one test: since the hashes are computed only for
files, we no longer differentiate between a model with an empty
directory and a model where that empty directory is completely removed.
This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted
at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 20, 2024
There is a duplication in the directory traversal between
`DFSSerializer` and `FilesSerializer`. Since the later supports parallel
hashing, let's prepare to use only that.

We make `FilesSerializer` be an abstract parent class that performs
the directory traversal. We introduce `ManifestSerializer` for the old
`FilesSerializer` class that was creating a manifest out of the model.
We rename `DFSSerializer` to `DigestSerializer` for consistency.

Since `FilesSerializer` (the directory traversal) only considers files
now, we need to add a `_FileDigestTree` class to transform the list of
files and their hashes (the `FileManifestItem`) to a directory traversal
tree, so we can build the digest for `DigestSerializer` in a bottom-up
fashion, like before. We could have just included only the files,
instead of the directory, but that would require changing a lot of
expected constants in the tests. So, we add this transformation now, we
plan to migrate tests to goldens and then maybe change the hashing to
only include the files when rolling up to a single digest.

We still had to update one test: since the hashes are computed only for
files, we no longer differentiate between a model with an empty
directory and a model where that empty directory is completely removed.
This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted
at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. This means that this depends on sigstore#244.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit that referenced this issue Jul 22, 2024
* Refactor DFSSerializer to remove code duplication.

There is a duplication in the directory traversal between
`DFSSerializer` and `FilesSerializer`. Since the later supports parallel
hashing, let's prepare to use only that.

We make `FilesSerializer` be an abstract parent class that performs
the directory traversal. We introduce `ManifestSerializer` for the old
`FilesSerializer` class that was creating a manifest out of the model.
We rename `DFSSerializer` to `DigestSerializer` for consistency.

Since `FilesSerializer` (the directory traversal) only considers files
now, we need to add a `_FileDigestTree` class to transform the list of
files and their hashes (the `FileManifestItem`) to a directory traversal
tree, so we can build the digest for `DigestSerializer` in a bottom-up
fashion, like before. We could have just included only the files,
instead of the directory, but that would require changing a lot of
expected constants in the tests. So, we add this transformation now, we
plan to migrate tests to goldens and then maybe change the hashing to
only include the files when rolling up to a single digest.

We still had to update one test: since the hashes are computed only for
files, we no longer differentiate between a model with an empty
directory and a model where that empty directory is completely removed.
This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted
at in #197.

Signed-off-by: Mihai Maruseac <[email protected]>

* Fix Windows

Signed-off-by: Mihai Maruseac <[email protected]>

* Document `__init__`

Signed-off-by: Mihai Maruseac <[email protected]>

* Fix typos, add a type signature

Signed-off-by: Mihai Maruseac <[email protected]>

---------

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 23, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197.

Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac added a commit that referenced this issue Jul 23, 2024
Similar to #241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication.

We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes.

This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens.

We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in #197.

Signed-off-by: Mihai Maruseac <[email protected]>
@mihaimaruseac
Copy link
Collaborator Author

All that's left is to remove the tree from

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant