-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor DFSSerializer to remove code duplication. #241
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There is a duplication in the directory traversal between `DFSSerializer` and `FilesSerializer`. Since the later supports parallel hashing, let's prepare to use only that. We make `FilesSerializer` be an abstract parent class that performs the directory traversal. We introduce `ManifestSerializer` for the old `FilesSerializer` class that was creating a manifest out of the model. We rename `DFSSerializer` to `DigestSerializer` for consistency. Since `FilesSerializer` (the directory traversal) only considers files now, we need to add a `_FileDigestTree` class to transform the list of files and their hashes (the `FileManifestItem`) to a directory traversal tree, so we can build the digest for `DigestSerializer` in a bottom-up fashion, like before. We could have just included only the files, instead of the directory, but that would require changing a lot of expected constants in the tests. So, we add this transformation now, we plan to migrate tests to goldens and then maybe change the hashing to only include the files when rolling up to a single digest. We still had to update one test: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
Signed-off-by: Mihai Maruseac <[email protected]>
Signed-off-by: Mihai Maruseac <[email protected]>
Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
force-pushed
the
refactor_file_dfs
branch
from
July 20, 2024 23:57
80db6ef
to
0a06edf
Compare
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. This means that this depends on sigstore#244. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
spencerschrock
approved these changes
Jul 22, 2024
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 22, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
to mihaimaruseac/model-transparency
that referenced
this pull request
Jul 23, 2024
Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>
mihaimaruseac
added a commit
that referenced
this pull request
Jul 23, 2024
Similar to #241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in #197. Signed-off-by: Mihai Maruseac <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
There is a duplication in the directory traversal between
DFSSerializer
andFilesSerializer
. Since the later supports parallel hashing, let's prepare to use only that.We make
FilesSerializer
be an abstract parent class that performs the directory traversal. We introduceManifestSerializer
for the oldFilesSerializer
class that was creating a manifest out of the model. We renameDFSSerializer
toDigestSerializer
for consistency.Since
FilesSerializer
(the directory traversal) only considers files now, we need to add a_FileDigestTree
class to transform the list of files and their hashes (theFileManifestItem
) to a directory traversal tree, so we can build the digest forDigestSerializer
in a bottom-up fashion, like before. We could have just included only the files, instead of the directory, but that would require changing a lot of expected constants in the tests. So, we add this transformation now, we plan to migrate tests to goldens and then maybe change the hashing to only include the files when rolling up to a single digest.We still had to update one test: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.
In fact, ignoring empty directories is part of the optimization hinted at in #197.
Release Note
NONE
Documentation
NONE