Refactor DFSSerializer to remove code duplication. #241

mihaimaruseac · 2024-07-17T03:23:21Z

Summary

The directory traversal logic is duplicated between DFSSerializer and FilesSerializer. Since the latter supports parallel hashing, let's prepare to use only that.

We make FilesSerializer an abstract parent class that performs the directory traversal. We introduce ManifestSerializer to take over the role of the old FilesSerializer class, which created a manifest out of the model. We rename DFSSerializer to DigestSerializer for consistency.
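The resulting class layout can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `_build_result` hook, the use of SHA-256, and the flat roll-up in `DigestSerializer` are assumptions made to keep the example short (the PR itself builds the digest from a tree, as described below).

```python
import abc
import hashlib
import pathlib
import tempfile


class FilesSerializer(abc.ABC):
    """Abstract parent that performs the directory traversal (sketch)."""

    def serialize(self, model_path: pathlib.Path):
        # Hash regular files only; directories never produce an item.
        items = [
            (p.relative_to(model_path).as_posix(),
             hashlib.sha256(p.read_bytes()).hexdigest())
            for p in model_path.rglob("*")
            if p.is_file()
        ]
        # Sort so the result does not depend on filesystem listing order.
        return self._build_result(sorted(items))

    @abc.abstractmethod
    def _build_result(self, items):
        """Hypothetical hook: turn sorted (path, digest) pairs into the output."""


class ManifestSerializer(FilesSerializer):
    def _build_result(self, items):
        # A manifest keeps one entry per file.
        return dict(items)


class DigestSerializer(FilesSerializer):
    def _build_result(self, items):
        # Simplified flat roll-up of all file digests into a single digest.
        h = hashlib.sha256()
        for path, digest in items:
            h.update(path.encode("utf-8"))
            h.update(bytes.fromhex(digest))
        return h.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        (root / "sub").mkdir()
        (root / "sub" / "a.txt").write_bytes(b"a")
        print(ManifestSerializer().serialize(root))
        print(DigestSerializer().serialize(root))
```

The point of the split is that the traversal lives in one place, and each subclass only decides what to do with the list of hashed files.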

Since FilesSerializer (the directory traversal) only considers files now, we need to add a _FileDigestTree class to transform the list of files and their hashes (the FileManifestItem) into a directory traversal tree, so we can build the digest for DigestSerializer in a bottom-up fashion, like before. We could have included only the files, instead of the directories, but that would require changing a lot of expected constants in the tests. So we add this transformation now; we plan to migrate tests to goldens and then maybe change the hashing to only include the files when rolling up to a single digest.
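To illustrate the idea behind the transformation, here is a small sketch that rebuilds a directory tree from a flat list of (path, digest) pairs and then rolls it up bottom-up. The function names `build_tree` and `rollup`, the nested-dict representation, and the way names and digests are fed to SHA-256 are all assumptions for illustration; the actual `_FileDigestTree` encoding may differ.

```python
import hashlib


def build_tree(items):
    """Turn (posix_path, hex_digest) pairs into nested dicts keyed by name."""
    root = {}
    for path, digest in items:
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            # Create intermediate directory nodes on demand.
            node = node.setdefault(d, {})
        node[name] = digest
    return root


def rollup(node):
    """Bottom-up digest: a directory's digest covers its sorted children."""
    if isinstance(node, str):
        # Leaf: a file, already hashed during traversal.
        return node
    h = hashlib.sha256()
    for name in sorted(node):
        h.update(name.encode("utf-8"))
        h.update(bytes.fromhex(rollup(node[name])))
    return h.hexdigest()


# Example input: two files in one directory plus one top-level file.
items = [
    ("dir/a.txt", hashlib.sha256(b"a").hexdigest()),
    ("dir/b.txt", hashlib.sha256(b"b").hexdigest()),
    ("c.txt", hashlib.sha256(b"c").hexdigest()),
]
root_digest = rollup(build_tree(items))
```

Because directories only appear as intermediate nodes on the way to a file, an empty directory never shows up in the tree at all.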

We still had to update one test: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this.

In fact, ignoring empty directories is part of the optimization hinted at in #197.
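The corner case can be reproduced with a short sketch. The helper `file_digests` is hypothetical and only mimics a file-only traversal; it is not the project's API.

```python
import hashlib
import pathlib
import tempfile


def file_digests(root: pathlib.Path) -> list[tuple[str, str]]:
    """Hypothetical helper: (relative path, SHA-256) for regular files only."""
    pairs = [
        (p.relative_to(root).as_posix(),
         hashlib.sha256(p.read_bytes()).hexdigest())
        for p in root.rglob("*")
        if p.is_file()
    ]
    return sorted(pairs)


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for base in (pathlib.Path(a), pathlib.Path(b)):
        (base / "weights.bin").write_bytes(b"\x00\x01")
    # Only model `a` contains the empty directory.
    (pathlib.Path(a) / "empty_dir").mkdir()
    same = file_digests(pathlib.Path(a)) == file_digests(pathlib.Path(b))
```

Here `same` is true: the two models produce identical file lists, so any digest or manifest derived from them is identical too.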

Release Note

NONE

Documentation

NONE

Signed-off-by: Mihai Maruseac <[email protected]>

Similar to sigstore#241, there is a duplication in the directory traversal between serializing to a digest and serializing to a manifest. This time, both supported parallelism, so there is really no need for the duplication. We make an abstract `ShardedFilesSerializer` class to contain the logic for the directory traversal and then create the better named `DigestSerializer` and `ManifestSerializer` for the two serializing classes. This time, instead of trying extremely hard to match the old behavior for digest serialization, we just update the goldens. This means that this depends on sigstore#244. We still had to update some other tests: since the hashes are computed only for files, we no longer differentiate between a model with an empty directory and a model where that empty directory is completely removed. This is a corner case and it is ok to do this. In fact, ignoring empty directories is part of the optimization hinted at in sigstore#197. Signed-off-by: Mihai Maruseac <[email protected]>

mihaimaruseac marked this pull request as ready for review July 17, 2024 04:48

mihaimaruseac requested review from a team as code owners July 17, 2024 04:48

mihaimaruseac added this to the V1 release milestone Jul 19, 2024

mihaimaruseac added 4 commits July 20, 2024 16:57

Fix Windows

53ce736

Signed-off-by: Mihai Maruseac <[email protected]>

Document __init__

7e0d225

Signed-off-by: Mihai Maruseac <[email protected]>

Fix typos, add a type signature

0a06edf

Signed-off-by: Mihai Maruseac <[email protected]>

mihaimaruseac force-pushed the refactor_file_dfs branch from 80db6ef to 0a06edf Compare July 20, 2024 23:57

mihaimaruseac mentioned this pull request Jul 22, 2024

Refactor sharded serialization to remove code duplication #245

Merged

spencerschrock approved these changes Jul 22, 2024

mihaimaruseac merged commit 99f4d8a into sigstore:main Jul 22, 2024
20 checks passed

mihaimaruseac deleted the refactor_file_dfs branch July 22, 2024 17:54
