rfc 0158 - artifact metadata #158
Conversation
I'm happy with this, but then I was involved in discussing it :)
@ajvb it'd be great to hear from you here, too (I just can't flag you for review).
Wasn't this implemented already as the blob storage type?
See https://github.com/taskcluster/taskcluster-lib-artifact-go#overview
Yes, but aiui the blob storage type was deprecated. Because everything is s3 artifacts now and is likely to stay that way for the foreseeable future, let's add the sha metadata there.
Sounds great to me!
* Queue would have no API to write to this data
* The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)
I guess this was always technically possible with azure table storage, but the fact that we're going to support it now with postgres makes me very happy.
@petemoore does this answer your question?
Yes, thanks Aki for checking.
My recollection is that blob storage was fully implemented, with clients developed for both node.js and go (and possibly even python) - and it also had some nice features such as supporting concurrent multipart uploads, in addition to SHA validation - but for some reason, at some point, the code was removed from the Queue, and thus it is no longer available server-side. I don't have context on why it was removed, but in any case it is likely to have bitrotted by now if we were to add it back. I see it as a shame: it was a good library, and we had tests demonstrating that it functioned well. No doubt there is some context I am missing, though.
It would be useful to describe in the RFC how backward compatibility would be maintained for existing artifacts (will we retrospectively add SHAs to existing artifacts, or just make them optional?)
Other questions:
- Are there any existing API endpoints that need updating that already provide artifact data?
- Is the metadata dictionary string -> string only?
- Will there be a json schema for the metadata dictionary?
It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.
It sounded good, and I would have happily used it if it weren't deprecated. (And if it worked properly :)
I think they should be optional, especially for older artifacts. A cluster-wide setting for new s3 artifacts could be useful.

The RFC just specifies the new metadata. We would essentially need to create a download tool to verify that the metadata in the Queue matches the artifact we downloaded, and use it everywhere, to fully take advantage of this metadata.

Could be. The sha256/sha512 would be string -> string. The filesize entry could be an int or a stringified int. This line in the RFC currently says filesize is an int.

Yes, for all supported metadata. Currently that's just the three entries here, all of them optional; again, I'm thinking we can enforce which ones we set per worker. Does that make sense?
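For illustration, a JSON schema along these lines could describe the metadata dictionary as discussed so far (all three entries optional, filesize as an integer). The property names and patterns below are assumptions for the sketch, not a final schema:

```json
{
  "type": "object",
  "properties": {
    "contentSha256": {"type": "string", "pattern": "^[a-f0-9]{64}$"},
    "contentSha512": {"type": "string", "pattern": "^[a-f0-9]{128}$"},
    "contentFilesize": {"type": "integer", "minimum": 0}
  },
  "additionalProperties": false
}
```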
I like this idea and high-level design. I have a lot of questions about the details though. Tried to keep my review/questions on the high-level design piece for now, but will definitely be interested in the implementation details as well.
rfcs/0158-artifact-metadata.md (outdated)

## Adding artifact metadata to the Queue

First, we add a `metadata` dictionary to the `S3ArtifactRequest` type. This is a dictionary to allow for flexibility of usage. The initial known keys would include
Does Taskcluster handle or proxy this request to S3 in some way? Can Taskcluster itself add these metadata fields rather than having them be user-provided?
Aiui, Taskcluster never sees the artifact.

Upload:

- worker calls `queue.createArtifact` to get an S3 `putUrl` (currently without sha metadata; the RFC suggests adding sha metadata)
- worker uploads to S3 via the `putUrl`

Download:

- worker calls `queue.getArtifact` or `buildSignedUrl`, which redirects or points directly at S3
- (the RFC suggests we also query for artifact metadata and verify, via a download tool)
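A rough sketch of the upload half of that flow from the worker side, using the Python taskcluster client (credentials omitted). Note that the `metadata` key in the `createArtifact` payload is the RFC's proposed addition, not part of the current `S3ArtifactRequest`, and the rootUrl is a placeholder:

```python
import hashlib

import requests
import taskcluster

# Simplified client construction; a real worker would also pass credentials.
queue = taskcluster.Queue({"rootUrl": "https://tc.example.com"})  # placeholder rootUrl


def upload_s3_artifact(task_id, run_id, name, path, expires,
                       content_type="application/octet-stream"):
    # Hash the artifact locally before asking the Queue for a putUrl.
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(chunk)

    payload = {
        "storageType": "s3",
        "contentType": content_type,
        "expires": expires,
        # Proposed by this RFC; not part of the current S3ArtifactRequest.
        "metadata": {"contentSha256": sha256.hexdigest()},
    }
    response = queue.createArtifact(task_id, run_id, name, payload)

    # Upload the artifact body directly to S3 via the signed putUrl.
    with open(path, "rb") as f:
        upload = requests.put(response["putUrl"], data=f,
                              headers={"Content-Type": content_type})
    upload.raise_for_status()
```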
However, we've noted that if we don't trust the worker to generate trusted SHAs, we also can't trust the worker to generate trusted artifacts. So the worker is trusted; the shas mainly prevent modifying artifacts at rest.
Ok, sounds good.
rfcs/0158-artifact-metadata.md (outdated)

ContentSha512 string `json:"contentSha512"`

The sha256 field is required for Artifact Integrity. Mozilla Releng has use cases for all 3 fields, so I'm proposing all 3.
Why both `ContentSha256` and `ContentSha512`?
The above line :)
We could do with just `ContentSha256`. However, when Releng creates our SHA256SUMS and SHA512SUMS files, we have to re-download the files and calculate the shas. Since we have the capability of adding more information here, I thought we might as well support both, and avoid the extra download and sha generation. This is more for efficiency than added security.
It occurs to me that generating 2 shas for all artifacts is likely less efficient than re-downloading and re-generating shas for a small subset of artifacts. We could just use `ContentSha256` :)
Ok totally. Yeah, it just seems like a recipe for confusion having both.
Removed. Now that we've removed `ContentSha512`, we likely won't be needing `ContentSha256Signature`, and `ContentFilesize` is optional, so it does look more and more like we're making `S3Artifact`s closer to `BlobArtifact`s. We could flatten this dictionary if we don't foresee needing any more metadata, or keep it if we want the option.
rfcs/0158-artifact-metadata.md (outdated)

This would improve robustness and security. By storing a SHA to verify on download, we can avoid corrupt downloads. By verifying the read-only SHA before we use the artifact, we can detect malicious tampering of artifacts at rest before we use it for a critical operation.

(This would obsolete the [artifacts section of the Chain of Trust artifact](https://scriptworker.readthedocs.io/en/latest/chain_of_trust.html#chain-of-trust-artifacts), and improve artifact integrity platform-wide.)
Is there a more detailed breakdown of how this provides all of the security features currently provided by CoT?
CoT is mainly three things:
- verifying that the artifacts which were uploaded are the same artifacts we're downloading,
- verifying that upstream workers are under our control, and
- verifying the task graph was generated from a trusted tree.
This RFC directly addresses the first point. We added the `artifacts` section of the CoT artifact to include the file path and sha256 of each artifact uploaded by each task. Because the CoT artifact itself wasn't covered, we added CoT artifact signatures, which kind of cover point 2 as well as help prove the CoT artifact itself wasn't modified at rest. But we only get a CoT artifact if we enable it in the task, and we only get a verifiable CoT signature on a limited set of workers. And currently only scriptworkers verify CoT.
By moving the artifact metadata into the platform, we have the capability of covering all task artifacts, not just the ones from level 3 trusted pools. And by moving artifact metadata verification out of scriptworker into a standalone download tool, we can use the same tool throughout the graph for full graph artifact integrity.
(If worker [re]registration ends up being trusted enough, this RFC + worker [re]registration could end up obsoleting points 1 and 2 of CoT, and CoT can focus on verifying the graph came from a trusted tree.)
Ok thank you for the detailed breakdown. I'd love to see this included in the RFC so that it can be referred back to later.
Added, though I didn't go as far into detail about worker [re]registration, to keep the RFC more focused on artifact metadata.
This tool queries the Queue for the artifact metadata, downloads the artifact, and verifies any shas. We should allow for optional and required metadata fields, and for failing out if any required information is missing, or if the sha doesn't match. We should be sure to measure the shas and filesizes on the right artifact state (e.g. combining a multipart artifact, not compressed unless the original artifact was compressed).

This tool should be usable as a commandline tool, or as a library that the workers can use.
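A minimal sketch of what the verification side of such a tool might look like. The `metadata` dictionary is assumed to come from the Queue via whatever query mechanism the RFC settles on (no such endpoint exists today), and the field names follow the proposal above:

```python
import hashlib

import requests


class IntegrityError(Exception):
    pass


def download_and_verify(url, metadata, dest, required=("contentSha256",)):
    """Download `url` to `dest`, then check it against the Queue-provided metadata.

    Missing optional metadata fields are skipped; missing required fields are fatal.
    """
    for field in required:
        if field not in metadata:
            raise IntegrityError(f"required metadata field {field} is missing")

    sha256 = hashlib.sha256()
    size = 0
    with requests.get(url, stream=True) as resp, open(dest, "wb") as f:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            sha256.update(chunk)
            size += len(chunk)
            f.write(chunk)

    if "contentSha256" in metadata and metadata["contentSha256"] != sha256.hexdigest():
        raise IntegrityError("sha256 mismatch; refusing to use the artifact")
    if "contentFilesize" in metadata and int(metadata["contentFilesize"]) != size:
        raise IntegrityError("filesize mismatch; refusing to use the artifact")
    return dest
```

The same function could back both a commandline entry point and a library call from the workers.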
Will this also be included in Taskcluster users' workflows? i.e. QA downloading an artifact to test?
If so, how will that be "enforced" or "promoted"?
I think we would start with encouraging and recommending the tool, and making it easy to get and use. Once that starts happening, we could update user docs to mention the tool, and remove references to other ways to download things. But our main use case here is protecting the release pipeline: making sure no malicious artifact, worker, or task can result in us shipping something arbitrary. If we end up not enforcing manual use by humans, but those manual tasks never result in our shipping something we didn't expect to ship, we may be ok with that level of risk.
Ok, sounds good.
LGTM
I'd like to see the detailed breakdown of how this replaces part of CoT (https://github.com/taskcluster/taskcluster-rfcs/pull/158/files#r454092603) within the RFC so that it is easy to refer back to later
How's e11edc2?
The sha256 field is required for Artifact Integrity. The filesize field is optional.

A future entry may be `ContentSha256WorkerSignature`, if we solve worker identity, though we may abandon this in favor of worker [re]registration.
What does "if we solve worker identity" mean?
This is referring to #157, which is looking less likely to happen.
This tool should be usable as a commandline tool, or as a library that the workers can use.

Once we implement worker signatures in artifact metadata, the download tool will verify those signatures as well.
What gets signed by the worker? The artifact content + metadata, or just the metadata, or something else?
It would sign the metadata. In this case, just the sha256 hash.
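If worker signatures do eventually land, the signing step itself could be small. A hedged sketch using an ed25519 key via the `cryptography` package; the key provisioning, the signature encoding, and the `contentSha256Signature` key are assumptions for illustration, since worker identity isn't solved yet:

```python
import base64

from cryptography.hazmat.primitives.asymmetric import ed25519


def sign_metadata(worker_key: ed25519.Ed25519PrivateKey, content_sha256: str) -> str:
    # Sign only the metadata (here, just the sha256 hex digest), not the artifact body.
    signature = worker_key.sign(content_sha256.encode("ascii"))
    return base64.b64encode(signature).decode("ascii")


# Usage; a real worker key would come from provisioning, not be generated ad hoc.
key = ed25519.Ed25519PrivateKey.generate()
metadata = {"contentSha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}
metadata["contentSha256Signature"] = sign_metadata(key, metadata["contentSha256"])
```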
* Queue would have no API to write to this data
* The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)

## Providing a download tool
Should there not be an upload tool too, that calculates the hashes, and uploads them?
Would the tool support parallel multipart uploads?
What validates that the sha(s) are correct on upload?
> Should there not be an upload tool too, that calculates the hashes, and uploads them?

I was thinking this would be part of the worker. We could certainly provide a shared library or tool for docker-worker, generic-worker, and scriptworker to all use the same code, but I was thinking we would implement it in node, go, and python. I'm not sure we need upload capability outside the workers, since we're trying to verify artifacts uploaded by workers, and not grant easy arbitrary uploads to everyone.

> Would the tool support parallel multipart uploads?

The workers do, aiui.

> What validates that the sha(s) are correct on upload?

The worker is trusted, or we can't trust the artifact itself, let alone its sha.
What I really mean is: is there a validation step after the artifact upload, that the SHA(s) provided in metadata match those of the uploaded file, or does this only happen when the artifact is downloaded?

The advantage I see of validating the SHA(s) post-upload, if possible, is that taskcluster can refuse to store artifacts that claim to have one SHA but actually have a different one. I'm not sure if S3 provides any mechanisms to support SHA validation without the entire artifact needing to be downloaded, but if it does (such as an HTTP header containing the SHA256/SHA512 in the HTTP PUT that uploads the artifact), it would be great to use it. If we would need to download the artifact, perhaps we could limit the check to random spot checks, or to artifacts that are explicitly required to be validated.

Currently we make a taskcluster API call (`queue.createArtifact`) to request to upload an artifact, then negotiate the artifact upload with S3, but do not perform a second taskcluster API call to confirm successful completion of the upload. I could see a solution whereby, after the upload is performed, the queue considers the artifact upload to be in progress, but requires a second API call to confirm that the artifact upload is completed. Addressing this would allow for smarter behaviour in the task inspector and other tools that display artifacts. It would even be reasonable for the second taskcluster API call (e.g. `queue.reportArtifactUploadProgress`) to have a payload with something like:

```json
{
  "uploadedBytes": 1349124,
  "totalBytes": 344134232
}
```

so that the endpoint could be called multiple times during artifact upload to report progress on the upload state. If the connection to S3 dropped during upload, and the upload had to start again, this could all be made visible from the task inspector etc.

This last suggestion is a little bit of scope creep, but it feels like, if we are redesigning the artifact upload and download mechanics, it would be a good time to review whether a second API call would be useful here for reporting progress.

And the first point above is really just seeing if we can find a means to ensure that taskcluster never claims to have artifacts with one SHA but actually stores an artifact with a different SHA (for example due to data getting corrupted during transit).
Hm. I could see that being useful for a) saving storage costs for invalid artifacts, b) more end-to-end reliability, c) marking the correct task as failed, and so on. (For (c), without post-upload-verification, the downstream tasks will fail when they try to verify their download; in certain graphs, this could be hundreds of downstream tasks failing. With post-upload-verification, the task that failed to upload properly will fail, which is more accurate.)
Those are all wins. However, aiui, S3 doesn't provide a way to verify checksums unless we downgrade to md5, so we would need to re-download and re-calculate the checksums. That may be costly enough in machine time that we need to determine whether this is worth it.
This RFC seems to be pointing at the question of why blob artifacts were deprecated, and whether it makes more sense to revive them; I'm leaning towards blocking this RFC on that decision.
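Returning to the in-transit corruption point above: S3 will reject a PUT whose body doesn't match a supplied `Content-MD5` header, which gives a transit integrity check without any re-download. It doesn't protect against deliberate tampering, and it assumes the signed putUrl doesn't preclude the extra header; a rough sketch:

```python
import base64
import hashlib

import requests


def put_with_md5_check(put_url, path, content_type="application/octet-stream"):
    # S3 verifies the uploaded body against Content-MD5 and rejects the PUT on mismatch.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)

    headers = {
        "Content-Type": content_type,
        "Content-MD5": base64.b64encode(md5.digest()).decode("ascii"),
    }
    with open(path, "rb") as f:
        resp = requests.put(put_url, data=f, headers=headers)
    resp.raise_for_status()  # a corrupted upload surfaces here as a BadDigest error
    return resp
```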
@djmitche can you please provide more substantive context on the unaddressed issues?

To the best of my knowledge, the following two libraries were implemented and working: https://github.com/taskcluster/taskcluster-lib-artifact and https://github.com/taskcluster/taskcluster-lib-artifact-go. They interfaced with the server-side implementation, which suggests the server-side implementation was also working. Certainly, it was reviewed and merged to master, and at the time John considered it to be feature complete. I was also successfully using it, and there was a suite of unit and integration tests that suggested it was working and feature complete. The only missing piece I am aware of is that the workers were not updated to use the new artifact upload/download library.

However, it sounds like you have insight into issues that I am not aware of, so I am interested to understand those better, especially if addressing those issues is a smaller task than rewriting everything from scratch. It feels like Aki is going to have a lot of work to do to reimplement everything, so some very specific details about missing functionality would be useful, and a link to the server-side implementation that you deleted may help simplify things for the reimplementation.

@escapewindow Hopefully the links above for the client side libraries and upload/download tools will save you some time and effort too.
+1. If blob artifacts work, can be rolled out everywhere for FirefoxCI, and we can provide guarantees that the hashes can't be altered post-upload, then it looks like that would be sufficient for my purposes.
As I understand it: current workers don't upload multipart and don't verify that uploads happened without issues. Outside of Chain of Trust and other non-platform-supported options, we don't verify download hashes either.

However, we would need to spend some engineering effort to complete blob artifact support in the workers. We may need to either prioritize implementing that, or continue with the s3 artifact metadata approach in this RFC.
Should we merge this?
Yep, thanks for the bump.
Fixes #156.