RFC 0158 - artifact metadata (#158)

Merged (1 commit) into taskcluster:main on Oct 16, 2020

Conversation

@escapewindow (Contributor):

Fixes #156.

@escapewindow requested a review from a team February 22, 2020 00:48
@djmitche (Contributor) left a comment:

I'm happy with this, but then I was involved in discussing it :)

@djmitche (Contributor):

@ajvb it'd be great to hear from you here, too (I just can't flag you for review)

@petemoore (Member) left a comment:

Wasn't this implemented already as the blob storage type?

See https://github.com/taskcluster/taskcluster-lib-artifact-go#overview

@escapewindow (author):

Yes, but aiui the blob storage type was deprecated. Because everything is s3artifacts and is likely going to stay that way for the foreseeable future, let's add the sha metadata there.

@imbstack (Contributor) left a comment:

Sounds great to me!



* Queue would have no API to write to this data
* The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)
Contributor:

I guess this was always technically possible with azure table storage but the fact that we're going to support it now with postgres makes me very happy
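For illustration, a sketch of how that write-once property might be enforced at the database level. The table name (`queue_artifacts`), role name (`queue_service`), and the migration-time grant are all assumptions for the example, not the actual Taskcluster schema:

```go
package artifactsketch

import "database/sql"

// lockDownArtifactMetadata sketches how a deployment migration could enforce
// the "no updates" rule: the queue's database role keeps SELECT, INSERT, and
// DELETE on the (hypothetical) artifact metadata table, but never receives
// UPDATE, so recorded hashes cannot be rewritten after upload.
func lockDownArtifactMetadata(db *sql.DB) error {
	stmts := []string{
		`REVOKE ALL ON queue_artifacts FROM queue_service`,
		`GRANT SELECT, INSERT, DELETE ON queue_artifacts TO queue_service`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			return err
		}
	}
	return nil
}
```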

@escapewindow mentioned this pull request May 1, 2020

@escapewindow (author) commented May 1, 2020:

> Wasn't this implemented already as the blob storage type?
>
> See https://github.com/taskcluster/taskcluster-lib-artifact-go#overview

Yes, but aiui the blob storage type was deprecated. Because everything is s3artifacts and is likely going to stay that way for the foreseeable future, let's add the sha metadata there.

@petemoore does this answer your question?

@petemoore (Member) left a comment:

Yes, thanks Aki for checking.

My recollection is that blob storage was fully implemented, with clients developed for both node.js and go (and possibly even python) - and it also had some nice features such as supporting concurrent multipart uploads, in addition to SHA validation - but for some reason, at some point, the code was removed from the Queue, and thus it is no longer available server-side. I don't have context on why it was removed, but in any case it is likely to have bitrotted by now if we were to add it back. I see it as a shame: it was a good library, and we had tests demonstrating that it functioned well. No doubt there is some context I am missing though.

It would be useful to describe in the RFC how backward compatibility would be maintained for existing artifacts (will we retrospectively add SHAs to existing artifacts, or just make them optional?)

Other questions:

  • Are there any existing API endpoints that need updating that already provide artifact data?
  • Is the metadata dictionary string -> string only?
  • Will there be a json schema for the metadata dictionary?

@djmitche (Contributor) commented Jun 4, 2020:

> My recollection is that blob storage was fully implemented

It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.

@escapewindow (author):

> My recollection is that blob storage was fully implemented, with clients developed for both node.js and go (and possibly even python) - and it also had some nice features such as supporting concurrent multipart uploads, in addition to SHA validation - but for some reason, at some point, the code was removed from the Queue, and thus it is no longer available server-side. I don't have context on why it was removed, but in any case it is likely to have bitrotted by now if we were to add it back. I see it as a shame: it was a good library, and we had tests demonstrating that it functioned well. No doubt there is some context I am missing though.

It sounded good, and I would have happily used it if it weren't deprecated. (And if it worked properly :)

> It would be useful to describe in the RFC how backward compatibility would be maintained for existing artifacts (will we retrospectively add SHAs to existing artifacts, or just make them optional?)

I think they should be optional, especially for older artifacts. A cluster-wide setting for new s3artifacts could be useful (requireS3ArtifactMetadata ?). A worker setting may be sufficient here. We could set the worker setting on all firefoxci cluster worker pools, and possibly enforce it via ci-admin checks.

> Other questions:
>
> * Are there any existing API endpoints that need updating that already provide artifact data?

The RFC currently just specifies the new Queue.getArtifactInfo endpoint. I think we will update the S3Artifact request and response schemas to allow for extra optional metadata, but I believe that would work with the existing createArtifact endpoint.

To fully take advantage of this metadata, we would essentially need to create a download tool that verifies the metadata in the queue matches the artifact we downloaded, and use it everywhere.

> * Is the metadata dictionary string -> string only?

Could be. The sha256/sha512 would be string -> string. The filesize entry could be an int or a stringified int. This line in the rfc currently says filesize is an int.

> * Will there be a json schema for the metadata dictionary?

Yes, for all supported metadata. Currently that's just the three entries here, all of them optional; again, I'm thinking we can enforce which ones we set per worker.
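For concreteness, a sketch of how that schema might map onto a Go type, using the key names from the RFC draft; marking every field optional (and leaving enforcement to per-worker configuration) reflects my reading of the discussion, not a settled design:

```go
package artifactsketch

// ArtifactMetadata sketches the proposed `metadata` dictionary on
// S3ArtifactRequest. All fields are optional at the schema level; which ones
// a given worker pool must supply would be enforced by worker configuration.
type ArtifactMetadata struct {
	ContentSha256   string `json:"contentSha256,omitempty"`
	ContentSha512   string `json:"contentSha512,omitempty"`
	ContentFilesize int64  `json:"contentFilesize,omitempty"`
}
```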

Does that make sense?

@ajvb left a comment:

I like this idea and high-level design. I have a lot of questions about the details though. Tried to keep my review/questions on the high-level design piece for now, but will definitely be interested in the implementation details as well.

rfcs/0158-artifact-metadata.md (review thread; outdated, resolved)

## Adding artifact metadata to the Queue

First, we add a `metadata` dictionary to the `S3ArtifactRequest` type. This is a dictionary to allow for flexibility of usage. The initial known keys would include

Does Taskcluster handle or proxy in some way this request to S3? Can Taskcluster itself add these metadata fields rather than having them be user provided?

@escapewindow (author):

Aiui, Taskcluster never sees the artifact.

https://github.com/taskcluster/taskcluster/blob/4441d319a79a657b392d6c903686f166f3636da1/ui/docs/reference/platform/queue/worker-interaction.mdx#uploading-artifacts

Upload:

  • worker calls queue.createArtifact to get an s3 putUrl (currently without sha metadata; RFC suggests adding sha metadata)
  • worker uploads to s3 via the putUrl

Download:

  • worker calls queue.getArtifact or buildSignedUrl, which redirects or points directly at S3.
  • (RFC suggests we also query for artifact metadata and verify, via a download tool)
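A rough sketch of the upload half with the proposed metadata, in Go; `requestPutURL` below is a hypothetical stand-in for the `queue.createArtifact` call, and error handling is abbreviated:

```go
package artifactsketch

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"os"
)

// requestPutURL stands in for queue.createArtifact: under the RFC it would
// accept the sha256 (and optionally the filesize) as artifact metadata and
// return a signed S3 putUrl. Placeholder only.
func requestPutURL(name string, metadata map[string]string) (string, error) {
	return "", fmt.Errorf("not implemented in this sketch")
}

// uploadArtifact hashes the file locally, records the hash with the Queue via
// the createArtifact request, then PUTs the bytes to the signed URL.
func uploadArtifact(name, path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)

	putURL, err := requestPutURL(name, map[string]string{
		"contentSha256":   hex.EncodeToString(sum[:]),
		"contentFilesize": fmt.Sprintf("%d", len(data)),
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodPut, putURL, bytes.NewReader(data))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("upload failed: %s", resp.Status)
	}
	return nil
}
```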

@escapewindow (author):

However, we've noted that if we don't trust the worker to generate trusted SHAs, we also can't trust it to generate trusted artifacts. So the worker is trusted; the SHAs mainly guard against artifacts being modified at rest.


Ok, sounds good.

```go
ContentSha512 string `json:"contentSha512"`
```

The sha256 field is required for Artifact Integrity. Mozilla Releng has use cases for all 3 fields, so I'm proposing all 3.

Why both ContentSha256 and ContentSha512?

@escapewindow (author):

The above line :)

We could do with just ContentSha256. However, when Releng creates our SHA256SUMS and SHA512SUMS files, we have to re-download the files and calculate the shas. Since we have the capability of adding more information here, I thought we might as well support both, and avoid the extra download and sha generation. This is more for efficiency than added security.

@escapewindow (author):

It occurs to me that generating 2 shas for all artifacts is likely less efficient than re-downloading and re-generating shas for a small subset of artifacts. We could just use ContentSha256 :)


Ok totally. Yeah, it just seems like a recipe for confusion having both.

@escapewindow (author):

Removed. Now that we've removed ContentSha512, we likely won't need ContentSha256Signature, and ContentFilesize is optional, so it does look more and more like we're making S3Artifacts closer to BlobArtifacts. We could flatten this dictionary if we don't foresee needing any more metadata, or keep it if we want the option.


This would improve robustness and security. By storing a SHA to verify on download, we can avoid corrupt downloads. By verifying the read-only SHA before we use the artifact, we can detect malicious tampering of artifacts at rest before we use it for a critical operation.

(This would obsolete the [artifacts section of the Chain of Trust artifact](https://scriptworker.readthedocs.io/en/latest/chain_of_trust.html#chain-of-trust-artifacts), and improve artifact integrity platform-wide.)

Is there a more detailed breakdown of how this provides all of the security features currently provided by CoT?

@escapewindow (author):

CoT is mainly three things:

  1. verifying that the artifacts which were uploaded are the same artifacts we're downloading,
  2. verifying that upstream workers are under our control, and
  3. verifying the task graph was generated from a trusted tree.

This RFC directly addresses the first point. We added the artifacts section of the CoT artifact to include the file path and sha256 of each artifact uploaded by each task. Because the CoT artifact itself wasn't covered, we added CoT artifact signatures, which kind of cover point 2 as well as help prove the CoT artifact itself wasn't modified at rest. But we only get a CoT artifact if we enable it in the task, and we only get a verifiable CoT signature on a limited set of workers. And currently only scriptworkers verify CoT.

By moving the artifact metadata into the platform, we have the capability of covering all task artifacts, not just the ones from level 3 trusted pools. And by moving artifact metadata verification out of scriptworker into a standalone download tool, we can use the same tool throughout the graph for full graph artifact integrity.

@escapewindow (author):

(If worker [re]registration ends up being trusted enough, this RFC + worker [re]registration could end up obsoleting points 1 and 2 of CoT, and CoT can focus on verifying the graph came from a trusted tree.)


Ok thank you for the detailed breakdown. I'd love to see this included in the RFC so that it can be referred back to later.

@escapewindow (author):

Added, though I didn't go as far into detail about worker [re]registration, to keep the RFC more focused on artifact metadata.


This tool queries the Queue for the artifact metadata, downloads the artifact, and verifies any shas. We should allow for optional and required metadata fields, and for failing out if any required information is missing, or if the sha doesn't match. We should be sure to measure the shas and filesizes on the right artifact state (e.g. combining a multipart artifact, not compressed unless the original artifact was compressed).

This tool should be usable as a commandline tool, or as a library that the workers can use.
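As an illustration of the verification step such a tool might perform (assuming the expected sha256 has already been fetched via the proposed `Queue.getArtifactInfo`; the function shape here is an assumption, not the tool's actual design):

```go
package artifactsketch

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadAndVerify streams the artifact from its (signed) URL to destPath,
// hashing it as it is written, and fails if the computed sha256 does not
// match the value recorded in the Queue's artifact metadata.
func downloadAndVerify(url, destPath, expectedSha256 string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("download failed: %s", resp.Status)
	}

	out, err := os.Create(destPath)
	if err != nil {
		return err
	}
	defer out.Close()

	h := sha256.New()
	// Write to disk and to the hash in one pass.
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		return err
	}

	got := hex.EncodeToString(h.Sum(nil))
	if got != expectedSha256 {
		return fmt.Errorf("sha256 mismatch: expected %s, got %s", expectedSha256, got)
	}
	return nil
}
```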

Will this also be included in Taskcluster Users workflows? i.e. QA downloading an artifact to test?

If so, how will that be "enforced" or "promoted"?

@escapewindow (author):

I think we would start with encouraging and recommending the tool, and making it easy to get and use. Once that starts happening, we could update user docs to mention the tool, and remove references to other ways to download things. But our main use case here is protecting the release pipeline: making sure no malicious artifact, worker, or task can result in us shipping something arbitrary. If we end up not enforcing manual use by humans, but those manual tasks never result in our shipping something we didn't expect to ship, we may be ok with that level of risk.


Ok, sounds good.

@ajvb left a comment:

LGTM

I'd like to see the detailed breakdown of how this replaces part of CoT (https://github.com/taskcluster/taskcluster-rfcs/pull/158/files#r454092603) within the RFC so that it is easy to refer back to later

@escapewindow (author):

> LGTM
>
> I'd like to see the detailed breakdown of how this replaces part of CoT (https://github.com/taskcluster/taskcluster-rfcs/pull/158/files#r454092603) within the RFC so that it is easy to refer back to later

How's e11edc2?


The sha256 field is required for Artifact Integrity. The filesize field is optional.

A future entry may be `ContentSha256WorkerSignature`, if we solve worker identity, though we may abandon this in favor of worker [re]registration.
Member:

What does "if we solve worker identity" mean?

@escapewindow (author):

This is referring to #157, which is looking less likely to happen.


This tool should be usable as a commandline tool, or as a library that the workers can use.

Once we implement worker signatures in artifact metadata, the download tool will verify those signatures as well.
Member:

What gets signed by the worker? The artifact content + metadata, or just the metadata, or something else?

@escapewindow (author):

It would sign the metadata. In this case, just the sha256 hash.
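Purely as a sketch of what that could look like if worker signatures ever land (the signature scheme, key handling, and function names here are all assumptions; ed25519 is just an example choice):

```go
package artifactsketch

import (
	"crypto/ed25519"
	"encoding/hex"
)

// signMetadata sketches a worker-side signature over the artifact metadata:
// the worker signs just the sha256 string with its private key, and a
// verifier checks it against the worker's published public key. How worker
// keys would be issued and published is not addressed here.
func signMetadata(priv ed25519.PrivateKey, contentSha256 string) string {
	return hex.EncodeToString(ed25519.Sign(priv, []byte(contentSha256)))
}

// verifyMetadata checks a hex-encoded signature produced by signMetadata.
func verifyMetadata(pub ed25519.PublicKey, contentSha256, hexSig string) bool {
	sig, err := hex.DecodeString(hexSig)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, []byte(contentSha256), sig)
}
```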

* Queue would have no API to write to this data
* The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)

## Providing a download tool
Member:

Should there not be an upload tool too, that calculates the hashes, and uploads them?
Would the tool support parallel multipart uploads?
What validates that the sha(s) are correct on upload?

@escapewindow (author):

> Should there not be an upload tool too, that calculates the hashes, and uploads them?

I was thinking this would be part of the worker. We could certainly provide a shared library or tool so that docker-worker, generic-worker, and scriptworker all use the same code, but I was thinking we would implement it in node, go, and python. I'm not sure we need upload capability outside the workers, since we're trying to verify artifacts uploaded by workers, and not grant easy arbitrary uploads to everyone.

> Would the tool support parallel multipart uploads?

The workers do, aiui.

> What validates that the sha(s) are correct on upload?

The worker is trusted, or we can't trust the artifact itself, let alone its sha.

Member:

What I really mean is: is there a validation step after the artifact upload that checks that the SHA(s) provided in the metadata match those of the uploaded file, or does this only happen when the artifact is downloaded?

The advantage I see of validating the SHA(s) post-upload, if possible, is that taskcluster can refuse to store artifacts that claim to have one SHA but actually have a different one. I'm not sure if S3 provides any mechanisms to support SHA validation without the entire artifact needing to be downloaded, but if it does (such as an HTTP header containing the SHA256/SHA512 in the HTTP PUT that uploads the artifact), it would be great to use it. If we would need to download the artifact, perhaps we could limit the check to random spot checks, or to artifacts that are explicitly required to be validated.

Currently we make a taskcluster API call (queue.createArtifact) to request to upload an artifact, then negotiate the artifact upload with S3, but do not perform a second taskcluster API call to confirm successful completion of the upload. I could see a solution whereby, after the initial createArtifact call, the queue considers the artifact upload to be in progress until a second API call confirms that it has completed. Addressing this would allow for smarter behaviour in the task inspector and other tools that display artifacts. It would even be reasonable for the second taskcluster API call (e.g. queue.reportArtifactUploadProgress) to have a payload with something like:

```json
{
  "uploadedBytes": 1349124,
  "totalBytes": 344134232
}
```

so that the endpoint could be called multiple times during artifact upload to report progress on the upload state. If the connection to S3 dropped during upload, and the upload had to start again, this could all be made visible from the task inspector etc.
This last suggestion is a little bit of scope creep, but if we are redesigning the artifact upload and download mechanics, it feels like a good time to review whether a second API call would be useful here for reporting progress.
And the first point above is really just about finding a means to ensure that taskcluster never claims to have an artifact with one SHA while actually storing an artifact with a different SHA (for example due to data getting corrupted in transit).

@escapewindow (author):

Hm. I could see that being useful for a) saving storage costs for invalid artifacts, b) more end-to-end reliability, c) marking the correct task as failed, and so on. (For (c), without post-upload-verification, the downstream tasks will fail when they try to verify their download; in certain graphs, this could be hundreds of downstream tasks failing. With post-upload-verification, the task that failed to upload properly will fail, which is more accurate.)

Those are all wins. However, aiui, S3 doesn't provide a way to verify checksums unless we downgrade to md5, so we would need to re-download and re-calculate the checksums. That may be costly enough in machine time that we need to determine whether this is worth it.

This RFC seems to be pointing at the question of why blob artifacts were deprecated, and whether it makes more sense to revive them; I'm leaning towards blocking this RFC on that decision.

@petemoore (Member):

> My recollection is that blob storage was fully implemented
>
> It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.

@djmitche can you please provide more substantive context on the unaddressed issues? To the best of my knowledge, the following two libraries were implemented and working:

https://github.com/taskcluster/taskcluster-lib-artifact
https://github.com/taskcluster/taskcluster-lib-artifact-go

They interfaced with the server-side implementation, which suggests the server-side implementation was also working. Certainly, it was reviewed and merged to master, and at the time John considered it to be feature complete. I was also successfully using it, and there was a suite of unit and integration tests that suggested it was working and feature complete.

The only missing piece I am aware of is that the workers were not updated to use the new artifact upload/download library. However it sounds like you have insight into issues that I am not aware of, so I am interested to understand those better, especially if addressing those issues is a smaller task than rewriting everything from scratch.

It feels like Aki is going to have a lot of work to do to reimplement everything, so some very specific details about missing functionality would be useful, and a link to the server-side implementation that you deleted may help simplify things for the reimplementation.

@escapewindow Hopefully the links above for the client side libraries and upload/download tools will save you some time and effort too.

@escapewindow (author):

+1. If blob artifacts work, can be rolled out everywhere for FirefoxCI, and we can provide guarantees that the hashes can't be altered post-upload, then it looks like that would be sufficient for my purposes.

@escapewindow (author):

As I understand it:

Current workers don't upload multipart and don't verify that uploads happened without issues. Outside of Chain of Trust and other non-platform-supported options, we don't verify download hashes either.

BlobArtifacts, if fully implemented and rolled out, would speed up uploads by supporting multipart uploading, which may result in run-time cost savings. We would verify uploads at upload time. We could mark the uploading task as failed or exception on upload failure, rather than moving bustage downstream to the downloading task(s). We could also obsolete this RFC, since BlobArtifacts require sha256 hashes for each artifact and provide a download tool.

However, we would need to spend some engineering effort to complete BlobArtifact support. This effort may be higher than adding extra information to the existing S3Artifact type. BlobArtifacts are S3-specific and would take significant engineering effort to support other cloud providers.

We may need to prioritize either BlobArtifacts or the object service. The object service would support multiple cloud platforms, provide cross-region and cross-cloud mirroring, and improve artifact maintenance and culling. These would result in storage and bandwidth cost savings.

@ccooper changed the base branch from master to main July 28, 2020 22:08
@escapewindow (author):

Should we merge this?

@djmitche (Contributor):

Yep, thanks for the bump.

@djmitche merged commit f765ac3 into taskcluster:main Oct 16, 2020