RFC 0158 - artifact metadata (#158)

Merged (1 commit) into taskcluster:main on Oct 16, 2020

Conversation

@escapewindow (Contributor):

Fixes #156.

@escapewindow requested a review from a team February 22, 2020 00:48
@djmitche (Contributor) left a comment:

I'm happy with this, but then I was involved in discussing it :)

@djmitche (Contributor):

@ajvb it'd be great to hear from you here, too (I just can't flag you for review)

@petemoore (Member) left a comment:

Wasn't this implemented already as the blob storage type?

See https://github.com/taskcluster/taskcluster-lib-artifact-go#overview

@escapewindow (author):

Yes, but aiui the blob storage type was deprecated. Because everything is s3artifacts and is likely going to stay that way for the foreseeable future, let's add the sha metadata there.

@imbstack (Contributor) left a comment:

Sounds great to me!



* Queue would have no API to write to this data
* The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)
Contributor:

I guess this was always technically possible with azure table storage but the fact that we're going to support it now with postgres makes me very happy
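For illustration, a sketch of how that write-once property might be enforced at the database level. The table name (`queue_artifacts`), role name (`queue_service`), and the migration-time grant are all assumptions for the example, not the actual Taskcluster schema:

```go
package artifactsketch

import "database/sql"

// lockDownArtifactMetadata sketches how a deployment migration could enforce
// the "no updates" rule: the queue's database role keeps SELECT, INSERT, and
// DELETE on the (hypothetical) artifact metadata table, but never receives
// UPDATE, so recorded hashes cannot be rewritten after upload.
func lockDownArtifactMetadata(db *sql.DB) error {
	stmts := []string{
		`REVOKE ALL ON queue_artifacts FROM queue_service`,
		`GRANT SELECT, INSERT, DELETE ON queue_artifacts TO queue_service`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			return err
		}
	}
	return nil
}
```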

@escapewindow mentioned this pull request May 1, 2020

@escapewindow (author) commented May 1, 2020:

> Wasn't this implemented already as the blob storage type?
>
> See https://github.com/taskcluster/taskcluster-lib-artifact-go#overview

Yes, but aiui the blob storage type was deprecated. Because everything is s3artifacts and is likely going to stay that way for the foreseeable future, let's add the sha metadata there.

@petemoore does this answer your question?

@petemoore (Member) left a comment:

Yes, thanks Aki for checking.

My recollection is that blob storage was fully implemented, with clients developed for both node.js and go (and possibly even python) - and it also had some nice features such as supporting concurrent multipart uploads, in addition to SHA validation - but for some reason, at some point, the code was removed from the Queue, and thus it is no longer available server-side. I don't have context on why it was removed, but in any case it is likely to have bitrotted by now if we were to add it back. I see it as a shame: it was a good library, and we had tests demonstrating that it functioned well. No doubt there is some context I am missing though.

It would be useful to describe in the RFC how backward compatibility would be maintained for existing artifacts (will we retrospectively add SHAs to existing artifacts, or just make them optional?)

Other questions:

  • Are there any existing API endpoints that need updating that already provide artifact data?
  • Is the metadata dictionary string -> string only?
  • Will there be a json schema for the metadata dictionary?

@djmitche (Contributor) commented Jun 4, 2020:

> My recollection is that blob storage was fully implemented

It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.

@escapewindow (author):

> My recollection is that blob storage was fully implemented, with clients developed for both node.js and go (and possibly even python) - and it also had some nice features such as supporting concurrent multipart uploads, in addition to SHA validation - but for some reason, at some point, the code was removed from the Queue, and thus it is no longer available server-side. I don't have context on why it was removed, but in any case it is likely to have bitrotted by now if we were to add it back. I see it as a shame: it was a good library, and we had tests demonstrating that it functioned well. No doubt there is some context I am missing though.

It sounded good, and I would have happily used it if it weren't deprecated. (And if it worked properly :)

> It would be useful to describe in the RFC how backward compatibility would be maintained for existing artifacts (will we retrospectively add SHAs to existing artifacts, or just make them optional?)

I think they should be optional, especially for older artifacts. A cluster-wide setting for new s3artifacts could be useful (requireS3ArtifactMetadata ?). A worker setting may be sufficient here. We could set the worker setting on all firefoxci cluster worker pools, and possibly enforce it via ci-admin checks.

> Other questions:
>
> * Are there any existing API endpoints that need updating that already provide artifact data?

The RFC currently just specifies the new Queue.getArtifactInfo endpoint. I think we will update the S3Artifact request and response schemas to allow for extra optional metadata, but I believe that would work with the existing createArtifact endpoint.

To fully take advantage of this metadata, we would essentially need to create a download tool that verifies the metadata in the queue matches the artifact we downloaded, and use it everywhere.

> * Is the metadata dictionary string -> string only?

Could be. The sha256/sha512 would be string -> string. The filesize entry could be an int or a stringified int. This line in the rfc currently says filesize is an int.

> * Will there be a json schema for the metadata dictionary?

Yes, for all supported metadata. Currently that's just the three entries here, all of them optional; again, I'm thinking we can enforce which ones we set per worker.
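For concreteness, a sketch of how that schema might map onto a Go type, using the key names from the RFC draft; marking every field optional (and leaving enforcement to per-worker configuration) reflects my reading of the discussion, not a settled design:

```go
package artifactsketch

// ArtifactMetadata sketches the proposed `metadata` dictionary on
// S3ArtifactRequest. All fields are optional at the schema level; which ones
// a given worker pool must supply would be enforced by worker configuration.
type ArtifactMetadata struct {
	ContentSha256   string `json:"contentSha256,omitempty"`
	ContentSha512   string `json:"contentSha512,omitempty"`
	ContentFilesize int64  `json:"contentFilesize,omitempty"`
}
```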

Does that make sense?

@ajvb left a comment:

I like this idea and high-level design. I have a lot of questions about the details though. Tried to keep my review/questions on the high-level design piece for now, but will definitely be interested in the implementation details as well.

rfcs/0158-artifact-metadata.md (review thread; outdated, resolved)

## Adding artifact metadata to the Queue

First, we add a `metadata` dictionary to the `S3ArtifactRequest` type. This is a dictionary to allow for flexibility of usage. The initial known keys would include

Does Taskcluster handle or proxy in some way this request to S3? Can Taskcluster itself add these metadata fields rather than having them be user provided?

@escapewindow (author):

Aiui, Taskcluster never sees the artifact.

https://github.com/taskcluster/taskcluster/blob/4441d319a79a657b392d6c903686f166f3636da1/ui/docs/reference/platform/queue/worker-interaction.mdx#uploading-artifacts

Upload:

  • worker calls queue.createArtifact to get an s3 putUrl (currently without sha metadata; RFC suggests adding sha metadata)
  • worker uploads to s3 via the putUrl

Download:

  • worker calls queue.getArtifact or buildSignedUrl, which redirects or points directly at S3.
  • (RFC suggests we also query for artifact metadata and verify, via a download tool)
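A rough sketch of the upload half with the proposed metadata, in Go; `requestPutURL` below is a hypothetical stand-in for the `queue.createArtifact` call, and error handling is abbreviated:

```go
package artifactsketch

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"os"
)

// requestPutURL stands in for queue.createArtifact: under the RFC it would
// accept the sha256 (and optionally the filesize) as artifact metadata and
// return a signed S3 putUrl. Placeholder only.
func requestPutURL(name string, metadata map[string]string) (string, error) {
	return "", fmt.Errorf("not implemented in this sketch")
}

// uploadArtifact hashes the file locally, records the hash with the Queue via
// the createArtifact request, then PUTs the bytes to the signed URL.
func uploadArtifact(name, path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)

	putURL, err := requestPutURL(name, map[string]string{
		"contentSha256":   hex.EncodeToString(sum[:]),
		"contentFilesize": fmt.Sprintf("%d", len(data)),
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodPut, putURL, bytes.NewReader(data))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("upload failed: %s", resp.Status)
	}
	return nil
}
```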

@escapewindow (author):

However, we've noted that if we don't trust the worker to generate trusted SHAs, we also can't trust it to generate trusted artifacts. So the worker is trusted; the SHAs mainly guard against artifacts being modified at rest.


Ok, sounds good.

```go
ContentSha512 string `json:"contentSha512"`
```

The sha256 field is required for Artifact Integrity. Mozilla Releng has use cases for all 3 fields, so I'm proposing all 3.

Why both ContentSha256 and ContentSha512?

@escapewindow (author):

The above line :)

We could do with just ContentSha256. However, when Releng creates our SHA256SUMS and SHA512SUMS files, we have to re-download the files and calculate the shas. Since we have the capability of adding more information here, I thought we might as well support both, and avoid the extra download and sha generation. This is more for efficiency than added security.

@escapewindow (author):

It occurs to me that generating 2 shas for all artifacts is likely less efficient than re-downloading and re-generating shas for a small subset of artifacts. We could just use ContentSha256 :)


Ok totally. Yeah, it just seems like a recipe for confusion having both.

@escapewindow (author):

Removed. Now that we've removed ContentSha512, we likely won't need ContentSha256Signature, and ContentFilesize is optional, so it does look more and more like we're making S3Artifacts closer to BlobArtifacts. We could flatten this dictionary if we don't foresee needing any more metadata, or keep it if we want the option.


This would improve robustness and security. By storing a SHA to verify on download, we can avoid corrupt downloads. By verifying the read-only SHA before we use the artifact, we can detect malicious tampering of artifacts at rest before we use it for a critical operation.

(This would obsolete the [artifacts section of the Chain of Trust artifact](https://scriptworker.readthedocs.io/en/latest/chain_of_trust.html#chain-of-trust-artifacts), and improve artifact integrity platform-wide.)

Is there a more detailed breakdown of how this provides all of the security features currently provided by CoT?

@escapewindow (author):

CoT is mainly three things:

  1. verifying that the artifacts which were uploaded are the same artifacts we're downloading,
  2. verifying that upstream workers are under our control, and
  3. verifying the task graph was generated from a trusted tree.

This RFC directly addresses the first point. We added the artifacts section of the CoT artifact to include the file path and sha256 of each artifact uploaded by each task. Because the CoT artifact itself wasn't covered, we added CoT artifact signatures, which kind of cover point 2 as well as help prove the CoT artifact itself wasn't modified at rest. But we only get a CoT artifact if we enable it in the task, and we only get a verifiable CoT signature on a limited set of workers. And currently only scriptworkers verify CoT.

By moving the artifact metadata into the platform, we have the capability of covering all task artifacts, not just the ones from level 3 trusted pools. And by moving artifact metadata verification out of scriptworker into a standalone download tool, we can use the same tool throughout the graph for full graph artifact integrity.

@escapewindow (author):

(If worker [re]registration ends up being trusted enough, this RFC + worker [re]registration could end up obsoleting points 1 and 2 of CoT, and CoT can focus on verifying the graph came from a trusted tree.)


Ok thank you for the detailed breakdown. I'd love to see this included in the RFC so that it can be referred back to later.

@escapewindow (author):

Added, though I didn't go as far into detail about worker [re]registration, to keep the RFC more focused on artifact metadata.


This tool queries the Queue for the artifact metadata, downloads the artifact, and verifies any shas. We should allow for optional and required metadata fields, and for failing out if any required information is missing, or if the sha doesn't match. We should be sure to measure the shas and filesizes on the right artifact state (e.g. combining a multipart artifact, not compressed unless the original artifact was compressed).

This tool should be usable as a commandline tool, or as a library that the workers can use.
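As an illustration of the verification step such a tool might perform (assuming the expected sha256 has already been fetched via the proposed `Queue.getArtifactInfo`; the function shape here is an assumption, not the tool's actual design):

```go
package artifactsketch

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadAndVerify streams the artifact from its (signed) URL to destPath,
// hashing it as it is written, and fails if the computed sha256 does not
// match the value recorded in the Queue's artifact metadata.
func downloadAndVerify(url, destPath, expectedSha256 string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("download failed: %s", resp.Status)
	}

	out, err := os.Create(destPath)
	if err != nil {
		return err
	}
	defer out.Close()

	h := sha256.New()
	// Write to disk and to the hash in one pass.
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		return err
	}

	got := hex.EncodeToString(h.Sum(nil))
	if got != expectedSha256 {
		return fmt.Errorf("sha256 mismatch: expected %s, got %s", expectedSha256, got)
	}
	return nil
}
```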

Will this also be included in Taskcluster Users workflows? i.e. QA downloading an artifact to test?

If so, how will that be "enforced" or "promoted"?

@escapewindow (author):

I think we would start with encouraging and recommending the tool, and making it easy to get and use. Once that starts happening, we could update user docs to mention the tool, and remove references to other ways to download things. But our main use case here is protecting the release pipeline: making sure no malicious artifact, worker, or task can result in us shipping something arbitrary. If we end up not enforcing manual use by humans, but those manual tasks never result in our shipping something we didn't expect to ship, we may be ok with that level of risk.


Ok, sounds good.

@ajvb left a comment:

LGTM

I'd like to see the detailed breakdown of how this replaces part of CoT (https://github.com/taskcluster/taskcluster-rfcs/pull/158/files#r454092603) within the RFC so that it is easy to refer back to later

@escapewindow (author):

> LGTM
>
> I'd like to see the detailed breakdown of how this replaces part of CoT (https://github.com/taskcluster/taskcluster-rfcs/pull/158/files#r454092603) within the RFC so that it is easy to refer back to later

How's e11edc2?


The sha256 field is required for Artifact Integrity. The filesize field is optional.

A future entry may be `ContentSha256WorkerSignature`, if we solve worker identity, though we may abandon this in favor of worker [re]registration.
Member:

What does "if we solve worker identity" mean?

@escapewindow (author):

This is referring to #157, which is looking less likely to happen.


This tool should be usable as a commandline tool, or as a library that the workers can use.

Once we implement worker signatures in artifact metadata, the download tool will verify those signatures as well.
Member:

What gets signed by the worker? The artifact content + metadata, or just the metadata, or something else?

@escapewindow (author):

It would sign the metadata. In this case, just the sha256 hash.
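Purely as a sketch of what that could look like if worker signatures ever land (the signature scheme, key handling, and function names here are all assumptions; ed25519 is just an example choice):

```go
package artifactsketch

import (
	"crypto/ed25519"
	"encoding/hex"
)

// signMetadata sketches a worker-side signature over the artifact metadata:
// the worker signs just the sha256 string with its private key, and a
// verifier checks it against the worker's published public key. How worker
// keys would be issued and published is not addressed here.
func signMetadata(priv ed25519.PrivateKey, contentSha256 string) string {
	return hex.EncodeToString(ed25519.Sign(priv, []byte(contentSha256)))
}

// verifyMetadata checks a hex-encoded signature produced by signMetadata.
func verifyMetadata(pub ed25519.PublicKey, contentSha256, hexSig string) bool {
	sig, err := hex.DecodeString(hexSig)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, []byte(contentSha256), sig)
}
```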

* Queue would have no API to write to this data
* The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)

## Providing a download tool
Member:

Should there not be an upload tool too, that calculates the hashes, and uploads them?
Would the tool support parallel multipart uploads?
What validates that the sha(s) are correct on upload?

@escapewindow (author):

> Should there not be an upload tool too, that calculates the hashes, and uploads them?

I was thinking this would be part of the worker. We could certainly provide a shared library or tool so that docker-worker, generic-worker, and scriptworker all use the same code, but I was thinking we would implement it in node, go, and python. I'm not sure we need upload capability outside the workers, since we're trying to verify artifacts uploaded by workers, and not grant easy arbitrary uploads to everyone.

> Would the tool support parallel multipart uploads?

The workers do, aiui.

> What validates that the sha(s) are correct on upload?

The worker is trusted, or we can't trust the artifact itself, let alone its sha.

Member:

What I really mean is: is there a validation step after the artifact upload that checks that the SHA(s) provided in the metadata match those of the uploaded file, or does this only happen when the artifact is downloaded?

The advantage I see of validating the SHA(s) post-upload, if possible, is that taskcluster can refuse to store artifacts that claim to have one SHA but actually have a different one. I'm not sure if S3 provides any mechanisms to support SHA validation without the entire artifact needing to be downloaded, but if it does (such as an HTTP header containing the SHA256/SHA512 in the HTTP PUT that uploads the artifact), it would be great to use it. If we would need to download the artifact, perhaps we could limit the check to random spot checks, or to artifacts that are explicitly required to be validated.

Currently we make a taskcluster API call (queue.createArtifact) to request to upload an artifact, then negotiate the artifact upload with S3, but do not perform a second taskcluster API call to confirm successful completion of the upload. I could see a solution whereby, after the initial createArtifact call, the queue considers the artifact upload to be in progress until a second API call confirms that it has completed. Addressing this would allow for smarter behaviour in the task inspector and other tools that display artifacts. It would even be reasonable for the second taskcluster API call (e.g. queue.reportArtifactUploadProgress) to have a payload with something like:

```json
{
  "uploadedBytes": 1349124,
  "totalBytes": 344134232
}
```

so that the endpoint could be called multiple times during artifact upload to report progress on the upload state. If the connection to S3 dropped during upload, and the upload had to start again, this could all be made visible from the task inspector etc.
This last suggestion is a little bit of scope creep, but if we are redesigning the artifact upload and download mechanics, it feels like a good time to review whether a second API call would be useful here for reporting progress.
And the first point above is really just about finding a means to ensure that taskcluster never claims to have an artifact with one SHA while actually storing an artifact with a different SHA (for example due to data getting corrupted in transit).

@escapewindow (author):

Hm. I could see that being useful for a) saving storage costs for invalid artifacts, b) more end-to-end reliability, c) marking the correct task as failed, and so on. (For (c), without post-upload-verification, the downstream tasks will fail when they try to verify their download; in certain graphs, this could be hundreds of downstream tasks failing. With post-upload-verification, the task that failed to upload properly will fail, which is more accurate.)

Those are all wins. However, aiui, S3 doesn't provide a way to verify checksums unless we downgrade to md5, so we would need to re-download and re-calculate the checksums. That may be costly enough in machine time that we need to determine whether this is worth it.

This RFC seems to be pointing at the question of why blob artifacts were deprecated, and whether it makes more sense to revive them; I'm leaning towards blocking this RFC on that decision.

@petemoore (Member):

> My recollection is that blob storage was fully implemented
>
> It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.

@djmitche can you please provide more substantive context on the unaddressed issues? To the best of my knowledge, the following two libraries were implemented and working:

https://github.com/taskcluster/taskcluster-lib-artifact
https://github.com/taskcluster/taskcluster-lib-artifact-go

They interfaced with the server-side implementation, which suggests the server-side implementation was also working. Certainly, it was reviewed and merged to master, and at the time John considered it to be feature complete. I was also successfully using it, and there was a suite of unit and integration tests that suggested it was working and feature complete.

The only missing piece I am aware of is that the workers were not updated to use the new artifact upload/download library. However it sounds like you have insight into issues that I am not aware of, so I am interested to understand those better, especially if addressing those issues is a smaller task than rewriting everything from scratch.

It feels like Aki is going to have a lot of work to do to reimplement everything, so some very specific details about missing functionality would be useful, and a link to the server-side implementation that you deleted may help simplify things for the reimplementation.

@escapewindow Hopefully the links above for the client side libraries and upload/download tools will save you some time and effort too.

@escapewindow (author):

+1. If blob artifacts work, can be rolled out everywhere for FirefoxCI, and we can provide guarantees that the hashes can't be altered post-upload, then it looks like that would be sufficient for my purposes.

@escapewindow (author):

As I understand it:

Current workers don't upload multipart and don't verify that uploads happened without issues. Outside of Chain of Trust and other non-platform-supported options, we don't verify download hashes either.

BlobArtifacts, if fully implemented and rolled out, would speed up uploads by supporting multipart uploading, which may result in run-time cost savings. We would verify uploads at upload time. We could mark the uploading task as failed or exception on upload failure, rather than moving bustage downstream to the downloading task(s). We could also obsolete this RFC, since BlobArtifacts require sha256 hashes for each artifact and provide a download tool.

However, we would need to spend some engineering effort to complete BlobArtifact support. This effort may be higher than adding extra information to the existing S3Artifact type. BlobArtifacts are S3-specific and would take significant engineering effort to support other cloud providers.

We may need to prioritize either BlobArtifacts or the object service. The object service would support multiple cloud platforms, provide cross-region and cross-cloud mirroring, and improve artifact maintenance and culling. These would result in storage and bandwidth cost savings.

@ccooper changed the base branch from master to main July 28, 2020 22:08
@escapewindow (author):

Should we merge this?

@djmitche (Contributor):

Yep, thanks for the bump.

@djmitche merged commit f765ac3 into taskcluster:main Oct 16, 2020