Worker Identity and the Worker Key #157
Comments
@taskcluster/services-reviewers, let me know if you have questions or comments.
I think the join-until-artifact-expiration could be better accomplished by just setting an "expiration" value per workerPool, and setting that to 1 year for level-3 workers (or whatever the maximum artifact lifetime we want is). That saves a join and simplifies the model a bit. To allow revocation, we could store keys in a separate table and the …
I think this works, as long as the 1-year expiry is counted from when the most recent task run on that worker completed... otherwise there will be some window where the artifact exists but the worker doesn't. If the worker lives less than 1 day, this may not be a big deal. If a worker lives for months (e.g. hardware), we may have issues unless we refresh the key or rotate workerIds regularly. We may still want to pad this.
This made me realize that there's a period where the key is valid to sign new artifacts (the lifespan of the worker), and a period where the key is retrievable to verify signatures, but shouldn't be able to sign any new artifacts (the period between the worker going away and the final artifact expiring). I'm not sure how much we should address this: maybe a …
Could the workers generate the key and pass only the public key to the manager? That would prevent the manager from having access to sensitive key material. Optional: could we leverage cloud features and use KMSs to hold those keys? That would remove the need to store & operate keys on the workers themselves, and would move the security control to the cloud provider instead. (Caveat: KMSs may not support signing operations.)
This is possible, yes. The upside is the private key would never be transported over the wire or known by the manager, plus we don't run the risk of running low on entropy if we generate a large number of keypairs (not sure if this is as large a concern in newer crypto than, say, gpg). There is the potential for reusing keys or having some weaker algorithm on the workers, but we can address this with, say, worker runner, which can guarantee a specific version of the worker is installed. So yes, let's go with the generate-key-on-worker model.
Dustin pointed out that with the generate-key-on-worker model, this is an implementation detail. The cloud worker instance could potentially get the public key from the KMS, and submit that to the worker manager. We'd need to research the KMSs to a) make sure they support signing, and b) find out which signing algorithms they support, because that may influence our decision about what flavor of signing we use in general. I'm under the impression that KMSs are only an option for cloud instances, and we'll still have to support key generation on the worker for hardware workers, so if we go this route, we'll need to support a hybrid approach.
Do we build artifacts on hardware workers? I genuinely don't know.
Yes, we have PGO profiles we generate on hardware, which we download and use to build release builds. I suppose we could determine whether these are low-risk enough to not need worker keys. |
More points from discussion Wednesday:
Hm. This issue covers 1) taskcluster-provisioned cloud instances, and 2) hardware workers. We have a third type of worker we'll need to cover in the firefoxci cluster: scriptworkers. The mac signers are hardware, so they could follow the pattern for (2). All other scriptworkers are currently docker containers running in k8s. If we're able to handle those in the cloud-provisioned solution, great. Otherwise we may need to use the hardware solution for them, or think of a third way.
I suspect that the worker side of this functionality would be implemented in worker-runner, so it would "just work" for anything that uses the "static" provider. Depending on how dynamic that k8s deployment is, that might be easy or hard :) |
(This is related to #156, but probably needs a few more questions answered.)
I can open an RFC once we have an initial consensus.
The goal is to provide an Artifact Integrity guarantee that a given artifact was generated by a worker under our control.
In this model, the worker manager will provide a key for each provisioned worker.
Keypair
We've gone back and forth between PKI and no PKI. In the PKI model, we would have an intermediate cert on the Worker Manager, and sign the worker cert with it. We would trust the root cert and verify signatures through the chain of trust. This brings up questions around key rotation and revocation that we should address if we go this route.
In the non-PKI model, we could generate a small unique keypair, possibly ed25519, per worker instance. As long as the public key is associated with the worker on the Worker Manager, we can verify its signatures. This means we'll need to keep the worker information in Worker Manager as long as we need to verify its artifacts. We also need to decide if we generate the keypair in the Worker Manager and send the private key to the worker, or if we generate the keypair on the worker and send the public key to the Worker Manager.
This is the "Worker Key". We're currently assuming we're going with the non-PKI model.
Cloud provisioned workers
As I understand it, cloud-provisioned workers have an identity document from the cloud provider. Once the worker identity is verified, we can store the public key with the rest of the worker information. If the key generation happened on the Worker Manager, we can pass the private key down to the worker.
Hardware workers
Security here will be colo- and subnet-based. We need some way to add a keypair to the hardware workers, and to get the public key into Worker Manager.
Key rotation / reused workerIds
We can generate a new key for every cloud instance, especially if they're short-lived. If we reuse cloud `workerId`s, we need to be able to either return a set of valid public keys, or perhaps add the datetime the artifact was created to the public key request. We may also want to be able to rotate keys on a hardware worker without changing its `workerId`.
Public Key query endpoint
For the non-PKI solution, the Worker Manager will keep track of each worker's public key(s), and either return the set of valid public keys for a given workerId, or the valid public key for a given datetime.
Preserve important worker history until artifact expiration
For the non-PKI solution, the Worker Manager will need to keep track of the important (read: level 3) workers until their artifacts expire. Likely we'll need to specify which worker pools are "important" in configs, and we'll need a `join` in postgres to find the latest-expiring artifacts uploaded by this `workerId`.
Artifact content signature
The ContentSha256 of an artifact guarantees that the artifact has not been modified between artifact upload and artifact download. By signing this ContentSha256 with the Worker Key, we also show that the artifact was uploaded by a worker under our control.