Private Task Logs #159
Conversation
I'd also love feedback from @escapewindow and @armenzg. Note the link to https://bugzilla.mozilla.org/show_bug.cgi?id=1598689 which has a bunch of additional background.
Also @ajvb. We should think through the security implications of allowing LiveLog artifacts to be overwritten.
Thanks for taking the time to think this through, and write it up. I think it is a good first stab, but could do with a couple of tweaks.
rfcs/0159-private-task-logs.md (outdated):

> The updated convention is, briefly, to make the artifact name containing logs configurable on a per-task basis, so that users may choose a `public/` or non-public name; and to introduce a new artifact type that will avoid the issues around authentication and redirects.
>
> The convention for task-related conventions is to put information in `task.extra`, using sub-objects to group related configuration.
> In the new design, the log artifact name is found at `task.extra.logs.taskLogArtifact`, with a default value of `public/logs/live.log` for compatibility.
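The lookup this convention implies can be sketched as follows. This is a hypothetical helper, not code from the RFC; only the field path `task.extra.logs.taskLogArtifact` and the default `public/logs/live.log` come from the excerpt above.

```python
# Sketch of the proposed convention: read the task log artifact name from
# task.extra.logs.taskLogArtifact, falling back to the compatible default.
# The helper name is illustrative, not part of any Taskcluster API.
DEFAULT_TASK_LOG = "public/logs/live.log"

def task_log_artifact(task: dict) -> str:
    return (
        task.get("extra", {})
            .get("logs", {})
            .get("taskLogArtifact", DEFAULT_TASK_LOG)
    )

# A task with no configuration keeps the current behavior:
assert task_log_artifact({}) == "public/logs/live.log"
# A task opting into a private (non-public/) log name:
assert task_log_artifact(
    {"extra": {"logs": {"taskLogArtifact": "logs/live.log"}}}
) == "logs/live.log"
```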
I think we can improve on this a little. The proposed design has a couple of disadvantages:

- `task.extra` is not part of the payload schema of workers, and they fetch the rest of their inputs from the `payload` section. The purpose of the `payload` section is to completely define what the worker should do, and they shouldn't be concerned with interpreting `task.extra`.
- Some tasks/workers may wish to support multiple log files, rather than a single log file (perhaps some worker/task may wish to generate separate standard error and standard out logs, or invoke multiple tasks in parallel, each having a dedicated log file), so we should avoid anything that assumes a single log will exist per task. Perhaps some tasks on some future worker won't want any log at all.

If instead, we consider this as artifact metadata, we can decorate the artifacts with information about their purpose or how they should be displayed in the task inspector, in the `task.extra` section, but do it in a way that is independent of how the worker interprets the task payload.

In other words, I see the correct approach as being to have properties in the payload that tell the worker what artifact names to use for its logs, and separately have information in `task.extra` that can be used by the task inspector to provide it with context on what the artifacts are, how they should be displayed, etc. This way, the two can be changed independently of one another.
Since generic-worker is a worker that can currently only create a live log and a backing log, we could add a `log` section to the payload, for example:

```yaml
payload:
  log:
    live: public/logs/live.log
    backing: public/logs/live_backing.log
```
Then for the task explorer, we can either label the artifacts with a type (much like a mime type) to hint how they should be displayed in the browser, or we can (for now) tell the task inspector which artifact is the primary log for displaying when you first visit the task. Depending on the approach, that could look like this:
```yaml
extra:
  artifact:
    public/logs/live.log:
      type: log
    public/logs/live_backing.log:
      type: log
```
or possibly like this:
```yaml
extra:
  taskinspector:
    log: public/logs/live.log
```
depending on the preferred approach.
I've never been a fan of us having both `public/logs/live.log` and `public/logs/live_backing.log`, as I find it confusing for users, who only ever care about seeing the actual log in its complete or incomplete state if the task is still running. I therefore agree with your suggestion of having a live log be updated to be a regular S3 artifact when the logging has completed, so that only a single artifact is needed. Indeed, we could indicate this in the generic-worker payload by simply not specifying a `backing` log but specifying only a `live` log. If neither are specified, backwards compatibility would be preserved by taking the former values as defaults, and the two artifacts would be created as before.

I would also be happy to adopt the convention, if we have only a single task log, that the log by default eventually becomes `public/log/task.log` rather than `public/log/live.log` or `public/log/live_backing.log` as we have now, as this better reflects what it actually is.
I see a similar approach being suitable for docker-worker / script-worker etc, of course.
Taking point 2 first, I think there's a strong sense among users of CI systems that there's a single "main" log that should be displayed by default. That doesn't preclude having other logs, but those can be named whatever the user would like and handled outside of the convention for the main logs (what this proposal calls "task logs"). For example, if there are both stderr.log and stdout.log, then the user would click on the one they want to see.
There's a related issue of how UIs can figure out which artifacts (aside from the main log) are logs and which are files to be downloaded, but I think that's out of scope for this RFC. The current approach (based on name, I think?) seems good enough.
To the first point, I understand the purity in the idea that the worker only looks at `task.payload`. But the practical downside to this is that the user must then coordinate two values in the same task JSON, with a failure to do so correctly resulting in no logs appearing. From a user perspective that doesn't make a lot of sense. It also means that the user has to figure out how to configure that differently for each worker.

Workers are pretty practical beasts -- they already look at various other bits of the task such as deadline, expires, and scopes, and they log lots of maybe-useful information automatically.

Perhaps a compromise would be allowing configuration of the task log artifact name in the payload, but defaulting that to `task.extra.logs.taskLogArtifact`, which in turn defaults to `public/logs/live.log`?
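The fallback chain in that compromise could look like this. All names here are illustrative (the payload key in particular would be defined per-worker); only the `task.extra` path and the `public/logs/live.log` default are from the discussion.

```python
# Sketch of the compromise: payload wins, then task.extra, then the
# backwards-compatible default. The "log" payload key is hypothetical;
# each worker would define its own payload schema for this.
DEFAULT_LIVE_LOG = "public/logs/live.log"

def resolve_log_name(task: dict) -> str:
    extra_default = (
        task.get("extra", {})
            .get("logs", {})
            .get("taskLogArtifact", DEFAULT_LIVE_LOG)
    )
    return task.get("payload", {}).get("log", extra_default)

# Payload overrides extra; extra overrides the default:
assert resolve_log_name({"payload": {"log": "logs/task.log"},
                         "extra": {"logs": {"taskLogArtifact": "e.log"}}}) == "logs/task.log"
assert resolve_log_name({"extra": {"logs": {"taskLogArtifact": "e.log"}}}) == "e.log"
assert resolve_log_name({}) == "public/logs/live.log"
```

With this ordering, UIs that only read `task.extra` and workers that only read `task.payload` disagree only when the user sets both, which is the coordination cost discussed above.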
It would be nice to have the default be `task.log`, but I think in the interests of compatibility we'll need to stick with `live.log`. The impact of logs suddenly "disappearing" for some group of users -- or the cost of carefully coordinating a migration to a new default -- would probably be worse than the impact of wondering why it's named `live.log`.
Defaulting to public will make migrating easier but will be a potentially surprising default later in time. I am thinking it would be better to ensure that current tasks we want to be public all update their task.extra before we land this change and default to private logs.
The nice thing about this is that the fallout if we make some tasks logs private by accident isn't that bad, we just use our scopes to go get the artifact for the interested party and update their config so that it Just Works in the future.
What's the advantage? Why would that be a surprising default? Are people surprised by that behavior in GitLab or Travis-CI?
Making logs private by default changes this from a relatively consequence-free change to a "contact everyone who uses Taskcluster and get them to spend their precious time migrating" project -- a substantially more difficult task, similar to a few dozen half-finished migrations we have lying around (using worker-runner everywhere, using static worker pools everywhere, taskQueueId, just to name a few).
I like the idea of there being a single log artefact, as this was slightly surprising to me when I initially started learning about TaskCluster. Is there any reason that TC currently doesn't allow replacing artefacts after creation?
Possibly not a concern for now, but could we keep the log location in `payload` if we had some means at task creation to pass arbitrary metadata, and for this task pass something which to begin with just updates the field in `extra`? (As an aside, arbitrary metadata would answer the download question since you could include 'Content-Disposition', which both S3 and GCS support setting on objects.)
> I like the idea of there being a single log artefact, as this was slightly surprising to me when I initially started learning about TaskCluster. Is there any reason that TC currently doesn't allow replacing artefacts after creation?
This is an issue of data integrity -- we want to ensure that an artifact containing, say, a release binary doesn't change after it's created. It's also part of the general immutable design of Taskcluster. The exception for changing Reference artifacts hasn't proven problematic, so I think we could expand on that with LiveLog.
I'm not sure I understand the second paragraph, @ricky26. A task, once created, is immutable (similar reasons), so there's no way to "update" anything. So `task.extra` and `task.payload` must both be fully formed at the call to `queue.createTask`.

That's a good point about content-disposition, and something we should think about in the Object Service (#155). I think we would want something visible in `listArtifacts` so that the UI could display logs and files differently, meaning that it's metadata in the queue rather than the target storage system. But for direct links to artifacts, content-disposition would definitely help the browser know what to do.
Aha, I didn't realise that replacing the reference assets was a special case! I follow the reasoning now, thanks for explaining.
> What's the advantage? Why would that be a surprising default? Are people surprised by that behavior in GitLab or Travis-CI?
I would be surprised if, as a relatively unfamiliar user who had been using Taskcluster in a project with private logs, I went to set up a new project and found I had to set this one subfield of a subfield in a task definition to keep my logs from being public.
rfcs/0159-private-task-logs.md (outdated):

> The convention for task-related conventions is to put information in `task.extra`, using sub-objects to group related configuration.
> In the new design, the log artifact name is found at `task.extra.logs.taskLogArtifact`, with a default value of `public/logs/live.log` for compatibility.
> Workers will be modified to consult this field before creating task log artifacts.
> UIs and other log consumers will be modified to consult this field before fetching task logs.
If that artifact we care about is under a `public` path, would we need authentication on Treeherder's backend?
No, and I'd argue that TH should not be downloading non-public artifacts but just providing users a link to the task, so the changes to TH should be minor.
> * Docker-Worker
> * Scriptworker
> * Implementation in other log consumers; known consumers:
>   * Treeherder (NOTE: Treeherder currently fetches `public/logs/live_backing.log` directly)
What timeline are you looking at to getting to this part?
What changes will be needed?
Timeline isn't decided yet -- it will depend where this overall project falls in the TC team's priorities, and whether folks outside of the team can work on it. I suspect the TH changes would be relatively minor, too.
I think the changes would be:

- look for the artifact name in task.extra, defaulting to `public/tasks/live.log`
- stop referring directly to `live_backing.log`

The second step may need to be timed to be close to deployment of the LiveLog artifact type, as otherwise it means slower log processing as the requests follow a redirect from `live.log` to `live_backing.log`.
> However, its URL cannot be modified, but it can be replaced with an artifact of another type.
>
> Workers create a LiveLog artifact when a task begins, and replace that with a "normal" artifact (such as S3) when the task completes.
> The URL in the LiveLog artifact should contain enough entropy that it cannot be guessed, preventing anyone who cannot read the LiveLog artifact from connecting to the worker to download logs.
Reading the "New Log Implementation" section (and this last paragraph in particular), it is unclear to me what the protections for this new "non-public" task log will be. Will a user be required to have the needed scopes to fetch these logs? How will that be implemented?
Yes, and that's already implemented -- a caller needs `queue:get-artifact:<name>` to retrieve an artifact, except when the artifact name begins with `public/`, in which case it's public :)
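That access rule can be sketched as follows. This is a simplified model, not the queue's actual implementation; it does assume Taskcluster's real trailing-`*` wildcard behavior, where a scope ending in `*` satisfies any required scope sharing that prefix.

```python
# Sketch of the artifact access rule described above: artifacts under
# public/ are world-readable; anything else requires the scope
# queue:get-artifact:<name>. Simplified model of the queue's check.
def can_read_artifact(caller_scopes: set[str], name: str) -> bool:
    if name.startswith("public/"):
        return True
    required = f"queue:get-artifact:{name}"
    for scope in caller_scopes:
        if scope == required:
            return True
        # Taskcluster scopes ending in "*" satisfy any scope with that prefix.
        if scope.endswith("*") and required.startswith(scope[:-1]):
            return True
    return False

assert can_read_artifact(set(), "public/logs/live.log")        # public: anyone
assert not can_read_artifact(set(), "logs/live.log")           # private: denied
assert can_read_artifact({"queue:get-artifact:logs/*"}, "logs/live.log")
```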
Ok totally, then why is "The URL in the LiveLog artifact should contain enough entropy that it cannot be guessed, preventing anyone who cannot read the LiveLog artifact from connecting to the worker to download logs." a thing? Is the problem here that a user who has scopes to interact with a worker directly but not download logs could just connect to the worker directly and download them?
Currently, the URL for a live-log points directly (or via websocktunnel) to the worker, and the protection against anyone reading that data is that a portion of the URL contains a random value that we presume can't be guessed (it's a slugid, so ~128 bits from a CSPRNG). There's also a limited time window during which that data is available (the time during which the task is executing). The `queue:get-artifact:<name>` scope would prevent unauthorized access to the URL.
It's not hard to imagine ways to do that better -- workers accepting TC credentials, websocktunnel doing authentication, etc. -- but it would be harder to implement those ideas. I'd like to leave them out-of-scope for this RFC, if possible.
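For context on the entropy claim, a slugid is a URL-safe base64 encoding of a random v4 UUID, which carries 122 CSPRNG bits (rounded to "~128" above). A minimal sketch of generating such a token, modeled on the slugid scheme rather than using the actual `slugid` library:

```python
import base64
import secrets
import uuid

def slug() -> str:
    """Generate a slugid-style token: a v4 UUID (122 random bits from a
    CSPRNG) encoded as 22 chars of padding-free URL-safe base64."""
    raw = uuid.UUID(bytes=secrets.token_bytes(16), version=4).bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

token = slug()
assert len(token) == 22
assert all(c.isalnum() or c in "-_" for c in token)
```

At 122 random bits, online guessing of a live-log URL during a task's lifetime is not a practical attack; the concern raised below is about the URL leaking or outliving the task, not brute force.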
@djmitche Could we schedule some time to chat about this over zoom? I feel like I may be missing some context. If we are talking about utilizing private logs for things like critical security bugs logs and the only protection is a random value in the URL, that sounds like a big incident just waiting to happen. It just takes one bug that causes the URL to stay available (or something else) and we've got a big 'ol incident on our hands.
We can. But basically that's what I'm proposing. Maybe for such cases we would disable live logging -- the access control for the log once the task is complete does not rely on a secret URL.
It looks like you're getting rid of … Some thoughts:

Overall, 👍
Producing logs using this new convention is probably a good idea. I'm not sure if scriptworker will need to make changes in its verification process to use this new approach.
Those won't be known until the end of the task in many cases (e.g., if uploading a directory with artifact prefix
I don't know what those security implications were, but this RFC doesn't change them. That seems a fine approach.
I think the "final" log would need to be uploaded after all other artifacts are uploaded. I don't think that's a new issue in this RFC, but bears mentioning.
@ajvb and I talked for a bit, about (a) live logs still being insecure (obscure URL is not sufficient) and (b) default value for artifact name.

For (a), I'll update the proposal to disable live-logging when a non-public log artifact is used. This avoids anyone making a careful decision as to whether they trust the obscure URL -- the answer is just "no", you can't use live-logs if you want private logs. We can later figure out a way to authenticate live logging -- whether that be by the workers accepting Hawk credentials, websocktunnel accepting Hawk credentials, or some other method -- and once that's in place we can allow live-logging with private logs.

For (b), we went a little further afield. @ajvb noted that it's weird that we use the name of the artifact to determine visibility, rather than having a "public" property of each artifact. Indeed, that is weird. @ricky26 suggested in chat last week that a good way to get to private-by-default may be to create a new role, maybe called …

Putting those two together, we could eventually get to a point where a deployment can be "locked down" so that specifically logs can be private by default, by not adding …

Until then, private logs are a bit of a "boutique feature" and need to be handled with caution.

All in all, this lets us keep this RFC very narrowly scoped without blocking making things private by default in the future and without risking disclosure of live logs.
And just to set expectations -- once this is decided it will be a while until we can get development time on it. But it does divide nicely into smaller pieces and some of those pieces might be practical tasks for folks outside the team.
> @ajvb and I talked for a bit, about (a) live logs still being insecure (obscure URL is not sufficient) and (b) default value for artifact name.
>
> For (a), I'll update the proposal to disable live-logging when a non-public log artifact is used. This avoids anyone making a careful decision as to whether they trust the obscure URL -- the answer is just "no", you can't use live-logs if you want private logs.
Private backing logs can be trivially achieved by redirecting output to a file. The motivation for a Private Logs RFC is to have streamable private logs. For example, the (Mozilla) Fuzzing team have tasks that last several hours, which they cannot monitor in real time without private log streaming. For me, if streaming logs are not part of the RFC, this RFC has little value, since private backing logs are already trivial to implement.
The motivation for this RFC is to have task logs that are private, that UIs and log consumers can identify as the main log. So it sounds like this particular RFC doesn't address the use-case to which you're referring, but doesn't preclude it, either. We certainly don't have bandwidth on the team to address the larger scope. So I guess the question is, given that this is of minimal scope and seemingly not useful for fuzzing, is it useful at all?
@jschwartzentruber what do you think?

I think the ideal approach would be that we have a streaming artifact type. Then there would be no reason for us to limit this to a single artifact, or to being a log file. The task inspector could preview e.g. up to a limit of 3 streaming artifacts, vertically split, or give you the option to choose which streaming artifacts you wish to preview. The streaming artifacts could serve as a pipe between tasks (see #121 (comment) for a little background on this idea - section "Example future use case"). I think this is a bigger win than implementing something that assumes tasks have a single streaming artifact that happens to be a log, and that all users should see that one streaming artifact by default. But of course, that is a bigger piece of work.

So my preference is that we don't add a private log feature if you can't stream it, since that feature already exists by redirecting output to a file, and adding another feature to do the same would complicate matters by having multiple ways to do the same thing.
Thanks for the comments. So, I think the useful forms of this idea are out of reach for the team and its current workload -- so far out of reach that it's not a great use of time for us to design a solution at this point.
Sorry for the late reply, but I agree both that private logs should be streaming to be useful, and that a solution isn't necessary if it's too much work. It would be nice to have this in TC, but our tasks are logging to stackdriver in GCP, which wasn't difficult to set up.