Private Task Logs #159
Conversation
I'd also love feedback from @escapewindow and @armenzg. Note the link to https://bugzilla.mozilla.org/show_bug.cgi?id=1598689 which has a bunch of additional background.
Also @ajvb. We should think through the security implications of allowing LiveLog artifacts to be overwritten.
Thanks for taking the time to think this through, and write it up. I think it is a good first stab, but could do with a couple of tweaks.
rfcs/0159-private-task-logs.md (outdated):

> The updated convention is, briefly, to make the artifact name containing logs configurable on a per-task basis, so that users may choose a `public/` or non-public name; and to introduce a new artifact type that will avoid the issues around authentication and redirects.
>
> The convention for task-related conventions is to put information in `task.extra`, using sub-objects to group related configuration.
> In the new design, the log artifact name is found at `task.extra.logs.taskLogArtifact`, with a default value of `public/logs/live.log` for compatibility.
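The lookup this convention implies can be sketched as follows. This is a hypothetical helper, not code from the RFC; only the field path `task.extra.logs.taskLogArtifact` and the default `public/logs/live.log` come from the excerpt above.

```python
# Sketch of the proposed convention: read the task log artifact name from
# task.extra.logs.taskLogArtifact, falling back to the compatible default.
# The helper name is illustrative, not part of any Taskcluster API.
DEFAULT_TASK_LOG = "public/logs/live.log"

def task_log_artifact(task: dict) -> str:
    return (
        task.get("extra", {})
            .get("logs", {})
            .get("taskLogArtifact", DEFAULT_TASK_LOG)
    )

# A task with no configuration keeps the current behavior:
assert task_log_artifact({}) == "public/logs/live.log"
# A task opting into a private (non-public/) log name:
assert task_log_artifact(
    {"extra": {"logs": {"taskLogArtifact": "logs/live.log"}}}
) == "logs/live.log"
```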
I think we can improve on this a little. The proposed design has a couple of disadvantages:

- `task.extra` is not part of the payload schema of workers, and they fetch the rest of their inputs from the `payload` section. The purpose of the `payload` section is to completely define what the worker should do, and they shouldn't be concerned with interpreting `task.extra`.
- Some tasks/workers may wish to support multiple log files, rather than a single log file (perhaps some worker/task may wish to generate separate standard error and standard out logs, or invoke multiple tasks in parallel, each having a dedicated log file), so we should avoid anything that assumes a single log will exist per task. Perhaps some tasks on some future worker won't want any log at all.

If instead, we consider this as artifact metadata, we can decorate the artifacts with information about their purpose or how they should be displayed in the task inspector, in the `task.extra` section, but do it in a way that is independent of how the worker interprets the task payload.

In other words, I see the correct approach as being to have properties in the payload that tell the worker what artifact names to use for its logs, and separately have information in `task.extra` that can be used by the task inspector to provide it with context on what the artifacts are, how they should be displayed, etc. This way, the two can be changed independently of one another.
Since generic-worker is a worker that can currently only create a live log and a backing log, we could add a `log` section to the payload, for example:

```yaml
payload:
  log:
    live: public/logs/live.log
    backing: public/logs/live_backing.log
```
Then for the task explorer, we can either label the artifacts with a type (much like a mime type) to hint how they should be displayed in the browser, or we can (for now) tell the task inspector which artifact is the primary log for displaying when you first visit the task. Depending on the approach, that could look like this:
```yaml
extra:
  artifact:
    public/logs/live.log:
      type: log
    public/logs/live_backing.log:
      type: log
```
or possibly like this:
```yaml
extra:
  taskinspector:
    log: public/logs/live.log
```
depending on the preferred approach.
I've never been a fan of us having both `public/logs/live.log` and `public/logs/live_backing.log`, as I find it confusing for users, who only ever care about seeing the actual log in its complete or incomplete state if the task is still running. I therefore agree with your suggestion of having a live log be updated to be a regular S3 artifact when the logging has completed, so that only a single artifact is needed. Indeed, we could indicate this in the generic-worker payload by simply not specifying a `backing` log but specifying only a `live` log. If neither are specified, backwards compatibility would be preserved by taking the former values as defaults, and the two artifacts would be created as before.

I would also be happy to adopt the convention, if we have only a single task log, that the log by default eventually becomes `public/log/task.log` rather than `public/log/live.log` or `public/log/live_backing.log` as we have now, as this better reflects what it actually is.
I see a similar approach being suitable for docker-worker / script-worker etc, of course.
Taking point 2 first, I think there's a strong sense among users of CI systems that there's a single "main" log that should be displayed by default. That doesn't preclude having other logs, but those can be named whatever the user would like and handled outside of the convention for the main logs (what this proposal calls "task logs"). For example, if there are both stderr.log and stdout.log, then the user would click on the one they want to see.
There's a related issue of how UIs can figure out which artifacts (aside from the main log) are logs and which are files to be downloaded, but I think that's out of scope for this RFC. The current approach (based on name, I think?) seems good enough.
To the first point, I understand the purity in the idea that the worker only looks at `task.payload`. But the practical downside to this is that the user must then coordinate two values in the same task JSON, with a failure to do so correctly resulting in no logs appearing. From a user perspective that doesn't make a lot of sense. It also means that the user has to figure out how to configure that differently for each worker.

Workers are pretty practical beasts -- they already look at various other bits of the task such as deadline, expires, and scopes, and they log lots of maybe-useful information automatically.

Perhaps a compromise would be allowing configuration of the task log artifact name in the payload, but defaulting that to `task.extra.logs.taskLogArtifact`, which in turn defaults to `public/logs/live.log`?
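The fallback chain in that compromise could look like this. All names here are illustrative (the payload key in particular would be defined per-worker); only the `task.extra` path and the `public/logs/live.log` default are from the discussion.

```python
# Sketch of the compromise: payload wins, then task.extra, then the
# backwards-compatible default. The "log" payload key is hypothetical;
# each worker would define its own payload schema for this.
DEFAULT_LIVE_LOG = "public/logs/live.log"

def resolve_log_name(task: dict) -> str:
    extra_default = (
        task.get("extra", {})
            .get("logs", {})
            .get("taskLogArtifact", DEFAULT_LIVE_LOG)
    )
    return task.get("payload", {}).get("log", extra_default)

# Payload overrides extra; extra overrides the default:
assert resolve_log_name({"payload": {"log": "logs/task.log"},
                         "extra": {"logs": {"taskLogArtifact": "e.log"}}}) == "logs/task.log"
assert resolve_log_name({"extra": {"logs": {"taskLogArtifact": "e.log"}}}) == "e.log"
assert resolve_log_name({}) == "public/logs/live.log"
```

With this ordering, UIs that only read `task.extra` and workers that only read `task.payload` disagree only when the user sets both, which is the coordination cost discussed above.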
It would be nice to have the default be `task.log`, but I think in the interests of compatibility we'll need to stick with `live.log`. The impact of logs suddenly "disappearing" for some group of users -- or the cost of carefully coordinating a migration to a new default -- would probably be worse than the impact of wondering why it's named `live.log`.
Defaulting to public will make migrating easier but will be a potentially surprising default later in time. I am thinking it would be better to ensure that current tasks we want to be public all update their task.extra before we land this change and default to private logs.
The nice thing about this is that the fallout if we make some tasks logs private by accident isn't that bad, we just use our scopes to go get the artifact for the interested party and update their config so that it Just Works in the future.
What's the advantage? Why would that be a surprising default? Are people surprised by that behavior in GitLab or Travis-CI?
Making logs private by default changes this from a relatively consequence-free change to a "contact everyone who uses Taskcluster and get them to spend their precious time migrating" project -- a substantially more difficult task, similar to a few dozen half-finished migrations we have lying around (using worker-runner everywhere, using static worker pools everywhere, taskQueueId, just to name a few).
I like the idea of there being a single log artefact, as this was slightly surprising to me when I initially started learning about TaskCluster. Is there any reason that TC currently doesn't allow replacing artefacts after creation?
Possibly not a concern for now, but could we keep the log location in `payload` if we had some means at task creation to pass arbitrary metadata, and for this task pass something which to begin with just updates the field in `extra`? (As an aside, arbitrary metadata would answer the download question since you could include 'Content-Disposition', which both S3 and GCS support setting on objects.)
> I like the idea of there being a single log artefact, as this was slightly surprising to me when I initially started learning about TaskCluster. Is there any reason that TC currently doesn't allow replacing artefacts after creation?
This is an issue of data integrity -- we want to ensure that an artifact containing, say, a release binary doesn't change after it's created. It's also part of the general immutable design of Taskcluster. The exception for changing Reference artifacts hasn't proven problematic, so I think we could expand on that with LiveLog.
I'm not sure I understand the second paragraph, @ricky26. A task, once created, is immutable (similar reasons), so there's no way to "update" anything. So `task.extra` and `task.payload` must both be fully formed at the call to `queue.createTask`.

That's a good point about content-disposition, and something we should think about in the Object Service (#155). I think we would want something visible in `listArtifacts` so that the UI could display logs and files differently, meaning that it's metadata in the queue rather than the target storage system. But for direct links to artifacts, content-disposition would definitely help the browser know what to do.
Aha, I didn't realise that replacing the reference assets was a special case! I follow the reasoning now, thanks for explaining.
> What's the advantage? Why would that be a surprising default? Are people surprised by that behavior in GitLab or Travis-CI?
I would be surprised if, as a relatively unfamiliar user who had been using Taskcluster in a project with private logs, I went to set up a new project and found I had to set this one subfield of a subfield in a task definition to keep my logs from being public.
rfcs/0159-private-task-logs.md (outdated):

> The convention for task-related conventions is to put information in `task.extra`, using sub-objects to group related configuration.
> In the new design, the log artifact name is found at `task.extra.logs.taskLogArtifact`, with a default value of `public/logs/live.log` for compatibility.
> Workers will be modified to consult this field before creating task log artifacts.
> UIs and other log consumers will be modified to consult this field before fetching task logs.
If that artifact we care about is under a `public` path, would we need authentication on Treeherder's backend?
No, and I'd argue that TH should not be downloading non-public artifacts but just providing users a link to the task, so the changes to TH should be minor.
> * Docker-Worker
> * Scriptworker
> * Implementation in other log consumers; known consumers:
>   * Treeherder (NOTE: Treeherder currently fetches `public/logs/live_backing.log` directly)
What timeline are you looking at to getting to this part?
What changes will be needed?
Timeline isn't decided yet -- it will depend where this overall project falls in the TC team's priorities, and whether folks outside of the team can work on it. I suspect the TH changes would be relatively minor, too.
I think the changes would be:

- look for the artifact name in task.extra, defaulting to `public/tasks/live.log`
- stop referring directly to `live_backing.log`

The second step may need to be timed to be close to deployment of the LiveLog artifact type, as otherwise it means slower log processing as the requests follow a redirect from `live.log` to `live_backing.log`.
> However, its URL cannot be modified, but it can be replaced with an artifact of another type.
>
> Workers create a LiveLog artifact when a task begins, and replace that with a "normal" artifact (such as S3) when the task completes.
> The URL in the LiveLog artifact should contain enough entropy that it cannot be guessed, preventing anyone who cannot read the LiveLog artifact from connecting to the worker to download logs.
Reading the "New Log Implementation" section (and this last paragraph in particular), it is unclear to me what the protections for this new "non-public" task log will be. Will a user be required to have the needed scopes to fetch these logs? How will that be implemented?
Yes, and that's already implemented -- a caller needs `queue:get-artifact:<name>` to retrieve an artifact, except when the artifact name begins with `public/`, in which case it's public :)
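That access rule can be sketched as follows. This is a simplified model, not the queue's actual implementation; it does assume Taskcluster's real trailing-`*` wildcard behavior, where a scope ending in `*` satisfies any required scope sharing that prefix.

```python
# Sketch of the artifact access rule described above: artifacts under
# public/ are world-readable; anything else requires the scope
# queue:get-artifact:<name>. Simplified model of the queue's check.
def can_read_artifact(caller_scopes: set[str], name: str) -> bool:
    if name.startswith("public/"):
        return True
    required = f"queue:get-artifact:{name}"
    for scope in caller_scopes:
        if scope == required:
            return True
        # Taskcluster scopes ending in "*" satisfy any scope with that prefix.
        if scope.endswith("*") and required.startswith(scope[:-1]):
            return True
    return False

assert can_read_artifact(set(), "public/logs/live.log")        # public: anyone
assert not can_read_artifact(set(), "logs/live.log")           # private: denied
assert can_read_artifact({"queue:get-artifact:logs/*"}, "logs/live.log")
```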
Ok totally, then why is "The URL in the LiveLog artifact should contain enough entropy that it cannot be guessed, preventing anyone who cannot read the LiveLog artifact from connecting to the worker to download logs." a thing? Is the problem here that a user who has scopes to interact with a worker directly but not download logs could just connect to the worker directly and download them?
Currently, the URL for a live-log points directly (or via websocktunnel) to the worker, and the protection against anyone reading that data is that a portion of the URL contains a random value that we presume can't be guessed (it's a slugid, so ~128 bits from a CSPRNG). There's also a limited time window during which that data is available (the time during which the task is executing). The `queue:get-artifact:<name>` scope would prevent unauthorized access to the URL.
It's not hard to imagine ways to do that better -- workers accepting TC credentials, websocktunnel doing authentication, etc. -- but it would be harder to implement those ideas. I'd like to leave them out-of-scope for this RFC, if possible.
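For context on the entropy claim, a slugid is a URL-safe base64 encoding of a random v4 UUID, which carries 122 CSPRNG bits (rounded to "~128" above). A minimal sketch of generating such a token, modeled on the slugid scheme rather than using the actual `slugid` library:

```python
import base64
import secrets
import uuid

def slug() -> str:
    """Generate a slugid-style token: a v4 UUID (122 random bits from a
    CSPRNG) encoded as 22 chars of padding-free URL-safe base64."""
    raw = uuid.UUID(bytes=secrets.token_bytes(16), version=4).bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

token = slug()
assert len(token) == 22
assert all(c.isalnum() or c in "-_" for c in token)
```

At 122 random bits, online guessing of a live-log URL during a task's lifetime is not a practical attack; the concern raised below is about the URL leaking or outliving the task, not brute force.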
@djmitche Could we schedule some time to chat about this over zoom? I feel like I may be missing some context. If we are talking about utilizing private logs for things like critical security bugs logs and the only protection is a random value in the URL, that sounds like a big incident just waiting to happen. It just takes one bug that causes the URL to stay available (or something else) and we've got a big 'ol incident on our hands.
We can. But basically that's what I'm proposing. Maybe for such cases we would disable live logging -- the access control for the log once the task is complete does not rely on a secret URL.
It looks like you're getting rid of … Some thoughts:

Overall, 👍
Producing logs using this new convention is probably a good idea. I'm not sure if scriptworker will need to make changes in its verification process to use this new approach.
Those won't be known until the end of the task in many cases (e.g., if uploading a directory with artifact prefix
I don't know what those security implications were, but this RFC doesn't change them. That seems a fine approach.
I think the "final" log would need to be uploaded after all other artifacts are uploaded. I don't think that's a new issue in this RFC, but bears mentioning.
@ajvb and I talked for a bit, about (a) live logs still being insecure (obscure URL is not sufficient) and (b) default value for artifact name.

For (a), I'll update the proposal to disable live-logging when a non-public log artifact is used. This avoids anyone making a careful decision as to whether they trust the obscure URL -- the answer is just "no", you can't use live-logs if you want private logs. We can later figure out a way to authenticate live logging -- whether that be by the workers accepting Hawk credentials, websocktunnel accepting Hawk credentials, or some other method -- and once that's in place we can allow live-logging with private logs.

For (b), we went a little further afield. @ajvb noted that it's weird that we use the name of the artifact to determine visibility, rather than having a "public" property of each artifact. Indeed, that is weird. @ricky26 suggested in chat last week that a good way to get to private-by-default may be to create a new role, maybe called …

Putting those two together, we could eventually get to a point where a deployment can be "locked down" so that specifically logs can be private by default, by not adding …

Until then, private logs are a bit of a "boutique feature" and need to be handled with caution.

All in all, this lets us keep this RFC very narrowly scoped without blocking making things private by default in the future and without risking disclosure of live logs.
And just to set expectations -- once this is decided it will be a while until we can get development time on it. But it does divide nicely into smaller pieces and some of those pieces might be practical tasks for folks outside the team.
> @ajvb and I talked for a bit, about (a) live logs still being insecure (obscure URL is not sufficient) and (b) default value for artifact name.
>
> For (a), I'll update the proposal to disable live-logging when a non-public log artifact is used. This avoids anyone making a careful decision as to whether they trust the obscure URL -- the answer is just "no", you can't use live-logs if you want private logs.
Private backing logs can be trivially achieved by redirecting output to a file. The motivation for a Private Logs RFC is to have streamable private logs. For example, the (Mozilla) Fuzzing team have tasks that last several hours, which they cannot monitor in real time without private log streaming. For me, if streaming logs are not part of the RFC, this RFC has little value, since private backing logs are already trivial to implement.
The motivation for this RFC is to have task logs that are private, that UIs and log consumers can identify as the main log. So it sounds like this particular RFC doesn't address the use-case to which you're referring, but doesn't preclude it, either. We certainly don't have bandwidth on the team to address the larger scope. So I guess the question is, given that this is of minimal scope and seemingly not useful for fuzzing, is it useful at all?
@jschwartzentruber what do you think?

I think the ideal approach would be that we have a streaming artifact type. Then there would be no reason for us to limit this to a single artifact, or to being a log file. The task inspector could preview e.g. up to a limit of 3 streaming artifacts, vertically split, or give you the option to choose which streaming artifacts you wish to preview. The streaming artifacts could serve as a pipe between tasks (see #121 (comment) for a little background on this idea - section "Example future use case"). I think this is a bigger win than implementing something that assumes tasks have a single streaming artifact that happens to be a log, and that all users should see that one streaming artifact by default. But of course, that is a bigger piece of work.

So my preference is that we don't add a private log feature if you can't stream it, since that feature already exists by redirecting output to a file, and adding another feature to do the same would complicate matters by having multiple ways to do the same thing.
Thanks for the comments. So, I think the useful forms of this idea are out of reach for the team and its current workload -- so far out of reach that it's not a great use of time for us to design a solution at this point.
Sorry for the late reply, but I agree both that private logs should be streaming to be useful, and that a solution isn't necessary if it's too much work. It would be nice to have this in TC, but our tasks are logging to stackdriver in GCP, which wasn't difficult to set up.