Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#102 - Add overhead value to TestTask #243

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

worldomonation
Copy link
Contributor

@worldomonation worldomonation commented Jul 16, 2020

This implements the TestTask.overhead property that returns a float value representing non-group time.

Example Usage:

>>> import mozci
>>> from mozci.push import Push
>>> push = Push('b32bf0e2fa08e29748da69ac64fbe41659ad372c', branch='autoland')
>>> task = push.tasks[3]
>>> task.overhead
447.7479989528656

Or..

>>> for label, overhead in [(task.label, task.overhead) for task in push.tasks if type(task) == TestTask]:
...     print(label, overhead)
...
test-macosx1014-64/debug-mochitest-browser-chrome-e10s-4 141.60400009155273
test-linux1804-64/debug-mochitest-browser-chrome-fis-e10s-7 117.6800000667572
test-windows7-32/debug-web-platform-tests-reftest-e10s-2 581.414999961853
test-macosx1014-64/debug-xpcshell-e10s-2 110.23099994659424
test-macosx1014-64-shippable/opt-mochitest-devtools-chrome-e10s-3 118.6470000743866
test-linux1804-64-asan/opt-mochitest-browser-chrome-e10s-13 376.37199997901917
test-windows7-32/debug-mochitest-browser-chrome-e10s-3 293.8619999885559
test-linux1804-64-asan/opt-mochitest-browser-chrome-e10s-13 124.0220000743866

@worldomonation worldomonation changed the title Changes: #102 - Add overhead value to TestTask Jul 16, 2020
	- implement the TestTask.overhead property that returns a float value representing non-group time.
	- add new query to obtain start/end times for the group and task.
from: unittest
format: list
groupby: task.id
limit: 20000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need such a high limit? If so, we might want to use destination: url like https://github.com/mozilla/mozci/blob/737c4cc0810bd745d13c90c511d18afff6baee20/mozci/queries/push_revisions.query.

Copy link
Contributor Author

@worldomonation worldomonation Jul 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we do not - this was just carried over from the working query. It should return only 1 row. 1000 ought to be enough, though I'd like @klahnakoski to weigh in.

mozci/task.py Outdated Show resolved Hide resolved
mozci/task.py Outdated Show resolved Hide resolved
mozci/task.py Show resolved Hide resolved
@@ -358,6 +360,30 @@ def configuration(self):
parts = config.split("-")
return "-".join(parts[:-1] if parts[-1].isdigit() else parts)

@property
def overhead(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined on a specific task, but we will have to define overheads, total_overhead, median_overhead on LabelSummary to actually use the information (e.g. the median overhead for a given suite on a given platform/configuration).

For task-level scheduling, we can just consider the sum of the median durations of all scheduled tasks.
For manifest-level scheduling, for each task we'll have to sum the durations of the groups in the task and sum the overhead associated with the task (I think by having a median overhead by suite and platform/configuration, we might be precise enough; maybe even just platform/configuration?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I will implement them while I wait for review from ahal and ekyle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@worldomonation I will require the group-level run times, but I plan to compute the aggregates after the data has been put into the database. Specifically, there is a BigQuery database with some of the data generated by mozci; my plan is to use include what these functions generate and form queries from there.

mozci/task.py Outdated
Comment on lines 380 to 381
assert data["task_min"] < data["group_min"]
assert data["task_max"] > data["group_max"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not crash on a problem: We should be emitting problems as "warnings" to the logs and also declaring the overhead a zero (our best estimate).

Depanding on your logging library, something like:

if not (data["task_min"] <= data["group_min"] <= data["group_max"] <= data["task_max"]):
    Log.warning(
        "task {{task.id}} has inconsistent group duration: {{data}}",  
        task=task,
        data=data
    )
    return 0

Copy link
Member

@ahal ahal Jul 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why shouldn't we crash on a problem?

I agree crashing on a problem is bad if you're writing an ETL, but mozci is not an ETL, it's a library. Some consumers of the library will want to ignore errors and keep going (like an ETL). Other consumers of the library will want to fail loudly so they can be notified immediately when things go awry. The nice thing about crashing is that it lets the consumers decide how to handle the situation.

Though assertions are trickier to catch than a proper Exception, so if we're going to do this validation it should be using something like ValueError and the fact ValueError can be raised should be well documented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahal Since you are the owner, I will leave the error mitigation patterns to you. In the meantime, my reasons are:

  • We have a reasonable mitigation - The assert is catching logic bugs: raising exception when the data is incoherent. The definition of overhead is our "best estimate of overhead": In the event of incoherent data, a zero estimate is a legitimate return value.
  • Less work for callers - For every raised exception, there is at least one try/catch needed to handle that error. All callers must implement their own mitigation, possibly the same mitigation. I doubt anyone will be writing an exception handler for these asserts.
  • Callers have less context - Callers are less likely to know a suitable mitigation: It is likely the caller will let the error propagate up to terminate the program.
  • asserts are for debugging - These specific asserts exist to crash in the event of bad data; to crash in the event of unanticipated data; which is useful only in debugging. My suggestion is to shunt the error to stderr in production, and consume that channel offline to guide further development.
  • this is a rare event - I do not expect these asserts to be broken

Yes, letting an exception escape the callee allows the caller to decide mitigation, but I am not advocating suppression of all exceptions. In this case, we have a reasonable mitigation strategy that I doubt any caller will ever write a handler for; plus it remains visible to anyone monitoring stderr.

Of course, I can always be wrong. I am sure I have promoted past warnings to errors so that a better strategy can be implemented by a caller: In which case I advocate a more detailed description so the caller has the context to make the best decision:

if not (data["task_min"] <= data["group_min"] <= data["group_max"] <= data["task_max"]):
    Log.error(
        "task {{task.id}} has inconsistent group duration: {{data}}",  
        task=task,
        data=data
    )

@@ -358,6 +360,30 @@ def configuration(self):
parts = config.split("-")
return "-".join(parts[:-1] if parts[-1].isdigit() else parts)

@property
def overhead(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@worldomonation I will require the group-level run times, but I plan to compute the aggregates after the data has been put into the database. Specifically, there is a BigQuery database with some of the data generated by mozci; my plan is to use include what these functions generate and form queries from there.

	- implement overhead-related methods for LabelSummary
	- add overhead related abstract properties to GroupSummary, RunnableSummary
	- GroupSummary overhead will currently return 0
- {value: result.start_time, name: group_min, aggregate: min}
- {value: result.end_time, name: group_max, aggregate: max}
where:
- in: {task.id: {$eval: task_id}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having a separate query for this (or maybe in addition), I think we could refactor the existing push_task_results_from_unittest.query to include this information as well, saving ourselves from running a separate query.

I'd be fine if you decide to file a follow-up issue to implement this and save it for later.

return [task.overhead for task in self.tasks]

@property
def total_overheads(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shouldn't be pluralized

@@ -358,6 +360,30 @@ def configuration(self):
parts = config.split("-")
return "-".join(parts[:-1] if parts[-1].isdigit() else parts)

@property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be memoized

@worldomonation
Copy link
Contributor Author

I haven't forgotten about this PR yet, but I can't promise when I can get to it - job search takes first priority.

I can likely address @ahal's comments on the method name and memoization but I think it would be better to not refactor the ActiveData query at this point in time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants