#102 - Add overhead value to TestTask #243

Open · wants to merge 4 commits into base: master
Changes from 2 commits
11 changes: 11 additions & 0 deletions mozci/queries/test_task_overhead.query
@@ -0,0 +1,11 @@
from: unittest
format: list
groupby: task.id
limit: 20000
Collaborator

Do we need such a high limit? If so, we might want to use destination: url like https://github.com/mozilla/mozci/blob/737c4cc0810bd745d13c90c511d18afff6baee20/mozci/queries/push_revisions.query.

Contributor Author (@worldomonation, Jul 17, 2020)

No we do not - this was just carried over from the working query. It should return only 1 row. 1000 ought to be enough, though I'd like @klahnakoski to weigh in.

select:
- {value: action.start_time, name: task_min, aggregate: min}
- {value: action.end_time, name: task_max, aggregate: max}
- {value: result.start_time, name: group_min, aggregate: min}
- {value: result.end_time, name: group_max, aggregate: max}
where:
- in: {task.id: {$eval: task_id}}
Member

Rather than having a separate query for this (or maybe in addition), I think we could refactor the existing push_task_results_from_unittest.query to include this information as well, saving ourselves from running a separate query.

I'd be fine if you decide to file a follow-up issue to implement this and save it for later.
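
A rough sketch of that idea (not a working patch; the real push_task_results_from_unittest.query keeps its own select/where/groupby clauses, which would need to be preserved) would be to fold the four aggregates from this new query into its select block:

select:
  # ... existing columns of push_task_results_from_unittest.query ...
  - {value: action.start_time, name: task_min, aggregate: min}
  - {value: action.end_time, name: task_max, aggregate: max}
  - {value: result.start_time, name: group_min, aggregate: min}
  - {value: result.end_time, name: group_max, aggregate: max}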

26 changes: 26 additions & 0 deletions mozci/task.py
@@ -2,13 +2,15 @@
import json
import os
from abc import ABC, abstractmethod
from argparse import Namespace
from dataclasses import dataclass, field
from enum import Enum
from inspect import signature
from statistics import median
from typing import Dict, List, Optional

import requests
from adr.query import run_query
from adr.util import memoized_property
from loguru import logger
from urllib3.response import HTTPResponse
@@ -358,6 +360,30 @@ def configuration(self):
parts = config.split("-")
return "-".join(parts[:-1] if parts[-1].isdigit() else parts)

@property
Member

This should be memoized

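A minimal sketch of that suggestion, assuming memoized_property from adr.util (already imported at the top of task.py) caches the computed value the way functools.cached_property would:

from argparse import Namespace

from adr.query import run_query
from adr.util import memoized_property


class TestTask(Task):
    @memoized_property
    def overhead(self):
        # Same body as the @property version below; only the decorator changes,
        # so the ActiveData query runs at most once per TestTask instance.
        data = run_query("test_task_overhead", Namespace(task_id=self.id))["data"].pop()
        return (data["group_min"] - data["task_min"]) + (
            data["task_max"] - data["group_max"]
        )
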
def overhead(self):
Collaborator

This is defined on a specific task, but we will have to define overheads, total_overhead, median_overhead on LabelSummary to actually use the information (e.g. the median overhead for a given suite on a given platform/configuration).

For task-level scheduling, we can just consider the sum of the median durations of all scheduled tasks.
For manifest-level scheduling, for each task we'll have to sum the durations of the groups in the task and sum the overhead associated with the task (I think by having a median overhead by suite and platform/configuration, we might be precise enough; maybe even just platform/configuration?).
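
A rough sketch of what those aggregates might look like, assuming LabelSummary exposes its list of tasks as self.tasks (the property names follow the comment above and are illustrative, not final):

from statistics import median


class LabelSummary:
    # ... existing fields and methods ...

    @property
    def overheads(self):
        # Per-task overhead for every task sharing this label.
        return [task.overhead for task in self.tasks]

    @property
    def total_overhead(self):
        return sum(self.overheads)

    @property
    def median_overhead(self):
        return median(self.overheads)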

Contributor Author

Right, I will implement them while I wait for review from ahal and ekyle.

Contributor

@worldomonation I will require the group-level run times, but I plan to compute the aggregates after the data has been put into the database. Specifically, there is a BigQuery database with some of the data generated by mozci; my plan is to include what these functions generate and form queries from there.

"""Calculate the overhead of a task.

The methodology is simple: each task (action) has a start/end time.
Each group also has a start/end time. Take the earliest known group start
time and the latest known group end time, and ensure both fall within the
task's start/end times.

This definition of overhead does not account for inter-group overhead,
e.g. restarting the browser, teardown, etc.

Returns:
float: difference between task start/end and group start/end times.
"""
data = run_query("test_task_overhead", Namespace(task_id=self.id))["data"].pop()
# Sanity check to ensure group start/end times are within task start/end.
assert data["task_min"] < data["group_min"]
assert data["task_max"] > data["group_max"]
Contributor

Do not crash on a problem: we should emit problems as "warnings" to the logs and also declare the overhead to be zero (our best estimate).

Depending on your logging library, something like:

if not (data["task_min"] <= data["group_min"] <= data["group_max"] <= data["task_max"]):
    Log.warning(
        "task {{task.id}} has inconsistent group duration: {{data}}",  
        task=task,
        data=data
    )
    return 0
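
For reference, task.py already imports loguru's logger, so a loguru-flavored version of the same fallback might look like this (a sketch, not part of this patch):

from loguru import logger

if not (data["task_min"] <= data["group_min"] <= data["group_max"] <= data["task_max"]):
    # Incoherent data: warn and fall back to zero, our best estimate of the overhead.
    logger.warning(
        "task {id} has inconsistent group duration: {data}", id=self.id, data=data
    )
    return 0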

Member (@ahal, Jul 17, 2020)

Why shouldn't we crash on a problem?

I agree crashing on a problem is bad if you're writing an ETL, but mozci is not an ETL, it's a library. Some consumers of the library will want to ignore errors and keep going (like an ETL). Other consumers of the library will want to fail loudly so they can be notified immediately when things go awry. The nice thing about crashing is that it lets the consumers decide how to handle the situation.

Though assertions are trickier to catch than a proper Exception, so if we're going to do this validation it should use something like ValueError, and the fact that ValueError can be raised should be well documented.
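
A sketch of what that could look like if the asserts were promoted to a documented exception:

if not (data["task_min"] <= data["group_min"] <= data["group_max"] <= data["task_max"]):
    # Documented behaviour: raised when the group start/end times fall outside
    # the task's own start/end times, so callers can decide how to react.
    raise ValueError(
        f"task {self.id} has inconsistent group duration: {data}"
    )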

Contributor

@ahal Since you are the owner, I will leave the error mitigation patterns to you. In the meantime, my reasons are:

  • We have a reasonable mitigation - The assert is catching logic bugs: raising exception when the data is incoherent. The definition of overhead is our "best estimate of overhead": In the event of incoherent data, a zero estimate is a legitimate return value.
  • Less work for callers - For every raised exception, there is at least one try/catch needed to handle that error. All callers must implement their own mitigation, possibly the same mitigation. I doubt anyone will be writing an exception handler for these asserts.
  • Callers have less context - Callers are less likely to know a suitable mitigation: It is likely the caller will let the error propagate up to terminate the program.
  • asserts are for debugging - These specific asserts exist to crash in the event of bad data; to crash in the event of unanticipated data; which is useful only in debugging. My suggestion is to shunt the error to stderr in production, and consume that channel offline to guide further development.
  • this is a rare event - I do not expect these asserts to be broken

Yes, letting an exception escape the callee allows the caller to decide mitigation, but I am not advocating suppression of all exceptions. In this case, we have a reasonable mitigation strategy that I doubt any caller will ever write a handler for; plus it remains visible to anyone monitoring stderr.

Of course, I can always be wrong. I am sure I have promoted warnings to errors in the past so that a better strategy could be implemented by a caller; in that case I advocate a more detailed description, so the caller has the context to make the best decision:

if not (data["task_min"] <= data["group_min"] <= data["group_max"] <= data["task_max"]):
    Log.error(
        "task {{task.id}} has inconsistent group duration: {{data}}",  
        task=task,
        data=data
    )


return (data["group_min"] - data["task_min"]) + (
    data["task_max"] - data["group_max"]
)


# Don't perform type checking because of https://github.com/python/mypy/issues/5374.
@dataclass # type: ignore