-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cost command #240
Add cost command #240
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bshifaw Cool! I've reviewed it for you. Quick turnaround but lots of comments.
@@ -5,7 +5,9 @@ | |||
LOGGER = logging.getLogger(__name__) | |||
|
|||
|
|||
CONFIG_FILE_TEMPLATE = {"cromwell_server": "str", "requests_timeout": "int"} | |||
CONFIG_FILE_TEMPLATE = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a silly nitpick, but I might add line breaks so each attribute pair was on it's own line. I start to get lost in lists like this when there are more than a few elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/cromshell/cost/command.py
Outdated
""" | ||
Get the cost for a workflow." | ||
Only works for workflows that completed more than 8 hours ago on GCS." | ||
Requires the 'gcp_bq_cost_table.config' configuration file to exist and contain" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still its own file? Or is it in the general config file now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general config now. fixed
src/cromshell/cost/command.py
Outdated
) | ||
|
||
# check if workflow id exists | ||
if not workflow_id_utils.workflow_id_exists( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future refactoring note, I assume we need this check pretty much every time we get the workflow id so maybe we could bake the error checking into the get function. (Or another function that wraps i and does the error checking).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/cromshell/cost/command.py
Outdated
) | ||
|
||
# check that bq setting in json config | ||
check_cost_table_is_configured(config_options=config.cromshell_config_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this really be configurable per cromwell server? I assume some servers log to different BQ tables than others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the BQ tables are configured by the gcp project being used. There could be a situation where someone sets up or has access to multiple cromwell servers using different billing projects but I don't know how likely that would be.
src/cromshell/cost/command.py
Outdated
from cromshell.utilities import http_utils | ||
|
||
formatted_metadata_parameter = metadata_command.format_metadata_params( | ||
list_of_keys=config.METADATA_KEYS_TO_OMIT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably set this metadata call to get only the key we need for the timestamps instead of using the general metadata settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, using the following keys : ["start", "status", "id", "end", "backend", "workflowProcessingEvents"]
The "backend" key lists the backend used to run the workflow. This is useful to confirm that the workflow was executed on the cloud (Papi, Tes) and later to check for cost in google or azure. The not-so-great part about this key is nested in "calls" key so we endup getting all the task keys aswell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why they do it that way, is it possible to execute cromwell with multiple backends simultaenously?
src/cromshell/cost/command.py
Outdated
if row["cost"] is not None: | ||
if row["cost"] != "cost": | ||
cost = round(float(row["cost"]), 2) | ||
if cost >= .01: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to put a .01 in? Is it to remove 0's so it's clear that there was SOME cost?
It seems like this if/else should be replaced with max(cost, .01)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's what I assume. Here it is in CS1
replaced with max
src/cromshell/cost/command.py
Outdated
|
||
def check_cost_table_is_configured(config_options: dict) -> None: | ||
if "bq_cost_table" not in config_options: | ||
raise KeyError('Cromshell config file is missing "bq_cost_table" Key') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a custom error of some sort? It's something correctable by the user. (maybe print the path to the config file that's in use as well?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the function to the cromshell_config_options_file_utils, added the custom error, and generalized the function such that it will check whatever key provided
src/cromshell/cost/command.py
Outdated
|
||
# Todo: remove LIMIT in query | ||
if detailed: | ||
query_job = client.query( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a structured query builder AP so you're not doing direct string interpolation? It's probably not an issue but queries built like this are insanely vulnerable to sql injection type attacks. Unless of course python is clever enough to use a custom interpolation engine here which performs escaping. I have heard about for some languages but I'm not sure which.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! There's some stuff about it here that can be implemented. I was able to parameterize all user input except table name, which the article expertly says it can't do.
src/cromshell/cost/command.py
Outdated
:param end_date: | ||
:return: | ||
""" | ||
from google.cloud import bigquery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it considered good practice to do imports as locally as possible? It always seems weird to me because it means it's hard to tell what a file depends on and it won't fail for missing imports until later but maybe that's part of the python style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import statements can be executed just about anywhere. It's often useful to place them inside functions to restrict their visibility and/or reduce initial startup time. Although Python's interpreter is optimized to not import the same module multiple times, repeatedly executing an import statement can seriously affect performance in some circumstances.
I was thinking local would be savvy because we could do some checks (has it been 24 hours, does workflow id exist) before importing the library. But I can move it back up to the top for visibility, I don't think its that big of an import.
src/cromshell/cost/command.py
Outdated
detailed=detailed | ||
) | ||
|
||
temp_query_result_csv = NamedTemporaryFile() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we writing all these temp files? Can't we just do this processing in memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
…e_utils.py. Specified variable types in cromshellconfig
…mat metadata keys and retrieve workflow metadata. To make function more reusable because other commands would only want the metadata and not print it.
8bfd7e4
to
02a0163
Compare
…, removed description and workflowid columns from detailed report
…utils.py enum call
…me, first round of unit tests added, additional versions of helloworld metadata jsons added.
@@ -64,32 +64,6 @@ def override_requests_cert_parameters(skip_certs: bool) -> None: | |||
) | |||
|
|||
|
|||
class WorkflowStatuses(Enum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to its own file called workflow_status_utils.py
… cost detailed made sure 2 decimal points are printed. updated unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bshifaw very minor comments
looks good after those
README.md
Outdated
#### Get cost for a workflow | ||
* `cost [-c] [-d] [workflow-id] [[workflow-id]...]` | ||
* Get the cost for a workflow. | ||
* Only works for workflows that completed more than 8 hours ago on GCS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8 hours here, 24 below?
Requires the 'bq_cost_table' key in the cromshell configuration file to be | ||
set to the big query cost table for your organization. | ||
|
||
Costs here DO NOT include any call cached tasks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to at least annotate which tasks were call cached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created issue ticket
#250
def main(config, workflow_ids: str or int, detailed: bool, color: bool): | ||
""" | ||
Get the cost for a workflow. | ||
Only works for workflows that completed more than 24 hours ago on GCS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8 or 24?
""" | ||
Get the cost for a workflow. | ||
Only works for workflows that completed more than 24 hours ago on GCS. | ||
Requires the 'bq_cost_table' key in the cromshell configuration file to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is listed as bq_cost_table but gcp_bq_cost_table in the readme
README.md
Outdated
#### Get cost for a workflow | ||
* `cost [-c] [-d] [workflow-id] [[workflow-id]...]` | ||
* Get the cost for a workflow. | ||
* Only works for workflows that completed more than 8 hours ago on GCS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
The cost command will provide the cost for running a workflow, and have a "detailed" option that breaks down the cost per task of a workflow