
Add cost command #240

Merged
merged 25 commits into cromshell_2.0 from bs_cost_command on Mar 30, 2023

Conversation


@bshifaw bshifaw commented Jan 24, 2023

The cost command will provide the cost of running a workflow, and will have a "detailed" option that breaks down the cost per task of the workflow.
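
A minimal usage sketch, assuming the flag syntax shown in the README excerpt later in this PR (-d for the detailed per-task breakdown; the cromshell entry point is assumed):

cromshell cost <workflow-id>        # total cost for the workflow
cromshell cost -d <workflow-id>     # cost broken down per task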

@bshifaw bshifaw self-assigned this Jan 24, 2023
Member

@lbergelson lbergelson left a comment

@bshifaw Cool! I've reviewed it for you. Quick turnaround but lots of comments.

@@ -5,7 +5,9 @@
LOGGER = logging.getLogger(__name__)


CONFIG_FILE_TEMPLATE = {"cromwell_server": "str", "requests_timeout": "int"}
CONFIG_FILE_TEMPLATE = {
Member

This is a silly nitpick, but I might add line breaks so each attribute pair is on its own line. I start to get lost in lists like this when there are more than a few elements.

Collaborator Author

done
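
For reference, a minimal sketch of the reformatted template, assuming the new bq_cost_table key discussed later in this review is the added entry:

CONFIG_FILE_TEMPLATE = {
    "cromwell_server": "str",
    "requests_timeout": "int",
    "bq_cost_table": "str",  # assumed addition for the cost command
}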

"""
Get the cost for a workflow."
Only works for workflows that completed more than 8 hours ago on GCS."
Requires the 'gcp_bq_cost_table.config' configuration file to exist and contain"
Member

Is this still its own file? Or is it in the general config file now?

Collaborator Author

It's in the general config now. Fixed.

)

# check if workflow id exists
if not workflow_id_utils.workflow_id_exists(
Member

Future refactoring note: I assume we need this check pretty much every time we get the workflow id, so maybe we could bake the error checking into the get function (or another function that wraps it and does the error checking).

Collaborator Author

done
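
A minimal sketch of what baking the check into a getter could look like (the function name, keyword arguments, and error type are assumptions; only workflow_id_utils.workflow_id_exists appears in the diff above):

def get_validated_workflow_id(workflow_id: str, submission_file_path: str) -> str:
    # Hypothetical wrapper: resolve and validate the workflow id in one place so
    # each subcommand doesn't have to repeat the existence check.
    if not workflow_id_utils.workflow_id_exists(
        workflow_id=workflow_id, submission_file_path=submission_file_path
    ):
        raise ValueError(f"Workflow id '{workflow_id}' was not found in the submission file.")
    return workflow_id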

)

# check that bq setting in json config
check_cost_table_is_configured(config_options=config.cromshell_config_options)
Member

Should this really be configurable per cromwell server? I assume some servers log to different BQ tables than others?

Collaborator Author

The BQ tables are configured per GCP billing project rather than per Cromwell server. There could be a situation where someone sets up, or has access to, multiple Cromwell servers using different billing projects, but I don't know how likely that would be.

from cromshell.utilities import http_utils

formatted_metadata_parameter = metadata_command.format_metadata_params(
list_of_keys=config.METADATA_KEYS_TO_OMIT,
Member

We should probably set this metadata call to get only the keys we need for the timestamps instead of using the general metadata settings.

Collaborator Author

Done, using the following keys: ["start", "status", "id", "end", "backend", "workflowProcessingEvents"].
The "backend" key lists the backend used to run the workflow. This is useful to confirm that the workflow was executed on the cloud (PAPI, TES) and later to check for cost in Google or Azure. The not-so-great part is that this key is nested in the "calls" key, so we end up getting all the task keys as well.

Member

I wonder why they do it that way; is it possible to execute Cromwell with multiple backends simultaneously?

if row["cost"] is not None:
if row["cost"] != "cost":
cost = round(float(row["cost"]), 2)
if cost >= .01:
Member

Why is it necessary to put a .01 in? Is it to remove 0's so it's clear that there was SOME cost?

It seems like this if/else should be replaced with max(cost, .01)

Collaborator Author

Yes, that's what I assume. Here it is in CS1.
Replaced with max.
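
A minimal sketch of the suggested change (variable names follow the snippet above; output handling is omitted):

raw_cost = row["cost"]
if raw_cost is not None and raw_cost != "cost":  # skip empty values and the header row
    # Round to cents, but never report less than $0.01 so it's clear there was SOME cost.
    cost = max(round(float(raw_cost), 2), 0.01)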


def check_cost_table_is_configured(config_options: dict) -> None:
    if "bq_cost_table" not in config_options:
        raise KeyError('Cromshell config file is missing "bq_cost_table" Key')
Member

This should probably be a custom error of some sort? It's something correctable by the user. (maybe print the path to the config file that's in use as well?)

Collaborator Author

Moved the function to cromshell_config_options_file_utils, added the custom error, and generalized the function so that it checks whatever key is provided.
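
A minimal sketch of the generalized check (the module name comes from the comment above; the error class name and function signature are assumptions):

class MissingCromshellConfigParameter(Exception):
    """Hypothetical custom error; the class added in the PR may be named differently."""


def check_key_in_config(key_to_check: str, config_options: dict, config_file_path: str) -> None:
    # Generalized: works for "bq_cost_table" or any other required key, and
    # points the user at the config file they need to edit.
    if key_to_check not in config_options:
        raise MissingCromshellConfigParameter(
            f"Cromshell config file ({config_file_path}) is missing the '{key_to_check}' key."
        )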


# Todo: remove LIMIT in query
if detailed:
    query_job = client.query(
Member

Is there a structured query builder API so you're not doing direct string interpolation? It's probably not an issue, but queries built like this are insanely vulnerable to SQL-injection-type attacks. Unless, of course, Python is clever enough to use a custom interpolation engine here which performs escaping; I've heard about that for some languages but I'm not sure which.

Collaborator Author

Good call! There's some stuff about it here that can be implemented. I was able to parameterize all user input except the table name, which the article explicitly says can't be parameterized.
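
A minimal sketch of the parameterized-query approach with google-cloud-bigquery. The query body and billing-export column names are illustrative assumptions; the table name still has to be interpolated because BigQuery does not support parameterizing table identifiers, but it comes from the cromshell config rather than arbitrary user input:

from google.cloud import bigquery

client = bigquery.Client()

query = f"""
    SELECT wfid.value AS workflow_id, SUM(cost) AS cost
    FROM `{bq_cost_table}`, UNNEST(labels) AS wfid
    WHERE wfid.value = @workflow_id
      AND export_time BETWEEN @start_date AND @end_date
    GROUP BY workflow_id
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("workflow_id", "STRING", workflow_id),
        bigquery.ScalarQueryParameter("start_date", "TIMESTAMP", start_date),
        bigquery.ScalarQueryParameter("end_date", "TIMESTAMP", end_date),
    ]
)
query_job = client.query(query, job_config=job_config)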

:param end_date:
:return:
"""
from google.cloud import bigquery
Member

Is it considered good practice to do imports as locally as possible? It always seems weird to me because it means it's hard to tell what a file depends on, and it won't fail on missing imports until later, but maybe that's part of the Python style?

Collaborator Author

import statements can be executed just about anywhere. It's often useful to place them inside functions to restrict their visibility and/or reduce initial startup time. Although Python's interpreter is optimized to not import the same module multiple times, repeatedly executing an import statement can seriously affect performance in some circumstances.

I was thinking local would be savvy because we could do some checks (has it been 24 hours, does the workflow id exist) before importing the library. But I can move it back up to the top for visibility; I don't think it's that big of an import.
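
A minimal sketch of the deferred-import trade-off being discussed (the function and checks are placeholders):

def get_cost(workflow_id, start_date, end_date):
    # Cheap validation first (workflow id exists, workflow ended > 24 hours ago, ...)
    # so the command can fail fast without paying the import cost below.
    ...

    # Deferred import: google-cloud-bigquery is only loaded once the cost query
    # actually runs. Moving it to the top of the file trades a little startup
    # time for clearer dependency visibility.
    from google.cloud import bigquery

    client = bigquery.Client()
    ...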

detailed=detailed
)

temp_query_result_csv = NamedTemporaryFile()
Member

Why are we writing all these temp files? Can't we just do this processing in memory?

Collaborator Author

updated
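
A minimal sketch of the in-memory alternative to the NamedTemporaryFile round-trip (downstream formatting omitted):

# QueryJob.result() yields Row objects that behave like mappings, so the rows
# can be collected directly instead of being written to a temporary CSV.
query_rows = [dict(row) for row in query_job.result()]
for row in query_rows:
    ...  # e.g. sum the costs, or print the per-task breakdown when --detailed is set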

@bshifaw bshifaw added the Cromshell 2 Issues related to Cromshell 2.0 label Feb 15, 2023
@bshifaw bshifaw marked this pull request as ready for review March 2, 2023 19:11
@@ -64,32 +64,6 @@ def override_requests_cert_parameters(skip_certs: bool) -> None:
)


class WorkflowStatuses(Enum):
Collaborator Author

Moved to its own file called workflow_status_utils.py.
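
A minimal sketch of the relocated enum (the members are assumptions based on Cromwell's standard workflow statuses; the actual list in the PR may differ):

# workflow_status_utils.py
from enum import Enum


class WorkflowStatuses(Enum):
    SUBMITTED = "Submitted"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    ABORTED = "Aborted"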

bshifaw added 3 commits March 17, 2023 09:59
Member

@lbergelson lbergelson left a comment

@bshifaw very minor comments
looks good after those

README.md Outdated
#### Get cost for a workflow
* `cost [-c] [-d] [workflow-id] [[workflow-id]...]`
* Get the cost for a workflow.
* Only works for workflows that completed more than 8 hours ago on GCS.
Member

8 hours here, 24 below?

Requires the 'bq_cost_table' key in the cromshell configuration file to be
set to the big query cost table for your organization.

Costs here DO NOT include any call cached tasks.
Member

It would be useful to at least annotate which tasks were call cached.

Collaborator Author

Created issue ticket #250.

def main(config, workflow_ids: str or int, detailed: bool, color: bool):
    """
    Get the cost for a workflow.
    Only works for workflows that completed more than 24 hours ago on GCS.
Member

8 or 24?

"""
Get the cost for a workflow.
Only works for workflows that completed more than 24 hours ago on GCS.
Requires the 'bq_cost_table' key in the cromshell configuration file to be
Member

This is listed as bq_cost_table here, but as gcp_bq_cost_table in the README.

README.md Outdated
#### Get cost for a workflow
* `cost [-c] [-d] [workflow-id] [[workflow-id]...]`
* Get the cost for a workflow.
* Only works for workflows that completed more than 8 hours ago on GCS.
Collaborator Author

done

@bshifaw bshifaw merged commit 3a323e4 into cromshell_2.0 Mar 30, 2023
@bshifaw bshifaw deleted the bs_cost_command branch March 30, 2023 19:10