-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support functionalities to add user-provided, original labels. #446
base: master
Are you sure you want to change the base?
Conversation
… namespace support feature and test it.
gokart/task.py
Outdated
|
||
def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None) -> None: | ||
def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None, user_provided_labels: Optional[dict[Any, Any]] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, user_provided_labels
name is confusing. when user call task.dump, user can't predict what the labels will be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some candidates for this variable.
I go for custom_labels
.
gokart/task.py
Outdated
|
||
def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None) -> None: | ||
def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None, user_provided_labels: Optional[dict[Any, Any]] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think label's key should be string.
gokart/gcs_obj_metadata_client.py
Outdated
return dict(metadata) | dict(labels) | ||
|
||
@staticmethod | ||
def _add_labels_to_metadata(labels_dict, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add type for args?
gokart/gcs_obj_metadata_client.py
Outdated
@staticmethod | ||
def _normalize_labels(task_params: Optional[dict[Any, str]], user_provided_labels: Optional[dict[Any, Any]]) -> tuple[dict[Any, str], dict[Any, str]]: | ||
def _normalize_labels_helper(_params: Optional[dict[Any, Any]]) -> dict[Any, str]: | ||
return {str(key): str(value) for key, value in _params.items()} if _params else {} | ||
|
||
return ( | ||
_normalize_labels_helper(task_params), | ||
_normalize_labels_helper(user_provided_labels), | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If _normalize_labels
does the same thing for both task_params
and user_provided_labels
, it would be better to have a single argument (e.g., labels
) and call it twice.
gokart/gcs_obj_metadata_client.py
Outdated
total_metadata_size, labels = GCSObjectMetadataClient._add_labels_to_metadata( | ||
normalized_user_provided_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys | ||
) | ||
_, labels = GCSObjectMetadataClient._add_labels_to_metadata( | ||
normalized_task_params_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think GCSObjectMetadataClient
only needs to focus on adding metadata to the cache and doesn't necessarily need to be aware of the differences between task_params
and user_provided_labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adieumonks Thanks for your comment. I understand that your point is that GCSObjectMetadataClient shouldn't need to distinguish between user-provided labels and labels generated from parameters. I agree with that, and one possible approach could be to merge both sets of labels beforehand as a preprocessing step. However, as mentioned in the comment, there is a possibility that the keys of user-provided labels and parameter-generated labels could conflict. Based on the use case, I want to prioritize the values of user-provided labels. If we implement this in a straightforward way, we would end up with multiple for loops performing similar processing. To avoid that redundancy, I extracted the logic into the _add_labels_to_metadata function, which led to the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be a bad idea to overwrite task_params
with user_provided_labels
before calling the function, like GCSObjectMetadataClient.add_task_state_labels(task_params | user_provided_labels)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what GCSObjectMetadataClient.add_task_state_labels(task_params | user_provided_labels)
means, so I search it. From Python3.9, | operator merges dictionary, and right side is prioritized.
Then, I wanna log key conflict event, because if event is captured as log, user can know key conflicts, but if we adopt your suggestion, hard to log it.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Off course, we number total keys, then after merging two dictionary, if the number of total key were changed, key conflicts might be happened, so we can log it.
But, some meaningless code would be written, to detect what exact keys cause key conflicts.
…pply changes to test.
@staticmethod | ||
def _is_log_related_path(path: str) -> bool: | ||
return ( | ||
('log/random_seed' in path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about this?
('log/random_seed' in path) | |
return path.startwith('log/') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, I think your suggeston is not acceptable.
Because if user wants to change default place to dump from main to log/, no labels would be added.
The reason why I write as following implementation, log/{hogehoge}
is hard coded in gokart implementation, so these passes are fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TlexCypher
OK!
But I cannot find the reason to specify the subpaths, please add comments the reasons.
Also, imo, using regex is more suitable for this case like r'^log/(processing_time|...).+\.txt$
?
Current implementation will be true for paths likefoo/log/pprocessing_time
.
Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using regex is better, as you said.
return dict(metadata) | dict(labels) | ||
|
||
@staticmethod | ||
def _add_labels_to_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function name is ambiguous a little.
what does metadata
mean?
how about the following pattern?
please let me know your thoughts.
@dataclass
class LabelMetadata:
labels: dict[str, str] = Field(default_factory=dict)
label_keys: : set[str] = Field(default_factory=set)
total_metadata_size: int
def _add_labels_to_metadata( | |
def _create_or_append_metadata( | |
labels: dict[str, str], | |
max_gcs_metadata_size, | |
current_metadata: Optional[LabelMetadata] = None, | |
) -> LabelMetadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think creating a data class just for this is redundant.
It would be better to simply rename the method.
Related works
#445
By merging this PR, gokart can add task's parameter information to gcs cache.
What does this PR do?
This PR adds support for user-provided, original labels when saving a task output to GCS.
Why is this needed?
Users of gokart can store some data like logs, results, parameter information etc. as cache in Google Cloud Storage(GCS).
For now, gokart can provide labels, generated from each task parameter information when saving files to GCS.
But, when getting number of cached objects growing, in some cases, user want to add original labels to cache for more ease filtering.
So in this PR, support those kind way.
Checklist