-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve failure status reason #117
Conversation
done |
src/braket_container.py
Outdated
try: | ||
customer_method(**kwargs) | ||
except Exception as exc: | ||
queue.put(exc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to re-throw the exception here so that the customer execution script will end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just write to the output file and exit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is code that runs inside the process, but we want to "log failure and exit" from the main container process. Putting the exception in the queue lets us transfer the information from the customer method process to the main container process. Logging the failure may work properly from within the child process, but calling sys.exit won't impact the parent process like we want.
Our current codepath
- Child process start
- Child process raises an exception
- Child process joins (immediately after exception occurs)
- Check if child process exited with code 0
- If not, raise a generic exception
New codepath
- Child process starts
- Child process raises an exception
- Exception is captured and put in the queue
- Child process joins (immediately after exception occurs)
- Check if there is an exception in the queue
- If so, raise a specific exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to re-throw the exception here so that the customer execution script will end?
No, since there are no more lines of code in wrapped_customer_method
. We may want to rethrow the exception for the sake of logging; I'll check what the impact is here
New changes will need testing updates |
src/braket_container.py
Outdated
log_failure_and_exit(f"Unable to run job at entry point {entry_point}\nException: {e}") | ||
log_failure_and_exit(f"Job did not exit gracefully.\nException: {e}") | ||
|
||
customer_code_process.join() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we doing .join() twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
src/braket_container.py
Outdated
customer_module = importlib.import_module(str_module) | ||
customer_method = getattr(customer_module, str_method) | ||
else: | ||
def customer_method(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be misnamed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can update the name to be "customer_executable"
mock_subprocess.run.assert_called_with( | ||
["python", "-m", "test_entry_point"], cwd='/opt/braket/code/customer_code/extracted', | ||
) | ||
# mock_subprocess.run.assert_called_with( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanup here and below?
src/braket_container.py
Outdated
customer_method = extract_customer_method(entry_point) | ||
customer_method_process = kick_off_customer_script(customer_method) | ||
join_customer_script(customer_method_process) | ||
sys.exit(customer_method_process.exitcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not crazy about exiting with 0 here, since we may want to add additional functionality after this method. I would check if it's non-zero (as we were doing before), and then exit with 0 (or SM won't propagate the error correctly). Or we should check if they've fixed that bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, still see a blank sm error message with a non-zero exit code. Fixed to return zero on failure as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbf, there's no sagemaker error when exiting with zero anyways. The braket failure reason does get propagated in both cases, just a difference in the logs. I'd argue perhaps SM_ERROR=""
is better than Reporting training SUCCESS
, so maybe we want to use the correct exit code?
Ran manual tests successfully on the latest commit |
print("Code Run Finished") | ||
return_code = result.returncode | ||
return return_code | ||
return customer_code_process.exitcode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're actually not using this returned value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Issue #, if available:
Description of changes:
Log specific exception from customer method to failure status
Testing done:
Manual tests and unit test added, braket and sagemaker integ tests passing for base
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.