Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve failure status reason #117

Merged
merged 14 commits into from
Oct 26, 2023
Merged

feat: improve failure status reason #117

merged 14 commits into from
Oct 26, 2023

Conversation

ajberdy
Copy link
Contributor

@ajberdy ajberdy commented Oct 24, 2023

Issue #, if available:

Description of changes:

Log specific exception from customer method to failure status

Testing done:

Manual tests and unit test added, braket and sagemaker integ tests passing for base

Screenshot 2023-10-23 at 5 32 19 PM Screenshot 2023-10-23 at 5 32 07 PM

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have checked that my tests are not configured for a specific region or account (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ajberdy ajberdy marked this pull request as draft October 24, 2023 02:47
@ajberdy
Copy link
Contributor Author

ajberdy commented Oct 24, 2023

Pending test updates

done

@ajberdy ajberdy marked this pull request as ready for review October 24, 2023 03:45
try:
customer_method(**kwargs)
except Exception as exc:
queue.put(exc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to re-throw the exception here so that the customer execution script will end?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just write to the output file and exit here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is code that runs inside the process, but we want to "log failure and exit" from the main container process. Putting the exception in the queue lets us transfer the information from the customer method process to the main container process. Logging the failure may work properly from within the child process, but calling sys.exit won't impact the parent process like we want.

Our current codepath

  • Child process start
  • Child process raises an exception
  • Child process joins (immediately after exception occurs)
  • Check if child process exited with code 0
  • If not, raise a generic exception

New codepath

  • Child process starts
  • Child process raises an exception
  • Exception is captured and put in the queue
  • Child process joins (immediately after exception occurs)
  • Check if there is an exception in the queue
  • If so, raise a specific exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to re-throw the exception here so that the customer execution script will end?

No, since there are no more lines of code in wrapped_customer_method. We may want to rethrow the exception for the sake of logging; I'll check what the impact is here

@ajberdy
Copy link
Contributor Author

ajberdy commented Oct 24, 2023

New changes will need testing updates

log_failure_and_exit(f"Unable to run job at entry point {entry_point}\nException: {e}")
log_failure_and_exit(f"Job did not exit gracefully.\nException: {e}")

customer_code_process.join()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we doing .join() twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

src/braket_container.py Show resolved Hide resolved
customer_module = importlib.import_module(str_module)
customer_method = getattr(customer_module, str_method)
else:
def customer_method():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be misnamed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can update the name to be "customer_executable"

src/braket_container.py Show resolved Hide resolved
src/braket_container.py Show resolved Hide resolved
src/braket_container.py Show resolved Hide resolved
src/braket_container.py Show resolved Hide resolved
mock_subprocess.run.assert_called_with(
["python", "-m", "test_entry_point"], cwd='/opt/braket/code/customer_code/extracted',
)
# mock_subprocess.run.assert_called_with(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup here and below?

src/braket_container.py Show resolved Hide resolved
customer_method = extract_customer_method(entry_point)
customer_method_process = kick_off_customer_script(customer_method)
join_customer_script(customer_method_process)
sys.exit(customer_method_process.exitcode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about exiting with 0 here, since we may want to add additional functionality after this method. I would check if it's non-zero (as we were doing before), and then exit with 0 (or SM won't propagate the error correctly). Or we should check if they've fixed that bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, still see a blank sm error message with a non-zero exit code. Fixed to return zero on failure as before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbf, there's no sagemaker error when exiting with zero anyways. The braket failure reason does get propagated in both cases, just a difference in the logs. I'd argue perhaps SM_ERROR="" is better than Reporting training SUCCESS, so maybe we want to use the correct exit code?

@ajberdy
Copy link
Contributor Author

ajberdy commented Oct 25, 2023

Ran manual tests successfully on the latest commit

@ajberdy ajberdy requested a review from krneta October 25, 2023 18:15
print("Code Run Finished")
return_code = result.returncode
return return_code
return customer_code_process.exitcode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're actually not using this returned value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

krneta
krneta previously approved these changes Oct 25, 2023
@ajberdy ajberdy merged commit 58e289b into main Oct 26, 2023
6 checks passed
@ajberdy ajberdy deleted the log-tb branch October 26, 2023 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants