In this tutorial, we show you how to use AWS tools directly to execute and schedule notebooks. To make the process easier, we have provided a CloudFormation template to set up the Lambda function you'll need and some IAM roles and policies that you'll use when running notebooks. We've also provided scripts for building and customizing the Docker container images that SageMaker Processing Jobs will use when running the notebooks.
While the library, command-line interface, and GUI options provide more convenience, there are several reasons why you may prefer to build your own infrastructure and use it to execute and schedule notebooks:
- You want to integrate notebook execution into an existing system that already sets up much of the environment.
- You want to customize the system in ways not supported by the other tools.
- You have specific security requirements that mean you want to customize/restrict permissions built into the CloudFormation template.
- You want to invoke these services from a language other than Python.
- Any other reason you want.
Note that there isn't an either/or choice. You can customize your infrastructure or containers with the resources provided here and then use the `sagemaker-run-notebook` library to run notebooks on that infrastructure. Or you can create your infrastructure with `run-notebook create-infrastructure` and then use the methods shown below to execute notebooks from Java code.
What you'll need:
- AWS client code installed. This can be the AWS CLI or AWS language bindings for your language.
- AWS credentials set up that grant full permissions on SageMaker, IAM, CloudFormation, Lambda, CloudWatch Events, and ECR.
- Docker installed locally.
- Two files from the release on GitHub: `cloudformation.yml` and `container.tar.gz`.
Note: This tutorial shows all these operations using the AWS CLI, but the equivalent operations using the Boto3 library in Python or language bindings in other languages will work just as well.
To create the roles, policies, and Lambda function, create the CloudFormation stack:
$ aws cloudformation create-stack --stack-name sagemaker-run-notebook --template-body file://$(pwd)/cloudformation.yml --capabilities CAPABILITY_NAMED_IAM
To see whether the stack was created successfully, run:
$ aws cloudformation describe-stacks --stack-name sagemaker-run-notebook
The `StackStatus` field in the output should be `CREATE_COMPLETE`. (Alternatively, `aws cloudformation wait stack-create-complete --stack-name sagemaker-run-notebook` blocks until creation finishes.)
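If you're scripting this check rather than eyeballing the output, the status can be pulled out of the describe-stacks JSON directly. A minimal stdlib sketch — the sample document below is an abbreviated, hypothetical version of the real output:

```python
import json

def stack_status(describe_stacks_output: str) -> str:
    """Return StackStatus from `aws cloudformation describe-stacks` JSON output."""
    doc = json.loads(describe_stacks_output)
    return doc["Stacks"][0]["StackStatus"]

# Abbreviated sample of the describe-stacks output shape:
sample = '{"Stacks": [{"StackName": "sagemaker-run-notebook", "StackStatus": "CREATE_COMPLETE"}]}'
print(stack_status(sample))  # CREATE_COMPLETE
```

The same logic works on the response dict returned by Boto3's `describe_stacks`, since it has the same shape.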
One of the policies created here is `ExecuteNotebookClientPolicy-us-east-1` (replace `us-east-1` with the name of the region you're running in). If you're not running with administrative permissions, add that policy to the user or role that you use to invoke and schedule notebooks. For complete information on the roles and policies, as well as the source code for the Lambda function, see the `cloudformation.yml` file, which you can view on GitHub or download from the latest release.
Notebook executions run as SageMaker Processing Jobs inside a Docker container. For this project, we have defined the container to include a script that sets up the environment and runs Papermill on the input notebook.
The `container.tar.gz` file in the latest release contains everything you need to build and customize the container. You can edit the `requirements.txt` file to specify Python libraries that your notebooks will need, as described [in the pip documentation][requirements].
$ tar xvf container.tar.gz
$ cd container
<edit requirements.txt to add any dependencies you need>
$ ./build-and-push.sh notebook-runner
Note: You must have Docker installed for `build-and-push.sh` to work. If you prefer not to install and run Docker, use the convenience package as described in Create a Docker container to run the notebook; that technique uses CodeBuild to build the container image instead.
First, upload the notebook you want to run to S3:
$ aws s3 cp mynotebook.ipynb s3://mybucket/
(replace "mynotebook" and "mybucket" with appropriate values)
Then invoke the Lambda function to run the notebook:
$ aws lambda invoke --function-name RunNotebook --payload '{"input_path": "s3://mybucket/mynotebook.ipynb", "parameters": {"p": 0.75}}' result.json
(With AWS CLI v2, add --cli-binary-format raw-in-base64-out so the JSON payload is read as raw text rather than base64.)
The file `result.json` will contain the name of your SageMaker Processing Job. The job is now running; it will take a few minutes to provision a processing node and execute the notebook.
The Lambda function provided has many options for how your notebook is run (like customizing the container image, using a specific IAM role, etc.). You can see all of these in the Lambda function definition included in the CloudFormation template.
You should feel free to customize the way the Lambda function calls SageMaker Processing to add any custom behavior that you would like to have.
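From Python, the equivalent invocation goes through Boto3's Lambda client. A minimal sketch — the payload keys are the ones used in the CLI command above, and the Boto3 call itself is shown in a comment so the snippet runs without AWS credentials:

```python
import json

# Arguments for the RunNotebook Lambda; see cloudformation.yml for the full set of options.
payload = {
    "input_path": "s3://mybucket/mynotebook.ipynb",  # notebook uploaded to S3 earlier
    "parameters": {"p": 0.75},                       # papermill parameters
}
payload_json = json.dumps(payload)

# With Boto3 and AWS credentials configured, the call equivalent to `aws lambda invoke` is:
#   import boto3
#   resp = boto3.client("lambda").invoke(FunctionName="RunNotebook", Payload=payload_json)
#   result = json.loads(resp["Payload"].read())

print(payload_json)
```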
Extract the job name from the result.json and run the following command to view the job status:
$ aws sagemaker describe-processing-job --processing-job-name myjob --output json
{
    ...
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                ...
                "S3Output": {
                    "S3Uri": "s3://mybucket",
                    ...
                }
            }
        ]
    },
    ...
    "Environment": {
        ...
        "PAPERMILL_OUTPUT": "/opt/ml/processing/output/mynotebook-2020-06-08-03-44-04.ipynb",
        ...
    },
    ...
    "ProcessingJobStatus": "InProgress",
    ...
}
By viewing the status in the result, you can see whether the job is in progress, succeeded, or failed.
When the job succeeds, the output is saved back to S3. To find the S3 object name, combine the output path from the processing job with the base file name from the environment variable `PAPERMILL_OUTPUT`. For instance, in the above example the output notebook would be `s3://mybucket/mynotebook-2020-06-08-03-44-04.ipynb`.
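Assembling that S3 path can be scripted. A small sketch using the example values above (the helper name is ours, not part of the tooling):

```python
import posixpath

def output_notebook_uri(s3_output_uri: str, papermill_output: str) -> str:
    """Join the job's S3Output.S3Uri with the basename of PAPERMILL_OUTPUT."""
    return s3_output_uri.rstrip("/") + "/" + posixpath.basename(papermill_output)

uri = output_notebook_uri(
    "s3://mybucket",
    "/opt/ml/processing/output/mynotebook-2020-06-08-03-44-04.ipynb",
)
print(uri)  # s3://mybucket/mynotebook-2020-06-08-03-44-04.ipynb
```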
You can then use the following command to copy the notebook to your local current directory:
$ aws s3 cp s3://outputpath/basename.ipynb .
You can use CloudWatch Events to schedule notebook executions. For example, to run the notebook that we previously uploaded every day at 1:15 AM UTC, use the following commands:
$ aws events put-rule --name "RunNotebook-test" --schedule-expression "cron(15 1 * * ? *)"
$ aws lambda add-permission --statement-id EB-RunNotebook-test \
--action lambda:InvokeFunction \
--function-name RunNotebook \
--principal events.amazonaws.com \
--source-arn arn:aws:events:us-east-1:981276346578:rule/RunNotebook-test
$ aws events put-targets --rule RunNotebook-test \
--targets '[{"Id": "Default", "Arn": "arn:aws:lambda:us-east-1:981276346578:function:RunNotebook", "Input": "{ \"input_path\": \"s3://mybucket/mynotebook.ipynb\", \"parameters\": {\"p\": 0.75}}"}]'
Substitute your account number in place of 981276346578 in the source ARN and the Lambda function ARN, and substitute the region you're working in for "us-east-1" in both ARNs. Substitute the location where you stored your notebook as the input_path argument.
The `Input` field in the `put-targets` call holds the arguments to the Lambda function; it can be customized to anything the Lambda accepts. (See `cloudformation.yml` for the Lambda function definition.)
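Because `Input` is a JSON document embedded as a string inside another JSON document, the escaping is easy to get wrong by hand. One way to generate the `--targets` argument is to let `json.dumps` do the quoting (the account number and region here are the placeholders from the commands above):

```python
import json

# The Lambda arguments, as a normal Python dict.
lambda_args = {
    "input_path": "s3://mybucket/mynotebook.ipynb",
    "parameters": {"p": 0.75},
}

target = {
    "Id": "Default",
    "Arn": "arn:aws:lambda:us-east-1:981276346578:function:RunNotebook",
    "Input": json.dumps(lambda_args),  # Lambda payload, serialized to a string
}

# Value for `aws events put-targets --rule RunNotebook-test --targets ...`:
print(json.dumps([target]))
```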
Note that times are always in UTC. For the full rules on schedule expressions, see the CloudWatch Events documentation: Schedule Expressions for Rules
When the notebook has run, you can find the job with `aws sagemaker list-processing-jobs`, then describe the job and download the output notebook as described above.