An AWS EMR notebook provides JupyterLab and Jupyter hosted in a t3.small-sized container. The container runs in an AWS-managed account (244647660906), with an ENI attaching it to your VPC. The instance does not have internet access.
The notebook uses a conda environment with Python 3.7.8 and site-packages at /opt/conda/lib/python3.7/site-packages/.
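To confirm this from a notebook, sparkmagic (described below) provides a %%local magic that runs a cell in the notebook container rather than on the cluster. A quick check:

%%local
# Runs in the notebook container, not on the cluster (sparkmagic's %%local magic).
import sys
import site
print(sys.version)             # expect 3.7.8
print(site.getsitepackages())  # expect /opt/conda/lib/python3.7/site-packages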
The EMR notebook environment has sparkmagic, which is used to access your EMR cluster via Livy. Livy runs on the EMR master node and starts the Spark driver in-process. Livy's REST API listens on port 8998.
The Spark job monitoring widget is installed. The widget fetches and displays Spark job progress from the Spark monitoring REST API. Its Python package name is awseditorssparkmonitoringwidget.
The contents of /home/notebook/work/ are continually synced to S3. This is performed by the Python package awseditorsstorage, a Jupyter ContentsManager built by AWS. There is no direct S3 access from the environment, so it uses an HTTP service running on localhost called managed-workspace to transfer files to/from S3.
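A quick way to confirm the package is present in the container environment (again via %%local; the package name is taken from above):

%%local
# Check that the AWS ContentsManager package is importable from the container's environment.
import importlib.util
print(importlib.util.find_spec("awseditorsstorage"))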
The jupyterlab-git extension is also installed.
To connect to an EMR cluster from a notebook, the cluster needs to be running JupyterEnterpriseGateway.
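You can check a cluster's installed applications from any machine with AWS API access. A minimal boto3 sketch (the cluster ID is a placeholder):

import boto3

# JupyterEnterpriseGateway must appear in the cluster's application list
# for an EMR notebook to attach to it.
emr = boto3.client("emr")
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]
print([app["Name"] for app in cluster["Applications"]])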
The PySpark kernel runs on the master node (inside the Livy process?).
On the first execution of any cell, Livy starts a Spark session and the sparkmagic info widget is displayed. The widget contains links to the Spark UI and driver logs; these links require network-level access to the cluster.
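Session settings (driver/executor sizing, Spark conf) can be set with sparkmagic's %%configure magic; run it before the first cell, or add -f to force-restart an existing session. The values below are purely illustrative:

%%configure -f
{
    "driverMemory": "4g",
    "executorMemory": "4g",
    "numExecutors": 4,
    "conf": {"spark.sql.shuffle.partitions": "200"}
}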
The spark variable contains the Spark session. Example of loading a dataset:
df = spark.read.csv("s3://ebirdst-data/ebirdst_run_names.csv")
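By default spark.read.csv reads every column as a string and treats the first row as data. If the file has a header row (an assumption about this dataset), something like this is usually more useful:

df = spark.read.csv(
    "s3://ebirdst-data/ebirdst_run_names.csv",
    header=True,       # treat the first row as column names
    inferSchema=True,  # sample the file to infer column types
)
df.printSchema()
df.show(5)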
To see all Livy sessions:
%%info
To see Livy logs:
%%logs
EMR stores all cluster logs at the Log URI specified when a cluster is created. Both the Log URI and the Cluster ID are visible from the EMR cluster Summary tab.
The Livy/Spark driver logs are written to livy-livy-server.out.gz, which is persisted to "${LOG_URI}/${CLUSTER_ID}/node/${MASTER_INSTANCE_ID}/applications/livy/livy-livy-server.out.gz".
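Since the notebook environment has no direct S3 access, the log has to be fetched from somewhere else, e.g. your own machine. A boto3 sketch (bucket, prefix, cluster ID, and instance ID are placeholders):

import gzip
import boto3

# Placeholders - substitute your Log URI bucket/prefix, cluster ID, and master instance ID.
bucket = "my-emr-logs"
key = "logs/j-XXXXXXXXXXXXX/node/i-0123456789abcdef0/applications/livy/livy-livy-server.out.gz"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
# Print the tail of the decompressed log.
print(gzip.decompress(body).decode("utf-8", errors="replace")[-5000:])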
The default Livy session timeout is 60 minutes. If your job dies at around the 60-minute mark, and not because of a Spark executor failure, then it's probably Livy timing out the session. To increase the timeout, use this EMR configuration:
[
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.server.session.timeout-check": "true",
      "livy.server.session.timeout": "8h",
      "livy.server.yarn.app-lookup-timeout": "120s"
    }
  }
]
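This classification is supplied when the cluster is created. A minimal boto3 sketch of passing it at creation time; the name, roles, release label, and instance settings are placeholders:

import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="notebook-cluster",
    ReleaseLabel="emr-6.2.0",
    LogUri="s3://my-emr-logs/logs/",
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}, {"Name": "JupyterEnterpriseGateway"}],
    # The livy-conf classification from above, applied at cluster creation.
    Configurations=[
        {
            "Classification": "livy-conf",
            "Properties": {
                "livy.server.session.timeout-check": "true",
                "livy.server.session.timeout": "8h",
                "livy.server.yarn.app-lookup-timeout": "120s",
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)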