A pipeline comprises one or more nodes that are (in many cases) connected with each other to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email.
A generic pipeline comprises nodes that are implemented using generic components. In the current release, Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. What generic components have in common is that they are supported in every Elyra pipeline runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow.
In this introductory tutorial you will learn how to create a generic pipeline and run it in your local JupyterLab environment.
When you run a pipeline in your local environment, each Jupyter notebook or script is executed in a kernel on the machine where JupyterLab is running, such as your laptop. Since resources on that machine might be limited, local pipeline execution might not always be a viable option.
The Run generic pipelines on Kubeflow Pipelines tutorial and Run generic pipelines on Apache Airflow tutorial are similar to this tutorial but run the pipeline on Kubeflow Pipelines or Apache Airflow, enabling you to take advantage of shared compute resources in the cloud that might dramatically reduce pipeline processing time or allow for processing of much larger data volumes.
The tutorial instructions were last updated using Elyra version 3.0.
This tutorial uses the introduction to generic pipelines sample from the https://github.com/elyra-ai/examples GitHub repository.
- Launch JupyterLab.
  Note: When you start JupyterLab using the `jupyter lab` command, it loads the contents of the current working directory. We recommend starting JupyterLab from a new directory on your system that is not an existing git repository. This ensures you can clone the repository as described in the next step.
- Open the Git clone wizard (Git > Clone A Repository).
- Enter `https://github.com/elyra-ai/examples.git` as Clone URI.
- In the File Browser navigate to `examples/pipelines/introduction-to-generic-pipelines`.
  The cloned repository includes a set of files that download an open weather data set from the Data Asset Exchange, cleanse the data, analyze the data, and perform time-series predictions.
You are ready to start the tutorial.
- Open the Launcher (File > New Launcher) if it is not already open.
- Open the Generic Pipeline Editor to create a new untitled generic pipeline.
- In the JupyterLab File Browser panel, right-click the untitled pipeline and select ✎ Rename.
- Change the pipeline name to `hello-generic-world`.

To help others understand the purpose of the pipeline you should add a description.
- In the Visual Pipeline Editor open the properties panel on the right side.
- Select the Pipeline properties tab and enter a pipeline description.
- Close the properties panel.
Next, you'll add a file to the pipeline that downloads an open data set archive from public cloud storage.
This tutorial includes a Jupyter notebook `load_data.ipynb` and a Python script `load_data.py` that perform the same data loading task. For illustrative purposes the instructions use the notebook, but feel free to use the Python script. (The key takeaway is that you can mix and match notebooks and scripts, as desired.)
To add a notebook or script to the pipeline:
- Expand the component palette panel on the left-hand side. Note that there are multiple component entries, one for each supported file type.
- Drag the notebook component entry onto the canvas (or double-click the palette entry) and hover over the node. The error messages indicate that the node is not yet configured properly.
- Select the newly added node on the canvas, right-click, and select Open Properties from the context menu.
- Configure the node properties.
Some properties are only required when you plan to run the pipeline in a remote environment, such as Kubeflow Pipelines. However, it is considered good practice to always specify those properties to allow for easy migration from development (where you might run a pipeline locally) to test and production (where you would want to take advantage of resources that are not available to you in a local environment). Details are in the instructions below.
- Assign the node a descriptive label. If you leave the label empty, the file name (e.g. `load_data.ipynb`) will be used.
- Browse to the file location. Navigate to the `introduction-to-generic-pipelines` directory and select `load_data.ipynb`.
- As Runtime Image choose `Pandas`. The runtime image identifies the container image that is used to execute the notebook or Python script when the pipeline is run on Kubeflow Pipelines or Apache Airflow. This setting must always be specified but is ignored when you run the pipeline locally.
  If the container requires a specific minimum amount of resources during execution, you can specify them. If no custom requirements are defined, the defaults in the target runtime environment (Kubeflow Pipelines or Apache Airflow) are used.
If a notebook or script requires access to local files, such as Python scripts, you can specify them as File Dependencies. When you run a pipeline locally this setting is ignored because the notebook or script can access all (readable) files in your workspace. However, it is considered good practice to explicitly declare file dependencies to make the pipeline also runnable in environments where notebooks or scripts are executed isolated from each other.
- The `load_data` file does not have any input file dependencies. Leave the input field empty.
  If desired, you can customize additional inputs by defining environment variables.
- Click refresh to scan the file for environment variable references. Refer to the best practices for file-based pipeline nodes to learn more about how Elyra discovers environment variables in notebooks and scripts.
  It appears that `load_data` references two environment variables, `DATASET_URL` and `ELYRA_RUNTIME_ENV`. `DATASET_URL` must be set; this variable identifies the name and location of a data set file, which the notebook or script will download and decompress. `ELYRA_RUNTIME_ENV` is a read-only variable. For details refer to Proprietary environment variables.
- Assign environment variable `DATASET_URL` the appropriate value as shown below:
  `DATASET_URL=https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz`
If a notebook or script generates files that other notebooks or scripts require access to, specify them as Output Files. This setting is ignored if you are running a pipeline locally because all notebooks or scripts in a pipeline have access to the same shared file system. However, it is considered good practice to declare these files to make the pipeline also runnable in environments where notebooks or scripts are executed in isolation from each other.
- Declare an output file named `data/noaa-weather-data-jfk-airport/jfk_weather.csv`, which other notebooks in this pipeline consume.
  It is considered good practice to specify paths that are relative to the notebook or script location. (A sketch illustrating how the notebook might use these settings follows this list.)
- Close the node's properties view.
- Select the `load_data` node and attach a comment to it.
  The comment is automatically attached to the node.
- In the comment node enter a descriptive text, such as `Download the data`.
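The sketch below illustrates the pattern a data-loading file like `load_data` might follow with these settings: read `DATASET_URL` from the environment, download the archive, and extract it into a `data` directory relative to the file's own location, producing the declared output file. This is a simplified, assumed illustration for orientation only, not the actual contents of the sample notebook or script.

```python
# Illustrative sketch only -- the real load_data.ipynb / load_data.py in the
# examples repository may differ.
import os
import tarfile
import urllib.request
from pathlib import Path

# DATASET_URL is supplied through the node's Environment Variables setting.
dataset_url = os.environ["DATASET_URL"]

# ELYRA_RUNTIME_ENV is set by Elyra at run time and is read-only.
print("Runtime environment:", os.environ.get("ELYRA_RUNTIME_ENV", "unknown"))

# Download and decompress the archive into ./data (relative to this file),
# which yields data/noaa-weather-data-jfk-airport/jfk_weather.csv --
# the path declared as the node's output file.
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)
archive_path = data_dir / dataset_url.split("/")[-1]
urllib.request.urlretrieve(dataset_url, archive_path)
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=data_dir)
```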
Next, you'll add a data pre-processing notebook to the pipeline and connect it with the first notebook in such a way that it is executed after the first notebook. This notebook cleans the data in `data/noaa-weather-data-jfk-airport/jfk_weather.csv`, which `load_data` produced.
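As a mental model, the cleaning notebook follows the same file-based hand-off pattern: it reads the file the `load_data` node declared as an output and writes its own declared output file. The snippet below is a simplified, assumed illustration of that pattern, not the actual contents of `Part 1 - Data Cleaning.ipynb`.

```python
# Assumed illustration of the file-based hand-off between nodes; the actual
# cleaning steps in Part 1 - Data Cleaning.ipynb may differ.
import pandas as pd

# Input: declared as an output file of the upstream load_data node.
df = pd.read_csv("data/noaa-weather-data-jfk-airport/jfk_weather.csv")

# Hypothetical cleanup: drop fully empty columns and duplicate rows.
df = df.dropna(axis="columns", how="all").drop_duplicates()

# Output: the path declared in this node's Output Files setting.
df.to_csv("data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv", index=False)
```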
Earlier in this tutorial you added a (notebook or Python script) file component to the canvas using the palette. You can also add Jupyter notebooks, Python scripts, or R scripts to the canvas by drag and drop from the JupyterLab File Browser.
- From the JupyterLab File Browser drag and drop the `Part 1 - Data Cleaning.ipynb` notebook onto the canvas.
- Customize the file's execution properties as follows:
  - Runtime image: `Pandas`
  - Output files: `data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv`
- Attach a comment to the `Part 1 - Data Cleaning` node and enter a description, such as `Clean the data`.
- Connect the output port of the `load_data` node to the input port of the `Part 1 - Data Cleaning` node to establish a dependency between the two notebooks.
- Save the pipeline.
You are ready to run the pipeline.
When you run a pipeline locally, the files are executed on the machine where JupyterLab is running.
- Click `Run pipeline`.
- Accept the default values in the run dialog and start the run.
- Monitor the pipeline run in the JupyterLab console.

A message similar to the following is displayed in the pipeline editor window after the run completes.
A local pipeline run produces the following output artifacts:
- Each executed notebook is updated and includes the run results in the output cells.
- If any notebook persists data/files, they are stored in the local file system.
You can access output artifacts from the File Browser. In the screen capture below the pipeline output artifacts are highlighted.
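For example, to confirm that the run produced the expected artifacts, you could open a new notebook in the `introduction-to-generic-pipelines` directory and load one of the generated files. A minimal, illustrative check might look like this:

```python
# Quick, illustrative check of a generated output artifact.
import pandas as pd

cleaned = pd.read_csv("data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv")
print(cleaned.shape)   # number of rows and columns
print(cleaned.head())  # first few records
```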
Elyra provides a command line interface that you can use to manage metadata and work with pipelines.
To run a pipeline locally using the `elyra-pipeline` CLI:

- Open a terminal window that has access to the Elyra installation.
  $ elyra-pipeline --help
  Usage: elyra-pipeline [OPTIONS] COMMAND [ARGS]...
- Run the pipeline.
  $ elyra-pipeline run hello-generic-world.pipeline
The CLI does not require JupyterLab to be running.
This concludes the introduction to generic pipelines tutorial. You've learned how to
- create a generic pipeline
- add and configure Jupyter notebooks or scripts
- run the pipeline in a local environment from the pipeline editor
- run the pipeline in a local environment using the command line interface
- monitor the pipeline run progress
- inspect the pipeline run results
If you'd like, you can extend the pipeline by adding two more notebooks, which can be executed in parallel after notebook `Part 1 - Data Cleaning.ipynb` was processed:

- `Part 2 - Data Analysis.ipynb`
- `Part 3 - Time Series Forecasting.ipynb`

Each of these notebooks can run in the `Pandas` container image, has no input file dependencies, requires no environment variables, and produces no additional output files.