Getting started with generic pipelines

A pipeline comprises one or more nodes that are (in many cases) connected with each other to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email.

A basic pipeline

A generic pipeline comprises nodes that are implemented using generic components. In the current release, Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. What these components have in common is that they are supported in every Elyra pipeline runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow.

Generic pipelines and supported runtime environments

In this introductory tutorial you will learn how to create a generic pipeline and run it in your local JupyterLab environment.

The completed tutorial pipeline

When you run a pipeline in your local environment, each Jupyter notebook or script is executed in a kernel on the machine where JupyterLab is running, such as your laptop. Since resources on that machine might be limited, local pipeline execution might not always be a viable option.

The Run generic pipelines on Kubeflow Pipelines tutorial and Run generic pipelines on Apache Airflow tutorial are similar to this one but run the pipeline on Kubeflow Pipelines or Apache Airflow. This lets you take advantage of shared compute resources in the cloud, which can dramatically reduce pipeline processing time or allow for processing of much larger data volumes.

Prerequisites

The tutorial instructions were last updated using Elyra version 3.0.
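
The tutorial assumes a working JupyterLab installation with the Elyra extension. If you are not sure which Elyra version is installed in your environment, a quick check (assuming Elyra was installed with pip) is:

    $ pip show elyra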

Setup

This tutorial uses the introduction to generic pipelines sample from the https://github.com/elyra-ai/examples GitHub repository.

  1. Launch JupyterLab.

    Note: When you start JupyterLab using the jupyter lab command, it loads the contents of the current working directory.
    We therefore recommend starting JupyterLab from a new directory on your system that is not already a git repository. This ensures you can clone the repository as described in the next step. (A shell-based alternative to steps 1 through 3 is sketched after this list.)

  2. Open the Git clone wizard (Git > Clone A Repository).

  3. Enter https://github.com/elyra-ai/examples.git as Clone URI.

  4. In the File Browser navigate to examples/pipelines/introduction-to-generic-pipelines.

    Tutorial assets in File Browser

    The cloned repository includes a set of files that download an open weather data set from the Data Asset Exchange, cleanse the data, analyze the data, and perform time-series predictions.
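
If you prefer working from a terminal, the following commands are a rough equivalent of steps 1 through 3 (the directory name elyra-tutorial is only an example):

    $ mkdir elyra-tutorial && cd elyra-tutorial
    $ git clone https://github.com/elyra-ai/examples.git
    $ jupyter lab

JupyterLab then starts with the cloned examples directory in its File Browser, and you can continue with step 4.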

You are ready to start the tutorial.

Create a generic pipeline

  1. Open the Launcher (File > New Launcher) if it is not already open.

    Open the JupyterLab launcher

  2. Open the Generic Pipeline Editor to create a new untitled generic pipeline.

    Visual pipeline editor

  3. In the JupyterLab File Browser panel, right click on the untitled pipeline, and select ✎ Rename.

    Rename pipeline

  4. Change the pipeline name to hello-generic-world.

    To help others understand the purpose of the pipeline you should add a description.

  5. In the Visual Pipeline Editor open the properties panel on the right side.

    Open the properties panel

  6. Select the Pipeline properties tab and enter a pipeline description.

    Add pipeline description

  7. Close the properties panel.

Next, you'll add a file to the pipeline that downloads an open data set archive from public cloud storage.

Add a notebook or script to the pipeline

This tutorial includes a Jupyter notebook load_data.ipynb and a Python script load_data.py that perform the same data loading task.

For illustrative purposes the instructions use the notebook, but feel free to use the Python script. (The key takeaway is that you can mix and match notebooks and scripts, as desired.)

To add a notebook or script to the pipeline:

  1. Expand the component palette panel on the left-hand side. Note that there are multiple component entries, one for each supported file type.

    Empty generic pipeline

  2. Drag the notebook component entry onto the canvas (or double-click on a palette entry) and hover over the node. The error messages indicate that the node is not yet configured properly.

    Component configuration error

  3. Select the newly added node on the canvas, right click, and select Open Properties from the context menu.

    Open node properties

  4. Configure the node properties.

    Configure node properties

    Some properties are only required when you plan to run the pipeline in a remote environment, such as Kubeflow Pipelines. However, it is considered good practice to always specify those properties to allow for easy migration from development (where you might run a pipeline locally) to test and production (where you would want to take advantage of resources that are not available to you in a local environment). Details are in the instructions below.

  5. Assign the node a descriptive label. If you leave the label empty, the file name (e.g. load_data.ipynb) will be used.

    Edit node name

  6. Browse to the file location. Navigate to the introduction-to-generic-pipelines directory and select load_data.ipynb.

    Select file to run

  7. As Runtime Image choose Pandas. The runtime image identifies the container image that is used to execute the notebook or Python script when the pipeline is run on Kubeflow Pipelines or Apache Airflow. This setting must always be specified but is ignored when you run the pipeline locally.

    Configure runtime image

    If the container requires a specific minimum amount of resources during execution, you can specify them.

    Customize container resources

    If no custom requirements are defined, the defaults in the target runtime environment (Kubeflow Pipelines or Apache Airflow) are used.

    If a notebook or script requires access to local files, such as Python scripts, you can specify them as File Dependencies. When you run a pipeline locally this setting is ignored because the notebook or script can access all (readable) files in your workspace. However, it is considered good practice to explicitly declare file dependencies to make the pipeline also runnable in environments where notebooks or scripts are executed in isolation from each other.

  8. The load_data file does not have any input file dependencies. Leave the input field empty.

    Configure file dependencies

    If desired, you can customize additional inputs by defining environment variables.

  9. Click refresh to scan the file for environment variable references. Refer to the best practices for file-based pipeline nodes to learn more about how Elyra discovers environment variables in notebooks and scripts.

    Scan file for environment variables

    The scan shows that load_data references two environment variables: DATASET_URL and ELYRA_RUNTIME_ENV. DATASET_URL must be set; it identifies the name and location of a data set file, which the notebook or script will download and decompress. ELYRA_RUNTIME_ENV is a read-only variable. For details refer to Proprietary environment variables. (A sketch of how a script might consume DATASET_URL follows after this list.)

  10. Assign environment variable DATASET_URL the appropriate value as shown below:

    DATASET_URL=https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz
    

    Configure environment variables

    If a notebook or script generates files that other notebooks or scripts require access to, specify them as Output Files. This setting is ignored if you are running a pipeline locally because all notebooks or scripts in a pipeline have access to the same shared file system. However, it is considered good practice to declare these files to make the pipeline also runnable in environments where notebooks or scripts are executed in isolation from each other.

  11. Declare an output file named data/noaa-weather-data-jfk-airport/jfk_weather.csv, which other notebooks in this pipeline consume.

    Configure output files

    It is considered good practice to specify paths that are relative to the notebook or script location.

  12. Close the node's properties view.

  13. Select the load_data node and attach a comment to it.

    Add node comment

    The comment is automatically attached to the node.

  14. In the comment node enter a descriptive text, such as Download the data.

    Add node comment
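
To make these node properties more concrete, here is a purely hypothetical sketch of what a data loading step like load_data might do with the DATASET_URL environment variable and the declared output file. The actual notebook and script in the examples repository may differ in their details.

    import os
    import tarfile
    import urllib.request
    from pathlib import Path

    # DATASET_URL is provided to the node as an environment variable
    dataset_url = os.getenv("DATASET_URL")
    if not dataset_url:
        raise ValueError("DATASET_URL is not set")

    # Download the archive into a local data/ directory ...
    data_dir = Path("data")
    data_dir.mkdir(exist_ok=True)
    archive = data_dir / os.path.basename(dataset_url)
    urllib.request.urlretrieve(dataset_url, archive)

    # ... and decompress it, yielding the declared output file
    # data/noaa-weather-data-jfk-airport/jfk_weather.csv
    with tarfile.open(archive) as tar:
        tar.extractall(path=data_dir)

Because an environment variable, rather than a hard-coded URL, identifies the data set, the same node can be pointed at a different archive without editing the notebook or script.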

Next, you'll add a data pre-processing notebook to the pipeline and connect it to the first notebook so that it is executed after the first notebook completes. This notebook cleans the data in data/noaa-weather-data-jfk-airport/jfk_weather.csv, which load_data produced.

Add a second file to the pipeline

Earlier in this tutorial you added a notebook or Python script to the canvas using the palette. You can also add Jupyter notebooks, Python scripts, or R scripts to the canvas by dragging and dropping them from the JupyterLab File Browser.

  1. From the JupyterLab File Browser drag and drop the Part 1 - Data Cleaning.ipynb notebook onto the canvas.

    Add file from file browser

  2. Customize the file's execution properties as follows:

    • Runtime image: Pandas
    • Output files: data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv
  3. Attach a comment to the Part 1 - Data Cleaning node and enter a description, such as Clean the data.

  4. Connect the output port of the load_data node to the input port of the Part 1 - Data Cleaning node to establish a dependency between the two notebooks. (A sketch of what such a cleaning step might look like follows after this list.)

    Connect two notebook nodes

  5. Save the pipeline.

    Save pipeline
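
As a purely illustrative sketch, a cleaning step like Part 1 - Data Cleaning reads the CSV file produced by load_data and writes the cleaned result to the path declared under Output files; the actual notebook performs its own, more elaborate cleaning.

    import pandas as pd

    # Input produced by the upstream load_data node
    df = pd.read_csv("data/noaa-weather-data-jfk-airport/jfk_weather.csv")

    # Example cleanup only; the real notebook applies different transformations
    df = df.drop_duplicates()

    # Output file declared in the node properties
    df.to_csv("data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv", index=False)

The node connection you just created ensures that load_data has produced this input file before the cleaning step runs.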

You are ready to run the pipeline.

Run a generic pipeline from the pipeline editor

When you run a pipeline locally, the files are executed on the machine where JupyterLab is running.

  1. Click Run pipeline.

    Run pipeline

  2. Accept the default values in the run dialog and start the run.

    Run pipeline locally

  3. Monitor the pipeline run in the JupyterLab console.

    Monitor pipeline run

    A message similar to the following is displayed in the pipeline editor window after the run completes.

    Run pipeline locally

Inspect the pipeline run results

A local pipeline run produces the following output artifacts:

  • Each executed notebook is updated and includes the run results in the output cells.
  • If any notebook persists data or files, they are stored in the local file system.

You can access output artifacts from the File Browser. In the screen capture below the pipeline output artifacts are highlighted.

View local pipeline run output artifacts

Run a generic pipeline using the CLI

Elyra provides a command line interface that you can use to manage metadata and work with pipelines.

To run a pipeline locally using the elyra-pipeline CLI:

  1. Open a terminal window that has access to the Elyra installation.

    $ elyra-pipeline --help
       Usage: elyra-pipeline [OPTIONS] COMMAND [ARGS]...
    
  2. Run the pipeline.

    $ elyra-pipeline run hello-generic-world.pipeline 
    

    The CLI does not require JupyterLab to be running.
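
The available subcommands and options depend on your Elyra version; to see what the run command accepts in your installation, ask the CLI for help:

    $ elyra-pipeline run --help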

Next steps

This concludes the introduction to generic pipelines tutorial. You've learned how to

  • create a generic pipeline
  • add and configure Jupyter notebooks or scripts
  • run the pipeline in a local environment from the pipeline editor
  • run the pipeline in a local environment using the command line interface
  • monitor the pipeline run progress
  • inspect the pipeline run results

If you'd like, you can extend the pipeline by adding two more notebooks, which can be executed in parallel after the Part 1 - Data Cleaning.ipynb notebook has been processed:

  • Part 2 - Data Analysis.ipynb
  • Part 3 - Time Series Forecasting.ipynb

Each of these notebooks runs in the Pandas container image, has no input file dependencies, requires no environment variables, and produces no additional output files.

The completed tutorial pipeline

Resources