Kubeflow demo - Yelp restaurant reviews

This repository contains a demonstration of Kubeflow capabilities, suitable for presentation to public audiences.

The base demo includes the following steps:

  1. Set up your environment
  2. Run training on CPUs
  3. Run training on TPUs
  4. Create the serving and UI components
  5. Bring up a notebook
  6. Run a simple pipeline
  7. Perform hyperparameter tuning
  8. Run a better pipeline
  9. Cleanup

1. Set up your environment

Follow the instructions in demo_setup/README.md to set up your environment and install Kubeflow with pipelines on an auto-provisioning GKE cluster with support for GPUs and TPUs. Note: This was tested using the v0.3.4-rc.1 branch with a cherry-pick of #1955.

View the installed components in the GCP Console.

  • In the Kubernetes Engine section, you will see a new cluster ${CLUSTER} with 3 n1-standard-1 nodes.
  • Under Workloads, you will see all the default Kubeflow and pipeline components.

Source the environment file and activate the conda environment for pipelines:

source kubeflow-demo-base.env
source activate kfp
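
Later steps reference several variables, such as ${DEMO_REPO}, ${CLUSTER}, ${GCS_TRAINING_OUTPUT_DIR_CPU}, and ${GCS_TRAINING_OUTPUT_DIR_TPU}. The following is a minimal sketch of a sanity check to run after sourcing the environment file. It assumes those variables are defined by kubeflow-demo-base.env; the helper name require_env is hypothetical and not part of the demo repository.

```shell
# Hypothetical helper (not part of the demo repo): report every variable
# that is unset or empty, and return non-zero if at least one is missing.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup written so that it also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_env DEMO_REPO CLUSTER GCS_TRAINING_OUTPUT_DIR_CPU GCS_TRAINING_OUTPUT_DIR_TPU` prints the name of each missing variable and returns a non-zero status if any are unset.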

2. Run training on CPUs

Navigate to the ksonnet app directory created by kfctl and retrieve the following files for the t2tcpu & t2ttpu jobs:

cd ks_app
cp ${DEMO_REPO}/demo/components/t2t*pu.* components
cp ${DEMO_REPO}/demo/components/params.* components

Set parameter values for training:

ks param set t2tcpu outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_CPU}

Generate manifests and apply to cluster:

ks apply default -c t2tcpu

View the new training pod and wait until it has a Running status:

kubectl get pod -l tf_job_name=t2tcpu
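
If you would rather not re-run the command by hand, the wait can be scripted. This is a rough sketch only: the function name wait_for_running and its default attempt count and interval are assumptions chosen for illustration, and it inspects only the first pod that matches the selector.

```shell
# Poll until the first pod matching a label selector reports phase Running.
# Arguments: selector, max attempts (default 60), seconds between polls (default 5).
wait_for_running() {
  selector=$1
  attempts=${2:-60}
  interval=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # An empty result (for example, no matching pod yet) is treated as "not ready".
    phase=$(kubectl get pod -l "$selector" \
      -o jsonpath='{.items[0].status.phase}' 2>/dev/null) || phase=""
    if [ "$phase" = "Running" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "pod with label $selector not Running after $attempts attempts" >&2
  return 1
}
```

For the CPU job, this would be called as `wait_for_running tf_job_name=t2tcpu`.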

View the logs to watch training commence:

kubectl logs -f t2tcpu-master-0 | grep INFO:tensorflow

3. Run training on TPUs

Set parameter values for training:

ks param set t2ttpu outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_TPU}

Kick off training:

ks apply default -c t2ttpu

Verify that a TPU is being provisioned by viewing the pod status. The pod should remain in the Pending state for 3-4 minutes with the message Creating Cloud TPUs for pod default/t2ttpu-master-0.

kubectl describe pod t2ttpu-master-0

Once it has Running status, view the logs to watch training commence:

kubectl logs -f t2ttpu-master-0 | grep INFO:tensorflow

4. Create the serving and UI components

Retrieve the following files for the serving & UI components:

cp ${DEMO_REPO}/demo/components/serving.* components
cp ${DEMO_REPO}/demo/components/ui.* components

Create the serving and UI components:

ks apply default -c serving -c ui

Connect to the UI by forwarding a port to the ambassador service:

kubectl port-forward svc/ambassador 8080:80

Optional: If necessary, set up an SSH tunnel from your local laptop to the compute instance that connects to GKE:

ssh ${HOST} -L 8080:localhost:8080

To show the naive version, navigate to localhost:8080/kubeflow_demo/ from a browser.

To show the ML version, navigate to localhost:8080/kubeflow_demo/kubeflow from a browser.
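
Before presenting, it can help to confirm from a terminal that both pages respond through the port-forward. This sketch assumes curl is installed; the function names http_status and check_demo_ui are hypothetical, and the status codes you see depend on your deployment.

```shell
# Print the HTTP status code returned for a URL ("000" if the connection fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Check both demo pages exposed through the port-forward on localhost:8080.
check_demo_ui() {
  for path in /kubeflow_demo/ /kubeflow_demo/kubeflow; do
    code=$(http_status "localhost:8080$path") || true
    echo "$path -> $code"
  done
}
```

Running `check_demo_ui` prints one line per page with the status code that was returned.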

5. Bring up a notebook

Open a browser and connect to the Central Dashboard at localhost:8080/. Show the TFJob dashboard, then click on JupyterHub. Log in with any username and password combination and wait until the page refreshes. Spawn a new pod with these resource requirements:

| Resource | Value |
| --- | --- |
| Image | gcr.io/kubeflow-images-public/tensorflow-1.7.0-notebook-gpu:v0.2.1 |
| CPU | 2 |
| Memory | 48G |
| Extra Resource Limits | {"nvidia.com/gpu":2} |

It will take a while for the pod to spawn. While you're waiting, watch for autoprovisioning to occur by viewing the Workload and Node status in the GCP Console.

Once the notebook environment is available, open a new terminal and upload this Yelp notebook.

Ensure the kernel is set to Python 2, then execute the notebook.

6. Run a simple pipeline

Show the file gpu-example-pipeline.py as an example of a simple pipeline.

Compile it to create a .tar.gz file:

./gpu-example-pipeline.py

View the pipelines UI locally by forwarding a port to the ml-pipeline-ui service:

kubectl port-forward svc/ml-pipeline-ui 8081:80

In the browser, navigate to localhost:8081 and create a new pipeline by uploading gpu-example-pipeline.py.tar.gz. Select the pipeline and click Create experiment. Use all suggested defaults.

View the effects of autoprovisioning by watching the number of nodes increase.

Select Experiments from the left-hand side, then Runs. Click on the experiment run to view the graph and watch it execute.

View the container logs for the training step and take note of the low accuracy (~0.113).

7. Perform hyperparameter tuning

To find parameters that result in higher accuracy, use Katib to execute a Study, which defines a search space for performing training with a range of different parameters.

Create a Study by applying an example file to the cluster:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/gpu-example.yaml

This creates a StudyJob object. To view it:

kubectl get studyjob
kubectl describe studyjobs gpu-example

To view the Katib UI, forward a port to the katib-ui service:

kubectl port-forward svc/katib-ui 8082:80

In the browser, navigate to localhost:8082/katib and click on the gpu-example project. In the Explore Visualizations section, select Optimizer in the Group By dropdown, then click Compare.

View the creation of a new GPU node pool:

gcloud container node-pools list --cluster ${CLUSTER}

View the creation of new nodes:

kubectl get nodes

In the Katib UI, interact with the various graphs to determine which combination of parameters results in the highest accuracy. Grouping by optimizer type is one way to find consistently higher accuracies. Gather a set of parameters to use in a new run of the pipeline.

8. Run a better pipeline

In the pipelines UI, clone the previous experiment run and update the arguments to match the parameters for one of the runs with higher accuracies from the Katib UI. Execute the pipeline and watch for the resulting accuracy, which should be closer to 0.98.

Approximately 5 minutes after the last run completes, check the cluster nodes to verify that the GPU nodes have been removed.
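
One way to check this from the command line is to count the nodes that carry an accelerator label. The sketch below assumes that GPU nodes on GKE are labeled with cloud.google.com/gke-accelerator; the function name gpu_node_count is hypothetical.

```shell
# Count nodes that carry the GKE accelerator label (assumed to mark GPU nodes).
# Prints 0 when no such nodes exist.
gpu_node_count() {
  kubectl get nodes -l cloud.google.com/gke-accelerator --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}
```

Once scale-down has finished, `gpu_node_count` should print 0.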

9. Cleanup

From the application directory created by kfctl, issue a cleanup command:

kfctl delete k8s

The cluster will scale back down to the default node pool, removing all nodes created by node auto-provisioning (NAP).