Intel SKLearn Optimization Doc Update for 10/18 webinar #115443

13 changes: 13 additions & 0 deletions articles/machine-learning/how-to-train-scikit-learn.md
@@ -107,6 +107,11 @@ Next, use the YAML file to create and register this custom environment in your w

For more information on creating and using environments, see [Create and use software environments in Azure Machine Learning](how-to-use-environments.md).

##### [Optional] Create a custom environment with Intel® Extension for Scikit-Learn

Want to speed up your scikit-learn scripts on Intel hardware? Try adding [Intel® Extension for Scikit-Learn](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html) to your conda YAML file and following the steps above. We'll show you how to enable these optimizations later in this article:
[!notebook-python[](~/azureml-examples-main/sdk/python/jobs/single-step/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train-hyperparameter-tune-with-sklearn.ipynb?name=make_sklearnex_conda_file)]
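If you can't open the referenced notebook cell, the change amounts to adding the `scikit-learn-intelex` package to the conda specification that the notebook writes out with `%%writefile`. The following is a minimal sketch only: the `{dependencies_dir}` variable, the file name, the channels, and the package list are illustrative assumptions, not the exact contents of the sample notebook.

```python
%%writefile {dependencies_dir}/conda.yaml
# Illustrative environment specification -- names, channels, and versions are assumptions.
name: sklearn-intelex-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - scikit-learn-intelex   # enables Intel® Extension for Scikit-Learn
      - scikit-learn
      - pandas
      - mlflow
      - azureml-mlflow
```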

## Configure and submit your training job

In this section, we'll cover how to run a training job, using a training script that we've provided. To begin, you'll build the training job by configuring the command for running the training script. Then, you'll submit the training job to run in Azure Machine Learning.
@@ -132,6 +137,14 @@ Next, create the script file in the source directory.

[!notebook-python[](~/azureml-examples-main/sdk/python/jobs/single-step/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train-hyperparameter-tune-with-sklearn.ipynb?name=create_script_file)]

#### [Optional] Enable Intel® Extension for Scikit-Learn optimizations for more performance on Intel hardware

If you have installed Intel® Extension for Scikit-Learn (as demonstrated in the previous section), you can enable the performance optimizations by adding two lines of code to the top of the script file, as shown below.

To learn more about Intel® Extension for Scikit-Learn, visit the package's [documentation](https://intel.github.io/scikit-learn-intelex/).

[!notebook-python[](~/azureml-examples-main/sdk/python/jobs/single-step/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train-hyperparameter-tune-with-sklearn.ipynb?name=create_sklearnex_script_file)]
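For reference, the enablement pattern consists of the same two lines used in the tutorial script later in this PR, placed before the scikit-learn imports they are meant to accelerate:

```python
# Patch scikit-learn with Intel-optimized implementations where available.
# Run these two lines before importing the scikit-learn estimators you use.
from sklearnex import patch_sklearn
patch_sklearn()
```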

### Build the training job

Now that you have all the assets required to run your job, it's time to build it using the Azure Machine Learning Python SDK v2. For this, we'll be creating a `command`.
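The actual `command` definition sits outside this hunk, but for orientation, a command job in the Python SDK v2 generally looks like the sketch below. The script name, input parameters, environment, and compute target are placeholders, not the values used in this article.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Hypothetical workspace details -- substitute your own.
ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

# Configure the command that runs the training script (placeholder names throughout).
job = command(
    code="./src",  # folder containing the training script
    command="python train.py --data ${{inputs.data}}",
    inputs={"data": "<path-or-uri-to-training-data>"},
    environment="<environment-name>@latest",
    compute="<compute-cluster-name>",
    display_name="sklearn-training-job",
)

# Submit the job to Azure Machine Learning.
ml_client.create_or_update(job)
```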
121 changes: 121 additions & 0 deletions articles/machine-learning/tutorial-azure-ml-in-a-day.md
@@ -219,6 +219,127 @@ You might need to select **Refresh** to see the new folder and script in your **

:::image type="content" source="media/tutorial-azure-ml-in-a-day/refresh.png" alt-text="Screenshot shows the refresh icon.":::

### [Optional] Enable Intel® Extension for Scikit-Learn optimizations for more performance on Intel hardware

Want to speed up your scikit-learn scripts on Intel hardware? Try enabling [Intel® Extension for Scikit-Learn](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html) in your training script. Intel® Extension for Scikit-Learn is already installed in the Azure Machine Learning curated environment used in this tutorial, so no additional installation is needed.

To learn more about Intel® Extension for Scikit-Learn, visit the package's [documentation](https://intel.github.io/scikit-learn-intelex/).

If you want to use Intel® Extension for Scikit-Learn as part of the training script described above, you can enable the performance optimizations by adding two lines of code to the top of the script file, as shown below.


```python
%%writefile {train_src_dir}/main.py
import os
import argparse

# Import and enable Intel Extension for Scikit-learn optimizations
# where possible
from sklearnex import patch_sklearn
patch_sklearn()

import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)

    credit_df = pd.read_csv(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()
```


## Create a compute cluster, a scalable way to run a training job

> [!NOTE]