From 724817d3d7639a3d089cc371bec4ec2d098cc0b3 Mon Sep 17 00:00:00 2001 From: kennethmhc Date: Tue, 15 Oct 2024 10:36:17 +0200 Subject: [PATCH] [FSTORE-1509] Feature logging (#276) * fraud batch logging * fix online fraud * async feature logging * online logging * fix batch * log online model * remove redundant import * use hw model --- .../1_fraud_batch_feature_pipeline.ipynb | 78 +++++++-------- .../2_fraud_batch_training_pipeline.ipynb | 6 +- fraud_batch/3_fraud_batch_inference.ipynb | 99 ++++++++++++++----- .../2_fraud_online_training_pipeline.ipynb | 28 ++++-- .../3_fraud_online_inference_pipeline.ipynb | 59 ++++++++++- 5 files changed, 194 insertions(+), 76 deletions(-) diff --git a/fraud_batch/1_fraud_batch_feature_pipeline.ipynb b/fraud_batch/1_fraud_batch_feature_pipeline.ipynb index 1c9767d8..0e731f67 100644 --- a/fraud_batch/1_fraud_batch_feature_pipeline.ipynb +++ b/fraud_batch/1_fraud_batch_feature_pipeline.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "f60c52ce", + "id": "5d5afc66", "metadata": { "tags": [] }, @@ -27,7 +27,7 @@ }, { "cell_type": "markdown", - "id": "7ec7724a", + "id": "d20eee58", "metadata": {}, "source": [ "## 📝 Imports" @@ -36,7 +36,7 @@ { "cell_type": "code", "execution_count": null, - "id": "42efa933", + "id": "802d6425", "metadata": {}, "outputs": [], "source": [ @@ -46,7 +46,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a4120a95", + "id": "bd55a691", "metadata": {}, "outputs": [], "source": [ @@ -64,7 +64,7 @@ }, { "cell_type": "markdown", - "id": "010286f1", + "id": "879a539d", "metadata": {}, "source": [ "## 💽 Loading the Data \n", @@ -84,7 +84,7 @@ { "cell_type": "code", "execution_count": null, - "id": "491a37b4", + "id": "586d4111", "metadata": {}, "outputs": [], "source": [ @@ -100,7 +100,7 @@ { "cell_type": "code", "execution_count": null, - "id": "9ed8c4f0", + "id": "0cdb9c3d", "metadata": {}, "outputs": [], "source": [ @@ -118,7 +118,7 @@ { "cell_type": "code", "execution_count": null, - "id": "2543c511", + "id": "5a6fb4f4", "metadata": {}, "outputs": [], "source": [ @@ -135,7 +135,7 @@ }, { "cell_type": "markdown", - "id": "717f9102", + "id": "fbb301ce", "metadata": {}, "source": [ "---" @@ -143,7 +143,7 @@ }, { "cell_type": "markdown", - "id": "16357302", + "id": "bf6f9299", "metadata": {}, "source": [ "## 🛠️ Feature Engineering \n", @@ -158,7 +158,7 @@ { "cell_type": "code", "execution_count": null, - "id": "0b919bf8", + "id": "d1af85e4", "metadata": {}, "outputs": [], "source": [ @@ -181,7 +181,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5d6fd9ae", + "id": "4956cb52", "metadata": {}, "outputs": [], "source": [ @@ -191,7 +191,7 @@ }, { "cell_type": "markdown", - "id": "92820fab", + "id": "bc4e505f", "metadata": {}, "source": [ "Next, you will create features that for each credit card aggregate data from multiple time steps.\n", @@ -203,7 +203,7 @@ { "cell_type": "code", "execution_count": null, - "id": "47116f13", + "id": "ee689954", "metadata": {}, "outputs": [], "source": [ @@ -223,7 +223,7 @@ }, { "cell_type": "markdown", - "id": "0b3abcfc", + "id": "ed7039bb", "metadata": {}, "source": [ "Next lets compute windowed aggregates. Here you will use 4-hour windows, but feel free to experiment with different window lengths by setting `window_len` below to a value of your choice." @@ -232,7 +232,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f9bf9dd7", + "id": "791708c4", "metadata": {}, "outputs": [], "source": [ @@ -248,7 +248,7 @@ }, { "cell_type": "markdown", - "id": "b3021fa4", + "id": "391e478f", "metadata": {}, "source": [ "### ⚙️ Convert date time object to unix epoch in milliseconds " @@ -257,7 +257,7 @@ { "cell_type": "code", "execution_count": null, - "id": "837bafd5", + "id": "bc6bcae3", "metadata": {}, "outputs": [], "source": [ @@ -270,7 +270,7 @@ }, { "cell_type": "markdown", - "id": "a3df26a9", + "id": "bc3fec23", "metadata": {}, "source": [ "## 👮🏻‍♂️ Great Expectations " @@ -279,7 +279,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c9391579", + "id": "cbc3ca3c", "metadata": {}, "outputs": [], "source": [ @@ -299,7 +299,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8969ff27", + "id": "b3891fe5", "metadata": {}, "outputs": [], "source": [ @@ -340,7 +340,7 @@ }, { "cell_type": "markdown", - "id": "adf03efd", + "id": "da2f174f", "metadata": {}, "source": [ "---" @@ -348,7 +348,7 @@ }, { "cell_type": "markdown", - "id": "7c069f5a", + "id": "c48bbff9", "metadata": {}, "source": [ "## 📡 Connecting to Hopsworks Feature Store " @@ -356,7 +356,7 @@ }, { "cell_type": "markdown", - "id": "ac32437d", + "id": "4d68f207", "metadata": { "tags": [] }, @@ -373,7 +373,7 @@ { "cell_type": "code", "execution_count": null, - "id": "287491fa", + "id": "4c259c35", "metadata": {}, "outputs": [], "source": [ @@ -386,7 +386,7 @@ }, { "cell_type": "markdown", - "id": "23b61089", + "id": "0f80ccc5", "metadata": {}, "source": [ "To create a feature group you need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group and a version number, if it is not defined it will automatically be incremented to `1`." @@ -395,7 +395,7 @@ { "cell_type": "code", "execution_count": null, - "id": "58353322", + "id": "c78c614a", "metadata": {}, "outputs": [], "source": [ @@ -412,7 +412,7 @@ }, { "cell_type": "markdown", - "id": "832333fa", + "id": "45fcf5c3", "metadata": {}, "source": [ "A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n", @@ -423,7 +423,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7f82e4a2", + "id": "e97b5aab", "metadata": {}, "outputs": [], "source": [ @@ -435,7 +435,7 @@ { "cell_type": "code", "execution_count": null, - "id": "0c6832a9", + "id": "2f6cc0c0", "metadata": {}, "outputs": [], "source": [ @@ -462,7 +462,7 @@ }, { "cell_type": "markdown", - "id": "36f7e30a", + "id": "a57846b4", "metadata": {}, "source": [ "At the creation of the feature group, you will be prompted with an URL that will directly link to it; there you will be able to explore some of the aspects of your newly created feature group.\n", @@ -472,7 +472,7 @@ }, { "cell_type": "markdown", - "id": "1250c3ce", + "id": "777e6d4e", "metadata": {}, "source": [ "You can move on and do the same thing for the feature group with our windows aggregation." @@ -481,7 +481,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6b8ef9f6", + "id": "35ca5bbb", "metadata": {}, "outputs": [], "source": [ @@ -498,7 +498,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1970657b", + "id": "e16aa93e", "metadata": {}, "outputs": [], "source": [ @@ -510,7 +510,7 @@ { "cell_type": "code", "execution_count": null, - "id": "68b0c84f", + "id": "2d80f5e2", "metadata": {}, "outputs": [], "source": [ @@ -530,7 +530,7 @@ }, { "cell_type": "markdown", - "id": "e22dce87", + "id": "eb2254b9", "metadata": {}, "source": [ "Both feature groups are now accessible and searchable in the UI\n", @@ -540,7 +540,7 @@ }, { "cell_type": "markdown", - "id": "4f8b5d8a", + "id": "c6e783a2", "metadata": {}, "source": [ "## ⏭️ **Next:** Part 02: Training Pipeline\n", @@ -555,7 +555,7 @@ "hash": "e1ddeae6eefc765c17da80d38ea59b893ab18c0c0904077a035ef84cfe367f83" }, "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -569,7 +569,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.13" } }, "nbformat": 4, diff --git a/fraud_batch/2_fraud_batch_training_pipeline.ipynb b/fraud_batch/2_fraud_batch_training_pipeline.ipynb index a1252678..2bc7e8e4 100644 --- a/fraud_batch/2_fraud_batch_training_pipeline.ipynb +++ b/fraud_batch/2_fraud_batch_training_pipeline.ipynb @@ -186,7 +186,7 @@ " version=1,\n", " query=selected_features,\n", " labels=[\"fraud_label\"],\n", - " transformation_functions=transformation_functions,\n", + " transformation_functions=[label_encoder(\"category\")],\n", ")" ] }, @@ -482,7 +482,7 @@ "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" }, "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -496,7 +496,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.13" } }, "nbformat": 4, diff --git a/fraud_batch/3_fraud_batch_inference.ipynb b/fraud_batch/3_fraud_batch_inference.ipynb index 6b72e59d..3bcba98f 100644 --- a/fraud_batch/3_fraud_batch_inference.ipynb +++ b/fraud_batch/3_fraud_batch_inference.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "0be78559", + "id": "9cd70fcf", "metadata": {}, "source": [ "# **Hopsworks Feature Store** - Part 03: Batch Inference\n", @@ -16,7 +16,7 @@ }, { "cell_type": "markdown", - "id": "43743c5d", + "id": "c2921fbf", "metadata": {}, "source": [ "## 📝 Imports" @@ -25,7 +25,7 @@ { "cell_type": "code", "execution_count": null, - "id": "13f0abf3", + "id": "a8d22c2b", "metadata": {}, "outputs": [], "source": [ @@ -34,7 +34,7 @@ }, { "cell_type": "markdown", - "id": "6a310165", + "id": "585a3654", "metadata": {}, "source": [ "## 📡 Connecting to Hopsworks Feature Store " @@ -43,7 +43,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3fe3e577", + "id": "8c21b5bd", "metadata": {}, "outputs": [], "source": [ @@ -56,7 +56,7 @@ }, { "cell_type": "markdown", - "id": "f5b8fa76", + "id": "76088934", "metadata": {}, "source": [ "## ⚙️ Feature View Retrieval\n" @@ -65,7 +65,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4cd9db18", + "id": "a98bdc04", "metadata": {}, "outputs": [], "source": [ @@ -78,7 +78,7 @@ }, { "cell_type": "markdown", - "id": "0067bc06", + "id": "959dfccb", "metadata": {}, "source": [ "## 🗄 Model Registry\n" @@ -87,7 +87,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4e6b2247", + "id": "c41321c4", "metadata": {}, "outputs": [], "source": [ @@ -97,7 +97,7 @@ }, { "cell_type": "markdown", - "id": "05dcb148", + "id": "0c7294fa", "metadata": {}, "source": [ "## 🚀 Fetch and test the model\n", @@ -108,7 +108,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8a34352e", + "id": "3f7a9919", "metadata": {}, "outputs": [], "source": [ @@ -125,7 +125,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d27adabf", + "id": "30a49376", "metadata": {}, "outputs": [], "source": [ @@ -139,28 +139,29 @@ }, { "cell_type": "markdown", - "id": "ae9b9f92", + "id": "4e6ff0c5", "metadata": {}, "source": [ "---\n", - "## 🔮 Batch Prediction \n" + "## 🔮 Batch Prediction and Logging \n" ] }, { "cell_type": "code", "execution_count": null, - "id": "8fc906eb", + "id": "83ce0a67", "metadata": {}, "outputs": [], "source": [ "# Initialize batch scoring\n", "feature_view.init_batch_scoring(1)\n", "\n", - "# Get the batch data\n", - "batch_data = feature_view.get_batch_data()\n", + "# Get the untransformed and untransformed batch data for logging\n", + "untransformed_batch_data = feature_view.get_batch_data(transformed=False)\n", + "transformed_batch_data = feature_view.get_batch_data()\n", "\n", "# Drop the \"datetime\" column from the batch_data DataFrame along the specified axis (axis=1 means columns)\n", - "batch_data.drop([\"datetime\"], axis=1, inplace=True)\n", + "batch_data = transformed_batch_data.drop([\"datetime\"], axis=1)\n", "\n", "# Display the first 3 rows\n", "batch_data.head(3)" @@ -169,7 +170,7 @@ { "cell_type": "code", "execution_count": null, - "id": "2e58817d", + "id": "7b04a49c", "metadata": {}, "outputs": [], "source": [ @@ -180,9 +181,55 @@ "predictions[:5]" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "33593079", + "metadata": {}, + "outputs": [], + "source": [ + "# log both transformed and untransformed features\n", + "feature_view.log(untransformed_batch_data.head(1000), predictions[:1000], training_dataset_version=1, model=retrieved_model)\n", + "feature_view.log(transformed_features=transformed_batch_data.head(1000), predictions=predictions[:1000], training_dataset_version=1, model=retrieved_model)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84cc599a", + "metadata": {}, + "outputs": [], + "source": [ + "# stop the job materialization schedule and materialize log manually\n", + "feature_view.pause_logging()\n", + "feature_view.materialize_log(wait=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b4f15472", + "metadata": {}, + "outputs": [], + "source": [ + "# read untransformed log\n", + "feature_view.read_log().head(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3434555d", + "metadata": {}, + "outputs": [], + "source": [ + "# read transformed log\n", + "feature_view.read_log(transformed=True).head(3)" + ] + }, { "cell_type": "markdown", - "id": "ee5fcba9", + "id": "f2c6dd73", "metadata": {}, "source": [ "---\n", @@ -194,11 +241,19 @@ "\n", "Or documentation at ➡ https://docs.hopsworks.ai" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2c16ff8", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -212,7 +267,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.13" } }, "nbformat": 4, diff --git a/fraud_online/2_fraud_online_training_pipeline.ipynb b/fraud_online/2_fraud_online_training_pipeline.ipynb index 91ec9b68..90f5a113 100644 --- a/fraud_online/2_fraud_online_training_pipeline.ipynb +++ b/fraud_online/2_fraud_online_training_pipeline.ipynb @@ -117,7 +117,7 @@ ")\n", "\n", "# Select features for training dataset\n", - "selected_features = trans_fg.select_all().join(profile_online_fg.select_all())" + "selected_features = trans_fg.select_all().join(profile_online_fg.select_except([\"cc_num\"]))" ] }, { @@ -192,7 +192,8 @@ " version=1,\n", " query=selected_features,\n", " labels=[\"fraud_label\"],\n", - " transformation_functions=transformation_functions,\n", + " transformation_functions=[label_encoder(\"country\"), label_encoder(\"gender\")],\n", + " logging_enabled=True\n", ")" ] }, @@ -561,7 +562,7 @@ "\n", "class Predict(object):\n", "\n", - " def __init__(self):\n", + " def __init__(self, async_logger, model):\n", " \"\"\" Initializes the serving state, reads a trained model\"\"\" \n", " # Get feature store handle\n", " fs_conn = hsfs.connection()\n", @@ -573,24 +574,31 @@ " version=1,\n", " )\n", " \n", - " # Initialize serving\n", - " self.fv.init_serving(1)\n", + " # Initialize serving and async feature logging\n", + " self.fv.init_serving(1, feature_logger=async_logger)\n", "\n", " # Load the trained model\n", + " self.hopsworks_model = model\n", " self.model = joblib.load(os.environ[\"ARTIFACT_FILES_PATH\"] + \"/xgboost_fraud_online_model.pkl\")\n", " print(\"Initialization Complete\")\n", "\n", " def predict(self, inputs):\n", " \"\"\" Serves a prediction request usign a trained model\"\"\"\n", - " feature_vector = self.fv.get_feature_vector({\"cc_num\": inputs[0][0]})\n", + " untransformed_feature_vector = self.fv.get_feature_vector({\"cc_num\": inputs[0][0]}, transform=False)\n", + " feature_vector = self.fv.transform(untransformed_feature_vector)\n", " indexes_to_remove = [0,1,2]\n", - " feature_vector = [\n", + " feature_vector_model = [\n", " i \n", " for j, i \n", " in enumerate(feature_vector) \n", " if j not in indexes_to_remove\n", " ] \n", - " return self.model.predict(np.asarray(feature_vector).reshape(1, -1)).tolist() # Numpy Arrays are not JSON serializable" + " prediction = self.model.predict(np.asarray(feature_vector_model).reshape(1, -1)).tolist() # Numpy Arrays are not JSON serializable\n", + " self.fv.log(untransformed_features=[untransformed_feature_vector],\n", + " transformed_features=[feature_vector],\n", + " predictions=[prediction],\n", + " model=self.hopsworks_model)\n", + " return prediction" ] }, { @@ -745,7 +753,7 @@ "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" }, "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -759,7 +767,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.13" } }, "nbformat": 4, diff --git a/fraud_online/3_fraud_online_inference_pipeline.ipynb b/fraud_online/3_fraud_online_inference_pipeline.ipynb index 1642b079..e819ed5c 100644 --- a/fraud_online/3_fraud_online_inference_pipeline.ipynb +++ b/fraud_online/3_fraud_online_inference_pipeline.ipynb @@ -168,6 +168,61 @@ "predictions" ] }, + { + "cell_type": "markdown", + "id": "03f145ab", + "metadata": {}, + "source": [ + "## 🔮 Read Prediction Log" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e0a8788", + "metadata": {}, + "outputs": [], + "source": [ + "feature_view = fs.get_feature_view(\n", + " name='transactions_fraud_online_fv',\n", + " version=1,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2dedf684", + "metadata": {}, + "outputs": [], + "source": [ + "# stop the job materialization schedule and materialize log manually\n", + "feature_view.pause_logging()\n", + "feature_view.materialize_log(wait=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4450d0c2", + "metadata": {}, + "outputs": [], + "source": [ + "# read untransformed log\n", + "feature_view.read_log().head(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "445628bb", + "metadata": {}, + "outputs": [], + "source": [ + "# read transformed log\n", + "feature_view.read_log(transformed=True).head(3)" + ] + }, { "cell_type": "markdown", "id": "0b02e2bd", @@ -221,7 +276,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -235,7 +290,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.13" } }, "nbformat": 4,