diff --git a/data-load.ipynb b/data-load.ipynb
index c6d8eca..65bf02b 100644
--- a/data-load.ipynb
+++ b/data-load.ipynb
@@ -8,29 +8,23 @@
"# Neo4j Generative AI - Data Loading\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neo4j-product-examples/genai-workshop/blob/main/data-load.ipynb)\n",
"\n",
- "This workshop will teach you to how to use Neo4j for Graph-Powered Retrieval-Augmented Generation (GraphRAG) to enhance GenAI and improve response quality for real-world applications.\n",
+ "This workshop will teach you how to use Neo4j for Graph-Powered Retrieval-Augmented Generation (GraphRAG) to enhance GenAI and improve response quality for real-world applications.\n",
"\n",
"GenAI, despite its potential, faces challenges like hallucination and lack of domain knowledge. GraphRAG addresses these issues by combining vector search with knowledge graphs and data science techniques. This integration helps improve context, semantic understanding, and personalization, making Large Language Models (LLMs) more effective for critical applications.\n",
"\n",
"We walk through an example that uses real-world customer and product data from a fashion, style, and beauty retailer. We show how you can use a knowledge graph to ground an LLM, enabling it to build tailored marketing content personalized to each customer based on their interests and shared purchase histories. We use Retrieval-Augmented Generation (RAG) to accomplish this, specifically leveraging not just vector search but also graph pattern matching and graph machine learning to provide more relevant personalized results to customers. We call this graph-powered RAG approach “GraphRAG” for short.\n",
"\n",
"This notebook walks through the first steps of the process, including:\n",
- "- Building the knowledge graph and generating text embeddings from scratch \n",
+ "- Building the knowledge graph\n",
+ "- Generating text embeddings from scratch\n",
"\n",
- "*If you would rather start from a database dump and skip data loading, please skip to [genai-workshop.ipynb](https://github.com/neo4j-product-examples/genai-workshop/blob/main/genai-workshop.ipynb)* \n",
+ "[genai-workshop.ipynb](https://github.com/neo4j-product-examples/genai-workshop/blob/main/genai-workshop.ipynb) contains the rest of the workshop, including:\n",
+ " - Vector search\n",
+ " - Graph patterns to improve semantic search\n",
+ " - Augmenting semantic search with graph data science\n",
+ " - Building an example LLM chain and demo app\n",
"\n",
- "\n",
- "\n",
- "Please also see the following companion notebooks to complete the workshop: \n",
- "\n",
- "- [genai-workshop.ipynb](https://github.com/neo4j-product-examples/genai-workshop/blob/main/genai-workshop.ipynb)\n",
- " - Vector search \n",
- " - Using graph patterns in Cypher to improve semantic search with context\n",
- " - Further augmenting semantic search with knowledge graph inference & graph data science \n",
- " - Building the LLM chain and demo app for generating content \n",
- " \n",
- "- [genai-example-app-only](https://github.com/neo4j-product-examples/genai-workshop/blob/main/genai-workshop-app-only.ipynb)\n",
- " - Building the LLM chain and demo app for generating content \n",
+ "If you would rather skip data loading and start from a database dump, you can use [this dump file](https://storage.googleapis.com/gds-training-materials/Version8_Jan2024/neo4j_genai_hnm.dump).\n",
" "
]
},
@@ -39,7 +33,7 @@
"id": "a527357f-a5e8-42c9-8966-f37846097c1d",
"metadata": {},
"source": [
- "### Some Logics\n",
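+ "### Some Logistics\n",
- "1. Run the pip install below to get the necessary dependencies. this can take a while. Then run the following cell to import relevant libraries\n",
+ "1. Run the pip install below to get the necessary dependencies. This can take a while. Then run the following cell to import the relevant libraries.\n",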
"2. You will need a Neo4j database environment with the [graph data science library](https://neo4j.com/docs/graph-data-science/current/installation) installed e.g. \n",
" - [AuraDS](https://neo4j.com/docs/aura/aurads/) \n",
@@ -54,12 +48,12 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"%%capture\n",
- "%pip install sentence_transformers langchain langchain-openai langchain_community openai tiktoken python-dotenv gradio graphdatascience altair neo4j_tools\n",
+ "%pip install sentence_transformers langchain langchain-openai langchain_community openai tiktoken python-dotenv gradio graphdatascience\n",
"%pip install \"vegafusion[embed]\""
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "code",
@@ -68,14 +62,12 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from dotenv import load_dotenv\n",
"import os\n",
"from graphdatascience import GraphDataScience\n",
- "from neo4j_tools import gds_db_load, gds_utils\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain.vectorstores.neo4j_vector import Neo4jVector\n",
"from langchain.graphs import Neo4jGraph\n",
@@ -83,7 +75,8 @@
"from langchain.schema import StrOutputParser\n",
"from langchain.schema.runnable import RunnableLambda\n",
"import gradio as gr"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "code",
@@ -92,12 +85,12 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"pd.set_option('display.max_rows', 10)\n",
"pd.set_option('display.max_colwidth', 500)\n",
"pd.set_option('display.width', 0)"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -114,7 +107,7 @@
"\n",
"To make this easy, you can write the credentials and env variables directly into the below cell.\n",
"\n",
- "Alternatively, if you like, you can use an environments file. This is a best practice for the future, but fine to skip if you're just exploring."
+ "Alternatively, if you like, you can use an environment file. This is a best practice for the future, but fine to skip for now."
]
},
{
@@ -124,7 +117,6 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"# Neo4j\n",
"NEO4J_URI = 'copy_paste_your_db_uri_here' #change this\n",
@@ -136,7 +128,8 @@
"LLM = 'gpt-4o'\n",
"os.environ['OPENAI_API_KEY'] = 'sk-...' #change this\n",
"OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "code",
@@ -145,7 +138,6 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"# You can skip this cell if not using a ws.env file - alternative to above\n",
"from dotenv import load_dotenv\n",
@@ -163,7 +155,8 @@
" # AI\n",
" LLM = 'gpt-4o'\n",
" OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -202,7 +195,6 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"# Use Neo4j URI and credentials according to our setup\n",
"gds = GraphDataScience(\n",
@@ -212,7 +204,8 @@
"\n",
"# Necessary if you enabled Arrow on the db - this is true for AuraDS\n",
"gds.set_database(\"neo4j\")"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -229,21 +222,10 @@
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'2.9.0+73'"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
"source": [
"gds.version()"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -254,7 +236,7 @@
"This workshop will leverage the [H&M Personalized Fashion Recommendations Dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data), a sample of real customer purchase data that includes rich information around products including names, types, descriptions, department sections, etc.\n",
"\n",
"*Bonus!*\n",
- "The data we use is a slightly sampled and preformatted version of the kaggle data. If you are interested in what we did, you can find the details [here](https://github.com/neo4j-product-examples/genai-workshop/blob/main/data-prep.ipynb)"
+ "The data we use is a sampled and preformatted version of the Kaggle data. If you are interested in what we did, you can find the details [here](https://github.com/neo4j-product-examples/genai-workshop/blob/main/data-prep.ipynb)."
]
},
{
@@ -264,7 +246,6 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"import pandas as pd\n",
"\n",
@@ -274,7 +255,8 @@
"article_df = pd.read_csv('https://storage.googleapis.com/neo4j-workshop-data/genai-hm/article.csv')\n",
"customer_df = pd.read_csv('https://storage.googleapis.com/neo4j-workshop-data/genai-hm/customer.csv')\n",
"transaction_df = pd.read_csv('https://storage.googleapis.com/neo4j-workshop-data/genai-hm/transaction.csv')"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -293,53 +275,14 @@
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- "Empty DataFrame\n",
- "Columns: []\n",
- "Index: []"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
"source": [
"# create constraints - one uniqueness constraint for each node label\n",
"gds.run_cypher('CREATE CONSTRAINT unique_department_no IF NOT EXISTS FOR (n:Department) REQUIRE n.departmentNo IS UNIQUE')\n",
"gds.run_cypher('CREATE CONSTRAINT unique_product_code IF NOT EXISTS FOR (n:Product) REQUIRE n.productCode IS UNIQUE')\n",
"gds.run_cypher('CREATE CONSTRAINT unique_article_id IF NOT EXISTS FOR (n:Article) REQUIRE n.articleId IS UNIQUE')\n",
"gds.run_cypher('CREATE CONSTRAINT unique_customer_id IF NOT EXISTS FOR (n:Customer) REQUIRE n.customerId IS UNIQUE')"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -361,7 +304,6 @@
"metadata": {
"tags": []
},
- "outputs": [],
"source": [
"from typing import Tuple, Union\n",
"from numpy.typing import ArrayLike\n",
@@ -453,7 +395,8 @@
" res = gds.run_cypher(query, params={'recs': recs})\n",
" cumulative_count += res.iloc[0, 0]\n",
" print(f'Loaded {cumulative_count:,} of {total:,} relationships')"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "code",
@@ -462,147 +405,6 @@
"metadata": {
"tags": []
},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "====== loading Department nodes ======\n",
- "staging 266 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "UNWIND $recs AS rec\n",
- "MERGE(n:Department {departmentNo: rec.departmentNo})\n",
- "SET n.departmentName = rec.departmentName, n.sectionNo = rec.sectionNo, n.sectionName = rec.sectionName\n",
- "RETURN count(n) AS nodeLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 266 of 266 nodes\n",
- "====== loading Article nodes ======\n",
- "staging 13,351 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "UNWIND $recs AS rec\n",
- "MERGE(n:Article {articleId: rec.articleId})\n",
- "SET n.prodName = rec.prodName, n.productTypeName = rec.productTypeName, n.graphicalAppearanceNo = rec.graphicalAppearanceNo, n.graphicalAppearanceName = rec.graphicalAppearanceName, n.colourGroupCode = rec.colourGroupCode, n.colourGroupName = rec.colourGroupName\n",
- "RETURN count(n) AS nodeLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 10,000 of 13,351 nodes\n",
- "Loaded 13,351 of 13,351 nodes\n",
- "====== loading Product nodes ======\n",
- "staging 8,044 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "UNWIND $recs AS rec\n",
- "MERGE(n:Product {productCode: rec.productCode})\n",
- "SET n.prodName = rec.prodName, n.productTypeNo = rec.productTypeNo, n.productTypeName = rec.productTypeName, n.productGroupName = rec.productGroupName, n.garmentGroupNo = rec.garmentGroupNo, n.garmentGroupName = rec.garmentGroupName, n.detailDesc = rec.detailDesc\n",
- "RETURN count(n) AS nodeLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 8,044 of 8,044 nodes\n",
- "====== loading Customer nodes ======\n",
- "staging 1,000 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "UNWIND $recs AS rec\n",
- "MERGE(n:Customer {customerId: rec.customerId})\n",
- "SET n.fn = rec.fn, n.active = rec.active, n.clubMemberStatus = rec.clubMemberStatus, n.fashionNewsFrequency = rec.fashionNewsFrequency, n.age = rec.age, n.postalCode = rec.postalCode\n",
- "RETURN count(n) AS nodeLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 1,000 of 1,000 nodes\n",
- "====== loading FROM_DEPARTMENT relationships ======\n",
- "staging 13,351 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "\tUNWIND $recs AS rec\n",
- " MATCH(s:Article {articleId: rec.articleId})\n",
- " MATCH(t:Department {departmentNo: rec.departmentNo})\n",
- "\tMERGE(s)-[r:FROM_DEPARTMENT]->(t)\n",
- "\tRETURN count(r) AS relLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 10,000 of 13,351 relationships\n",
- "Loaded 13,351 of 13,351 relationships\n",
- "====== loading VARIANT_OF relationships ======\n",
- "staging 13,351 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "\tUNWIND $recs AS rec\n",
- " MATCH(s:Article {articleId: rec.articleId})\n",
- " MATCH(t:Product {productCode: rec.productCode})\n",
- "\tMERGE(s)-[r:VARIANT_OF]->(t)\n",
- "\tRETURN count(r) AS relLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 10,000 of 13,351 relationships\n",
- "Loaded 13,351 of 13,351 relationships\n",
- "====== loading PURCHASED relationships ======\n",
- "staging 23,199 records\n",
- "\n",
- "Using This Cypher Query:\n",
- "```\n",
- "\tUNWIND $recs AS rec\n",
- " MATCH(s:Customer {customerId: rec.customerId})\n",
- " MATCH(t:Article {articleId: rec.articleId})\n",
- "\tMERGE(s)-[r:PURCHASED {txId: rec.txId}]->(t)\n",
- "\tSET r.tDat = rec.tDat, r.price = rec.price, r.salesChannelId = rec.salesChannelId\n",
- "\tRETURN count(r) AS relLoadedCount\n",
- "```\n",
- "\n",
- "Loaded 10,000 of 23,199 relationships\n",
- "Loaded 20,000 of 23,199 relationships\n",
- "Loaded 23,199 of 23,199 relationships\n",
- "CPU times: user 1.3 s, sys: 110 ms, total: 1.41 s\n",
- "Wall time: 11.2 s\n"
- ]
- },
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- "Empty DataFrame\n",
- "Columns: []\n",
- "Index: []"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
"source": [
"%%time\n",
"\n",
@@ -653,7 +455,8 @@
"MATCH(p:Product)\n",
"SET p.url = 'https://representative-domain/product/' + p.productCode\n",
"\"\"\")"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -664,7 +467,7 @@
"source": [
"## Creating Text Embeddings & Vector Index\n",
"\n",
- "Now the data has been loaded, we need to generate text embeddings on our product nodes to support Vector Search\n",
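+ "Now that the data has been loaded, we need to generate text embeddings on our product nodes to support vector search.\n",
"\n",
- "Neo4j has native integrations with popular embedding APIs (OpenAI, Vertex AI, Amazon Bedrock, Azure OpenAI) making it possible to generate embeddings with a single Cypher query using `genai.vector.*` operations*.\n",
+ "Neo4j has native integrations with popular embedding APIs (OpenAI, Vertex AI, Amazon Bedrock, Azure OpenAI) making it possible to generate embeddings with a single Cypher query using `genai.vector.*` operations.\n",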
"\n",
@@ -672,9 +475,7 @@
"1. Matches every Product that has a detailed description\n",
"2. Uses the `collect` aggregation function to batch products into a set number of partitions\n",
"3. Encodes the text property in batches using OpenAI `text-embedding-ada-002`\n",
- "4. Sets the embedding as a vector property using `db.create.setNodeVectorProperty`. This special function is used to set the properties as floats rather than double precision, which requires more space. This becomes important as these embedding vectors tend to be long, and the size can add up quickly.\n",
- "\n",
- "*NOTE: `genai.vector.*` operations are not available in Neo4j Community Edition. For Neo4j Community Edition you will need to generate embeddings externally and ingest them into Neo4j."
+ "4. Sets the embedding as a vector property using `db.create.setNodeVectorProperty`. This procedure stores the vector as floats rather than double-precision values, which require more space. This matters because embedding vectors tend to be long, so their size adds up quickly."
]
},
{
@@ -684,46 +485,6 @@
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- "Empty DataFrame\n",
- "Columns: []\n",
- "Index: []"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
"source": [
"#generate embeddings\n",
"\n",
@@ -736,7 +497,8 @@
" YIELD index, vector\n",
" CALL db.create.setNodeVectorProperty(nodes[index], \"textEmbedding\", vector)\n",
"} IN TRANSACTIONS OF 1 ROW''', params={'token':OPENAI_API_KEY, 'numberOfBatches':100})"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -755,46 +517,6 @@
"metadata": {
"tags": []
},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- "Empty DataFrame\n",
- "Columns: []\n",
- "Index: []"
- ]
- },
- "execution_count": 19,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
"source": [
"#create vector index\n",
"\n",
@@ -809,15 +531,16 @@
"\n",
"#wait for index to come online\n",
"gds.run_cypher('CALL db.awaitIndex(\"product_text_embeddings\", 300)')"
- ]
+ ],
+ "outputs": []
},
{
"cell_type": "markdown",
"id": "a0ba071f-a3d4-4959-8236-e4d39cec0c19",
"metadata": {},
"source": [
- "# Next Steps\n",
- "Try out a vector search with your newly made vector index, and learn how to enhance that search with graphs and graph data science! [genai-workshop.ipynb](https://github.com/neo4j-product-examples/genai-workshop/blob/main/genai-workshop.ipynb)\n"
+ "## Next Steps\n",
+ "Analyze the graph, try out a vector search, and learn how to enhance search with graphs and graph data science in [genai-workshop.ipynb](https://github.com/neo4j-product-examples/genai-workshop/blob/main/genai-workshop.ipynb).\n"
]
}
],