- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Additional Information
GraphRAG Version: 1.0.0
Operating System: Ubuntu
Python Version: 3.12
Related Issues:
hope12122 added the bug (Something isn't working) and triage (default label assignment; indicates a new issue needs review by a maintainer) labels on Jan 15, 2025.
Describe the bug
Small files upload and index without any problem, but when I upload this book, which is fairly large (1088 KB), the upload reports success and GraphRAG can find the book, yet the file fails to load.
Steps to reproduce
Upload a larger book. For testing, I'm using a 1088 KB file with 585,696 characters, encoded in UTF-8.
Expected Behavior
Indexing of large books should be supported.
GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  audience: "https://cognitiveservices.azure.com/.default"
  api_base: https://.openai.azure.com
  api_version: 2024-02-15-preview
  organization: <organization_id>
  deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "vv"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
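As a side note on the `chunks` section above: `size: 1200` with `overlap: 100` describes a sliding token window. A rough sketch of that windowing logic, purely illustrative (this is not GraphRAG's actual implementation, and `chunk_tokens` is a made-up name):

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token sequence into windows of `size` tokens, each sharing
    its first `overlap` tokens with the end of the previous window."""
    assert size > overlap >= 0
    step = size - overlap
    # Stop once the remaining tail is fully covered by the previous window's tail.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

The overlap means an entity mention that straddles a chunk boundary still appears whole in at least one window, at the cost of re-processing 100 tokens per chunk.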
Logs and screenshots
10:26:28,71 graphrag.index.create_pipeline_config INFO skipping workflows
10:26:28,71 graphrag.index.run.run INFO Running pipeline
10:26:28,72 graphrag.storage.file_pipeline_storage INFO Creating file storage at /home/emotionalrag/graphrag/incremental_ragtest/output
10:26:28,72 graphrag.index.input.factory INFO loading input from root_dir=input
10:26:28,72 graphrag.index.input.factory INFO using file storage for input
10:26:28,73 graphrag.storage.file_pipeline_storage INFO search /home/emotionalrag/graphrag/incremental_ragtest/input for files matching .*.txt$
10:26:28,74 graphrag.index.input.text INFO found text files from input, found [('xizang_history.txt', {})]
10:26:28,80 graphrag.index.input.text WARNING Warning! Error loading file xizang_history.txt. Skipping...
10:26:28,80 graphrag.index.input.text INFO Found 1 files, loading 0
10:26:28,82 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_final_documents', 'extract_graph', 'compute_communities', 'create_final_entities', 'create_final_relationships', 'create_final_communities', 'create_final_nodes', 'create_final_text_units', 'create_final_community_reports', 'generate_text_embeddings']
10:26:28,82 graphrag.index.run.run INFO Final # of rows loaded: 0
10:26:28,238 graphrag.index.run.workflow INFO dependencies for create_base_text_units: []
10:26:28,243 datashaper.workflow.workflow INFO executing verb create_base_text_units
10:26:28,243 datashaper.workflow.workflow ERROR Error executing verb "create_base_text_units" in create_base_text_units: 'id'
Traceback (most recent call last):
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
result = await result
^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
output = await create_base_text_units(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
sort = documents.sort_values(by=["id"], ascending=[True])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
k = self._get_label_or_level_values(by[0], axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'id'
10:26:28,248 graphrag.callbacks.file_workflow_callbacks INFO Error executing verb "create_base_text_units" in create_base_text_units: 'id' details=None
10:26:28,248 graphrag.index.run.run ERROR error running workflow create_base_text_units
Traceback (most recent call last):
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/run.py", line 262, in run_pipeline
result = await _process_workflow(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/workflow.py", line 103, in _process_workflow
result = await workflow.run(context, callbacks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
timing = await self._execute_verb(node, context, callbacks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
result = await result
^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
output = await create_base_text_units(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
sort = documents.sort_values(by=["id"], ascending=[True])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
k = self._get_label_or_level_values(by[0], axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'id'
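The `KeyError: 'id'` at the end of the trace looks consistent with the earlier `Found 1 files, loading 0` and `Final # of rows loaded: 0` lines: if every input file is skipped, `create_base_text_units` receives an empty documents frame with no columns at all, so sorting by `id` raises. A minimal pandas sketch of that failure mode (my reading of the trace, not a confirmed root cause):

```python
import pandas as pd

# An empty frame, mimicking the state when all input files were skipped
# and zero rows were loaded.
documents = pd.DataFrame()

try:
    documents.sort_values(by=["id"], ascending=[True])
except KeyError as err:
    print(f"KeyError: {err}")  # same error as in the traceback
```

If that reading is right, the real problem is not the sort but the upstream `Error loading file xizang_history.txt. Skipping...` warning, and the pipeline could arguably fail earlier with a clearer "no input rows loaded" message.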