add metadata into text_chunks #1671

Closed
wants to merge 16 commits into from

dayesouza
Contributor

Description

Add a new metadata field that is prepended to each chunk. The metadata counts as tokens toward the max_tokens_per_chunk limit.
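To illustrate the token accounting described above, here is a minimal sketch, not the code from this PR. It assumes whitespace-separated words stand in for real tokens, and the function name `chunk_with_metadata` and its `line_delimiter` parameter are hypothetical. The point it shows is that the metadata prefix is repeated in every chunk and reduces the budget left for the text itself.

```python
def chunk_with_metadata(text, metadata, max_tokens_per_chunk, line_delimiter=".\n"):
    """Split text into chunks, prepending the same metadata block to each one.

    Assumption: a "token" is a whitespace-separated word. A real tokenizer
    would give different counts.
    """
    # Render metadata as "key: value" lines; empty metadata adds no prefix.
    if metadata:
        prefix = line_delimiter.join(f"{k}: {v}" for k, v in metadata.items())
        prefix += line_delimiter
    else:
        prefix = ""

    # The prefix consumes part of every chunk's token budget.
    prefix_tokens = len(prefix.split())
    budget = max_tokens_per_chunk - prefix_tokens
    if budget <= 0:
        raise ValueError("metadata alone fills max_tokens_per_chunk")

    words = text.split()
    return [
        prefix + " ".join(words[i:i + budget])
        for i in range(0, len(words), budget)
    ]


chunks = chunk_with_metadata("a b c d e f g", {"title": "Doc"}, max_tokens_per_chunk=5)
```

With a two-word prefix and a limit of five, each chunk has room for three words of text, so long metadata directly shrinks the usable chunk size.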

Proposed Changes

  • Removed unused columns (source_column, timestamp_column, timestamp_format, title_column, document_attribute_columns) from input_config.py and replaced them with a new metadata field.
  • Updated load_file functions in csv.py and text.py to handle the new metadata field.
  • Modified chunk_text.py and strategies.py to include metadata and line delimiters in chunking operations.
  • Added a new configuration option, GRAPHRAG_INPUT_DOCUMENT_METADATA, for specifying metadata columns, and documented it in the env_vars.md and yaml.md documentation files.
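As a rough sketch of what a single metadata field replacing several per-purpose columns could look like for CSV input, the example below gathers a configured list of columns into one dictionary per document. This is an illustration under assumptions, not the PR's `load_file` implementation: the function name `load_rows_with_metadata`, the output shape, and the column names are all hypothetical.

```python
import csv
import io


def load_rows_with_metadata(csv_text, text_column, metadata_columns):
    """Read CSV text and return one document per row with its metadata.

    Hypothetical helper: every column named in `metadata_columns` is copied
    into a single "metadata" dictionary instead of a dedicated field.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        documents.append(
            {
                "text": row[text_column],
                "metadata": {column: row[column] for column in metadata_columns},
            }
        )
    return documents


sample = "title,author,text\nA,Ann,hello world\n"
docs = load_rows_with_metadata(sample, "text", ["title", "author"])
```

Keeping metadata as one mapping means new columns can be carried along by changing configuration only, without adding a new field for each one.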

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

@dayesouza dayesouza changed the title from "Feat/metadata" to "add metadata into text_chunks" Jan 31, 2025
@dayesouza dayesouza closed this Feb 6, 2025