markdown text splitter #1098

Merged on Feb 13, 2025 (8 commits)

pelikhan (Member) commented Feb 6, 2025


Summary of Changes

  • ✂️ Markdown Chunking Utility Added: Introduced a new chunkMarkdown function that splits markdown files into token-bounded chunks without breaking the heading structure, so large markdown files can be processed within a token limit.

  • 🧪 Test Coverage: Added a test suite (mdchunk.test.ts) that validates chunkMarkdown, including edge cases such as empty markdown, nested headings, large sections, and backtracking logic.

  • 🚀 Global Utility Enhancement: Extended the installGlobals method to include the chunk method under the global MD interface, enabling markdown chunking with optional parameters like maxTokens and model.

  • 📄 New Sample Script: Added a sample script (mdchunk.genai.mjs) demonstrating the usage of the chunk method to process markdown files with token limits.

  • 🛠️ Integration with File and Token Utilities: Used resolveFileContent to resolve file content and resolveTokenEncoder to obtain the token encoder during markdown chunking.

  • 📝 Type Definition Update: Updated the MD interface in prompt_template.d.ts to include the new chunk method, ensuring proper type support for markdown chunking.

Together, these changes let large markdown files be split into chunks that respect a token limit while keeping each heading with its content.
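
To make the heading-aware, token-bounded splitting described above concrete, here is a minimal TypeScript sketch. It is not the code merged in this PR. The names `splitSections`, `chunkMarkdownSketch`, and `estimateTokens` are hypothetical, the token count is a crude word count standing in for a real model encoder, and the backtracking logic mentioned in the summary is omitted. A section larger than the limit is emitted whole rather than split further.

```typescript
// Hypothetical, simplified sketch; not the chunkMarkdown implementation from this PR.
interface Section {
  heading: string; // heading text, or "" for text before the first heading
  level: number; // 1-6 for headings, 0 for the preamble
  lines: string[]; // all lines of the section, heading line included
}

// Assumption: a whitespace word count stands in for a real token encoder.
const estimateTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Split markdown into sections, each starting at an ATX heading ("# ...").
// Heading-like lines inside fenced code blocks are not treated as headings.
function splitSections(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: "", level: 0, lines: [] };
  let inFence = false;
  for (const line of markdown.split(/\r?\n/)) {
    if (/^\s*(`{3}|~{3})/.test(line)) inFence = !inFence;
    const m = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      if (current.lines.length) sections.push(current);
      current = { heading: m[2], level: m[1].length, lines: [line] };
    } else {
      current.lines.push(line);
    }
  }
  if (current.lines.length) sections.push(current);
  return sections;
}

// Greedily pack whole sections into chunks of at most maxTokens (estimated).
function chunkMarkdownSketch(markdown: string, maxTokens: number): string[] {
  if (!markdown.trim()) return [];
  const chunks: string[] = [];
  let buffer: string[] = [];
  let used = 0;
  for (const section of splitSections(markdown)) {
    const text = section.lines.join("\n");
    const cost = estimateTokens(text);
    // Flush before overflowing, so a heading is never separated from its body.
    if (buffer.length && used + cost > maxTokens) {
      chunks.push(buffer.join("\n"));
      buffer = [];
      used = 0;
    }
    buffer.push(text);
    used += cost;
  }
  if (buffer.length) chunks.push(buffer.join("\n"));
  return chunks;
}
```

Because chunks are cut only at heading lines, joining the returned chunks with a newline reproduces the input (for input with `\n` line endings). A production version would presumably swap `estimateTokens` for the encoder resolved for the target model.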

AI-generated content by prd may be incorrect

pelikhan closed this Feb 8, 2025
pelikhan reopened this Feb 12, 2025
pelikhan merged commit ed6af76 into main Feb 13, 2025
14 checks passed
pelikhan deleted the mkchunk branch February 13, 2025 00:07