From 1902e555eec1f35a360adbf53fdd30363eb9a30f Mon Sep 17 00:00:00 2001
From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com>
Date: Tue, 17 Dec 2024 04:45:08 +0000
Subject: [PATCH] Docs update (b08c452)

---
 docs/docs/concepts.mdx     | 11 ++++++++++-
 docs/docs/how_to/index.mdx |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/docs/docs/concepts.mdx b/docs/docs/concepts.mdx
index 6cc0f135bff28..3b2668fabe0d8 100644
--- a/docs/docs/concepts.mdx
+++ b/docs/docs/concepts.mdx
@@ -1038,13 +1038,22 @@ Table columns:
 |----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Recursive | [RecursiveCharacterTextSplitter](/docs/how_to/recursive_text_splitter/), [RecursiveJsonSplitter](/docs/how_to/recursive_json_splitter/) | A list of user defined characters | | Recursively splits text. This splitting is trying to keep related pieces of text next to each other. This is the `recommended way` to start splitting text. |
 | HTML | [HTMLHeaderTextSplitter](/docs/how_to/HTML_header_metadata_splitter/), [HTMLSectionSplitter](/docs/how_to/HTML_section_aware_splitter/) | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
-| Markdown | [MarkdownHeaderTextSplitter](/docs/how_to/markdown_header_metadata_splitter/), | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
+| Markdown | [MarkdownHeaderTextSplitter](/docs/how_to/markdown_header_metadata_splitter/), [ExperimentalMarkdownSyntaxTextSplitter](/docs/how_to/experimental_markdown_syntax_text_splitter/) | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. The `ExperimentalMarkdownSyntaxTextSplitter` retains the original whitespace and formatting, addressing issues with code blocks and nested lists. |
 | Code | [many languages](/docs/how_to/code_splitter/) | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
 | Token | [many classes](/docs/how_to/split_by_token/) | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
 | Character | [CharacterTextSplitter](/docs/how_to/character_text_splitter/) | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
 | Semantic Chunker (Experimental) | [SemanticChunker](/docs/how_to/semantic-chunker/) | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) |
 | Integration: AI21 Semantic | [AI21SemanticTextSplitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter/) | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. |
+#### Markdown
+
+LangChain provides specialized text splitters for Markdown documents. These splitters are designed to handle Markdown-specific syntax and preserve the structure of the document.
+
+- **MarkdownHeaderTextSplitter**: Splits text based on Markdown headers, adding relevant information about where each chunk came from.
+- **ExperimentalMarkdownSyntaxTextSplitter**: Retains the original whitespace and formatting, addressing issues with code blocks and nested lists.
+
+For guidance on using these splitters, refer to the [how-to guides](/docs/how_to/#text-splitters).
+
 ### Evaluation
diff --git a/docs/docs/how_to/index.mdx b/docs/docs/how_to/index.mdx
index b481805eaafaf..90e493727c4d1 100644
--- a/docs/docs/how_to/index.mdx
+++ b/docs/docs/how_to/index.mdx
@@ -134,6 +134,7 @@ What LangChain calls [LLMs](/docs/concepts/#llms) are older forms of language mo
 - [How to: split by character](/docs/how_to/character_text_splitter)
 - [How to: split code](/docs/how_to/code_splitter)
 - [How to: split Markdown by headers](/docs/how_to/markdown_header_metadata_splitter)
+- [How to: split Markdown with experimental syntax retention](/docs/how_to/experimental_markdown_syntax_text_splitter)
 - [How to: recursively split JSON](/docs/how_to/recursive_json_splitter)
 - [How to: split text into semantic chunks](/docs/how_to/semantic-chunker)
 - [How to: split by tokens](/docs/how_to/split_by_token)
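
Reviewer note: the patch describes header-based splitting as "adding relevant information about where each chunk came from". The sketch below illustrates that idea in plain Python so the behavior is concrete. It is not LangChain's implementation and does not use its API. The function name `split_markdown_by_headers`, the `max_level` parameter, and the `h1`/`h2` metadata keys are hypothetical, chosen only for this illustration. It also skips header detection inside fenced code blocks, which is one of the problem areas the patch mentions.

```python
import re
from typing import Dict, List, Tuple

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Hypothetical helper for illustration only; not part of LangChain.
def split_markdown_by_headers(
    text: str, max_level: int = 2
) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown at ATX headers, returning (metadata, content) pairs.

    The metadata records the headers a chunk sits under, for example
    {"h1": "Guide", "h2": "Setup"}.
    """
    header_re = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
    chunks: List[Tuple[Dict[str, str], str]] = []
    current_headers: Dict[str, str] = {}
    buffer: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the buffered lines as one chunk, skipping empty chunks.
        content = "\n".join(buffer).strip()
        if content:
            chunks.append((dict(current_headers), content))
        buffer.clear()

    for line in text.splitlines():
        # Track fenced code blocks so "# comment" lines inside them are not headers.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            buffer.append(line)
            continue
        match = None if in_fence else header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for key in [k for k in current_headers if int(k[1:]) >= level]:
                del current_headers[key]
            current_headers[f"h{level}"] = match.group(2)
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each returned pair carries the header trail as metadata, so a downstream retriever can tell which section a chunk belongs to. For real use, the how-to guides linked in the patch cover the actual splitter classes.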