diff --git a/.buildinfo b/.buildinfo
index f990857270..465d4039d3 100644
--- a/.buildinfo
+++ b/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 39ac1536ba6b738ef4f304e6af7e643a
+config: 0df4f9145e5dec97b8895ea31e54e8f0
tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/_sources/introduction/getting_started.rst.txt b/_sources/introduction/getting_started.rst.txt
index e5cb710a17..bff137c778 100644
--- a/_sources/introduction/getting_started.rst.txt
+++ b/_sources/introduction/getting_started.rst.txt
@@ -173,7 +173,7 @@ of the table will be available in the element metadata under ``element.metadata.
table extraction is available, the ``partition`` function will extract tables automatically if they are present.
For PDFs and images, table extraction requires a relatively expensive call to a table recognition model, and so for those
document types table extraction is an option you need to enable. If you would like to extract tables for PDFs or images,
-pass in ``infer_table_structured=True``. Here is an example (Note: this example requires the ``pdf`` extra. This can be installed with ``pip install "unstructured[pdf]"``):
+pass in ``infer_table_structure=True``. Here is an example (Note: this example requires the ``pdf`` extra. This can be installed with ``pip install "unstructured[pdf]"``):
.. code:: python
diff --git a/_sources/introduction/key_concepts.rst.txt b/_sources/introduction/key_concepts.rst.txt
index 1c340d5cab..9f3bd8e5dc 100644
--- a/_sources/introduction/key_concepts.rst.txt
+++ b/_sources/introduction/key_concepts.rst.txt
@@ -6,7 +6,7 @@ Natural Language Processing (NLP) encompasses a broad spectrum of tasks and meth
Data Ingestion
--------------
-Unstructured's ``upstream connectors`` make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you'd like to read more on our upstream connectors, you can find details `here <../upstream_connectors.html>`__.
+Unstructured's ``upstream connectors`` make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you'd like to read more on our upstream connectors, you can find details `here
diff --git a/introduction/getting_started.html b/introduction/getting_started.html
index 88beaa42e9..6ae507a269 100644
--- a/introduction/getting_started.html
+++ b/introduction/getting_started.html
@@ -7,7 +7,7 @@
- Getting Started - Unstructured 0.10.19 documentation
+ Getting Started - Unstructured 0.10.20 documentation
@@ -561,7 +561,7 @@ Tables
partition function will extract tables automatically if they are present.
For PDFs and images, table extraction requires a relatively expensive call to a table recognition model, and so for those
document types table extraction is an option you need to enable. If you would like to extract tables for PDFs or images,
-pass in infer_table_structured=True. Here is an example (Note: this example requires the pdf extra. This can be installed with pip install "unstructured[pdf]"):
+pass in infer_table_structure=True. Here is an example (Note: this example requires the pdf extra. This can be installed with pip install "unstructured[pdf]"):
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/layout-parser-paper.pdf"
diff --git a/introduction/key_concepts.html b/introduction/key_concepts.html
index 776bcbf2a5..4d53fc7104 100644
--- a/introduction/key_concepts.html
+++ b/introduction/key_concepts.html
@@ -7,7 +7,7 @@
- Key Concepts - Unstructured 0.10.19 documentation
+ Key Concepts - Unstructured 0.10.20 documentation
@@ -431,7 +431,7 @@ Key Concepts
Data Ingestion
-Unstructured’s upstream connectors make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you’d like to read more on our upstream connectors, you can find details here.
+Unstructured’s upstream connectors make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you’d like to read more on our upstream connectors, you can find details here.
Data Preprocessing
Before the core analysis, raw data often requires significant preprocessing:
@@ -473,7 +473,7 @@ Retrieval Augmented Generation
-Data ingestion: The first step is acquiring data from your relevant sources. At Unstructured we make this super easy with our data connectors.
+Data ingestion: The first step is acquiring data from your relevant sources. At Unstructured we make this super easy with our data connectors.
Data preprocessing and cleaning: Once you’ve identified and collected your data sources a good practice is to remove any unnecessary artifacts within the dataset. At Unstructured we have a variety of different tools to remove unneccesary elements. Found here
Chunking: The next step is to break your text down into digestable pieces for your LLM to be able to consume. LangChain, Llama Index and Haystack offer chunking funcionalities.
Embedding: After chunking, you will need to convert the text into a numerical representation (vector embedding) that a LLM can understand. OpenAI, Cohere, and Hugging Face all offer embedding models.
diff --git a/introduction/overview.html b/introduction/overview.html
index 4f9465bee4..0d3dc332ee 100644
--- a/introduction/overview.html
+++ b/introduction/overview.html
@@ -7,7 +7,7 @@
- Overview - Unstructured 0.10.19 documentation
+ Overview - Unstructured 0.10.20 documentation
diff --git a/metadata.html b/metadata.html
index c9afe719bb..3f0d9f9893 100644
--- a/metadata.html
+++ b/metadata.html
@@ -7,7 +7,7 @@
- Metadata - Unstructured 0.10.19 documentation
+ Metadata - Unstructured 0.10.20 documentation
@@ -493,16 +493,11 @@ Common Metadata Fields
Tags on text that is emphasized in the original document
-num_characters
-The number of characters used
-for max_characters in add_chunking_strategy
-Used for chunking.
-
-is_continuation
+is_continuation
True if element is a continuation of a previous element
Only relevant for chunking, if an element was divided into two due to max_characters.
-detection_class_prob
+detection_class_prob
Detection model class probabilities
From unstructured-inference, hi-res strategy.
diff --git a/search.html b/search.html
index d96a810150..e84ee089c1 100644
--- a/search.html
+++ b/search.html
@@ -5,7 +5,7 @@
- Search - Unstructured 0.10.19 documentation
+ Search - Unstructured 0.10.20 documentation
diff --git a/searchindex.js b/searchindex.js
index d1bc1e7ed4..41b389b580 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"docnames": ["api", "best_practices", "best_practices/models", "best_practices/strategies", "bricks", "bricks/chunking", "bricks/cleaning", "bricks/embedding", "bricks/extracting", "bricks/partition", "bricks/staging", "destination_connectors", "destination_connectors/azure_cognitive_search", "destination_connectors/delta_table", "examples", "index", "installation/docker", "installation/full_installation", "installing", "integrations", "introduction", "introduction/getting_started", "introduction/key_concepts", "introduction/overview", "metadata", "source_connectors", "source_connectors/airtable", "source_connectors/azure", "source_connectors/biomed", "source_connectors/box", "source_connectors/confluence", "source_connectors/delta_table", "source_connectors/discord", "source_connectors/dropbox", "source_connectors/elasticsearch", "source_connectors/github", "source_connectors/gitlab", "source_connectors/google_cloud_storage", "source_connectors/google_drive", "source_connectors/jira", "source_connectors/local_connector", "source_connectors/notion", "source_connectors/onedrive", "source_connectors/outlook", "source_connectors/reddit", "source_connectors/s3", "source_connectors/salesforce", "source_connectors/sharepoint", "source_connectors/slack", "source_connectors/wikipedia"], "filenames": ["api.rst", "best_practices.rst", "best_practices/models.rst", "best_practices/strategies.rst", "bricks.rst", "bricks/chunking.rst", "bricks/cleaning.rst", "bricks/embedding.rst", "bricks/extracting.rst", "bricks/partition.rst", "bricks/staging.rst", "destination_connectors.rst", "destination_connectors/azure_cognitive_search.rst", "destination_connectors/delta_table.rst", "examples.rst", "index.rst", "installation/docker.rst", "installation/full_installation.rst", "installing.rst", "integrations.rst", "introduction.rst", "introduction/getting_started.rst", "introduction/key_concepts.rst", "introduction/overview.rst", "metadata.rst", "source_connectors.rst", 
"source_connectors/airtable.rst", "source_connectors/azure.rst", "source_connectors/biomed.rst", "source_connectors/box.rst", "source_connectors/confluence.rst", "source_connectors/delta_table.rst", "source_connectors/discord.rst", "source_connectors/dropbox.rst", "source_connectors/elasticsearch.rst", "source_connectors/github.rst", "source_connectors/gitlab.rst", "source_connectors/google_cloud_storage.rst", "source_connectors/google_drive.rst", "source_connectors/jira.rst", "source_connectors/local_connector.rst", "source_connectors/notion.rst", "source_connectors/onedrive.rst", "source_connectors/outlook.rst", "source_connectors/reddit.rst", "source_connectors/s3.rst", "source_connectors/salesforce.rst", "source_connectors/sharepoint.rst", "source_connectors/slack.rst", "source_connectors/wikipedia.rst"], "titles": ["Unstructured API", "Best Practices", "Models", "Strategies", "Bricks", "Chunking", "Cleaning", "Embedding", "Extracting", "Partitioning", "Staging", "Destination Connectors", "Azure Cognitive Search", "Delta Table", "Examples", "Unstructured Core Library", "Docker Installation", "Full Installation", "Installation", "Integrations", "Introduction", "Getting Started", "Key Concepts", "Overview", "Metadata", "Source Connectors", "Airtable", "Azure", "Biomed", "Box", "Confluence", "Delta Table", "Discord", "Dropbox", "Elasticsearch", "Github", "Gitlab", "Google Cloud Storage", "Google Drive", "Jira", "Local", "Notion", "One Drive", "Outlook", "Reddit", "S3", "Salesforce", "Sharepoint", "Slack", "Wikipedia"], "terms": {"try": [0, 9, 20, 21], "our": [0, 10, 11, 14, 16, 19, 20, 22, 23, 25], "host": [0, 9, 10, 15, 19, 20, 23], "It": [0, 2, 9, 10, 17, 19, 24], "": [0, 2, 6, 8, 10, 16, 17, 18, 19, 20, 22, 23, 24, 47], "freeli": 0, "avail": [0, 1, 2, 3, 9, 14, 17, 19, 20, 21, 24], "ani": [0, 2, 7, 9, 10, 12, 13, 17, 19, 20, 21, 22], "list": [0, 2, 5, 6, 7, 8, 9, 10, 12, 13, 19, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 
44, 45, 46, 47, 48, 49], "abov": [0, 9, 10, 20, 21, 24], "thi": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 17, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "i": [0, 2, 3, 5, 6, 7, 8, 9, 10, 14, 15, 17, 19, 20, 21, 22, 23, 24, 42], "easiest": [0, 9, 20, 23], "wai": [0, 2, 3, 9, 19, 20, 23], "get": [0, 6, 9, 10, 14, 17, 19, 23], "start": [0, 2, 5, 10, 14, 16, 23, 24, 42, 48], "all": [0, 3, 6, 8, 9, 10, 12, 13, 15, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "you": [0, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "need": [0, 2, 6, 7, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "an": [0, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17, 19, 20, 21, 22, 23, 24], "kei": [0, 7, 9, 10, 12, 19, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "can": [0, 2, 3, 6, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "your": [0, 3, 6, 9, 10, 11, 12, 13, 14, 15, 17, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "here": [0, 2, 6, 8, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "now": [0, 20, 21, 24], "todai": 0, "quick": 0, "exampl": [0, 2, 5, 6, 7, 8, 9, 10, 13, 15, 16, 17, 19, 20, 21, 24, 31, 32, 40], "shell": [0, 12, 13, 16, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "python": [0, 9, 12, 13, 14, 20, 21, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "curl": [0, 14, 17], "x": [0, 14], "post": [0, 14, 17, 20, 22, 24, 44], "http": [0, 5, 7, 9, 10, 12, 14, 17, 30, 34, 36, 39, 42, 47], "io": [0, 9, 16, 19, 30, 35, 39], "gener": [0, 4, 5, 7, 9, 10, 14, 19, 23], "v0": [0, 9, 17, 36], "h": [0, 14], "accept": [0, 2, 9, 10, 12, 13, 19, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "applic": [0, 6, 9, 16, 20, 22, 23, 24], "json": [0, 4, 10, 12, 14, 19, 24], "content": [0, 4, 6, 9, 10, 19, 20, 21, 24, 36], "multipart": 0, "form": [0, 19, 24], "data": [0, 4, 6, 10, 11, 12, 13, 14, 15, 16, 17, 19, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "f": [0, 9, 10, 14, 17, 20, 21], "sampl": [0, 2, 10, 13, 14, 31], "doc": [0, 2, 9, 10, 14, 16, 17, 19, 20, 21, 23, 24, 40], "famili": 0, "dai": [0, 10], "eml": [0, 8, 9, 14, 20, 21, 24], "jq": [0, 34], "c": [0, 17], "less": 0, "r": [0, 6, 8, 9, 17, 24, 44], "import": [0, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "request": [0, 9], "url": [0, 5, 9, 10, 12, 14, 17, 24, 27, 29, 30, 33, 34, 35, 36, 37, 39, 42, 45], "header": [0, 9, 20, 21, 24], "auto": [0, 3, 9, 20, 21], "file_path": 0, "path": [0, 9, 10, 14, 17, 19, 24, 28, 38, 40, 42, 46, 47], "To": [0, 2, 6, 7, 9, 12, 14, 16, 17, 19, 20, 21], "file_data": 0, "open": [0, 9, 10, 14, 17, 19, 20, 21, 23, 49], "rb": [0, 9, 14, 20, 21], "respons": [0, 9, 14, 20, 22], "close": [0, 10], "json_respons": 0, "below": [0, 6, 9, 10, 14, 16, 18, 20, 21, 24], "find": [0, 9, 12, 13, 14, 16, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "more": [0, 6, 8, 9, 10, 12, 13, 15, 19, 20, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "comprehens": [0, 17], 
"overview": [0, 1], "capabl": [0, 20, 23], "For": [0, 6, 8, 9, 10, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "detail": [0, 10, 19, 20, 22, 23, 24], "inform": [0, 1, 3, 6, 8, 9, 10, 12, 13, 15, 17, 19, 20, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "schema": [0, 10], "refer": [0, 16], "document": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 17, 19, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "note": [0, 2, 10, 12, 13, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "also": [0, 5, 6, 8, 9, 10, 14, 16, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "check": [0, 6, 8, 9, 10, 12, 13, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "section": [0, 4, 5, 6, 8, 9, 12, 14, 17, 19, 20, 21, 22, 24], "categori": [0, 20, 21, 24, 46], "plaintext": 0, "html": [0, 5, 6, 9, 14, 15, 19, 20, 21, 24], "md": [0, 9, 17], "msg": [0, 9, 17, 24], "rst": [0, 9, 17], "rtf": [0, 9, 17, 20, 21], "txt": [0, 6, 8, 9, 14, 16, 19], "jpeg": 0, "png": [0, 3, 9], "csv": [0, 4, 9, 10, 17, 19], "docx": [0, 9, 14, 17, 24], "epub": [0, 9, 10, 17, 20, 21, 24], "odt": [0, 9, 17], "ppt": [0, 9, 17, 24], "pptx": [0, 9, 14, 17], "tsv": [0, 9, 17], "xlsx": [0, 9, 14, 17, 24], "current": [0, 9, 14, 20, 21], "pipelin": [0, 2, 10, 14, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "recogn": [0, 9], "choos": [0, 3, 6, 9, 17, 20, 22], "relev": [0, 20, 22, 24], "partit": [0, 2, 3, 4, 5, 6, 7, 10, 15, 16, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "function": [0, 2, 3, 4, 5, 6, 8, 9, 10, 14, 19, 20, 
21], "process": [0, 4, 6, 9, 10, 11, 12, 13, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "when": [0, 2, 5, 6, 9, 10, 14, 20, 21, 22, 24], "element": [0, 2, 3, 4, 5, 6, 7, 9, 10, 14, 16, 19, 22, 24], "ar": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 17, 19, 20, 21, 22, 24, 25], "from": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 17, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "mai": [0, 9, 10, 19, 20, 21, 24], "bound": [0, 24], "box": [0, 17, 24, 25], "well": [0, 14], "set": [0, 2, 5, 6, 7, 8, 9, 12, 14, 20, 23, 24, 45], "true": [0, 5, 6, 8, 9, 10, 12, 17, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "add": [0, 20, 21], "field": [0, 8, 9, 12], "layout": [0, 2, 3, 9, 10, 16, 17, 20, 21, 24], "parser": [0, 2, 6, 9, 10, 16, 17, 20, 21], "paper": [0, 2, 9, 10, 16, 17, 20, 21], "specifi": [0, 2, 3, 5, 6, 8, 9, 10, 16, 19, 24], "decod": [0, 12, 13], "text": [0, 2, 3, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 17, 19, 21, 24], "input": [0, 5, 6, 8, 9, 10, 19, 40], "If": [0, 2, 5, 6, 8, 9, 10, 14, 16, 17, 20, 21, 22, 23, 24], "valu": [0, 5, 9, 10, 14, 19, 20, 22, 24], "provid": [0, 1, 2, 9, 17, 20, 22], "utf": [0, 6], "8": [0, 6, 17], "fake": [0, 9, 16, 20, 21], "power": [0, 9, 10, 15], "point": [0, 6, 8, 9, 12, 14, 17, 24], "utf_8": 0, "what": [0, 9, 19, 20, 23], "ocr_languag": [0, 9], "kwarg": [0, 3, 5, 6, 8, 9, 10, 14, 24], "see": [0, 5, 6, 7, 9, 10, 11, 14, 17, 19, 20, 21, 24, 25], "tesseract": [0, 9, 20, 21], "full": [0, 6, 9, 10, 12, 13, 18, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "instal": [0, 9, 12, 13, 14, 15, 19, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "instruct": [0, 9, 14, 15, 16, 17, 20, 21], "onli": [0, 9, 10, 16, 
17, 20, 21, 22, 24, 47], "appli": [0, 6, 9, 17, 24], "alreadi": [0, 9, 13], "english": [0, 9], "korean": [0, 9], "ocr_onli": [0, 3, 9], "eng": [0, 9], "kor": [0, 9], "By": [0, 6, 8, 9, 10, 17, 20, 21, 22], "default": [0, 3, 5, 6, 8, 9, 10, 17, 20, 21, 42], "result": [0, 2, 9, 10, 14, 15, 17, 19, 20, 22, 24], "output_format": 0, "pass": [0, 2, 6, 9, 14, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "include_page_break": [0, 9], "includ": [0, 3, 5, 6, 7, 9, 10, 14, 15, 17, 19, 20, 21, 22, 24], "pagebreak": [0, 9, 20, 21], "four": 0, "fast": [0, 2, 3, 9, 10, 12, 16, 17], "work": [0, 6, 8, 9, 17], "do": [0, 5, 6, 9, 10, 14, 17], "have": [0, 9, 10, 12, 13, 14, 17, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "embed": [0, 4, 12, 19], "On": [0, 17], "hand": [0, 20, 23], "better": 0, "choic": [0, 10, 14], "within": [0, 2, 9, 15, 20, 22, 24, 47], "achiev": [0, 20, 22], "greater": 0, "precis": 0, "Be": 0, "awar": 0, "take": [0, 6, 8, 9, 10, 17, 19, 20, 22], "20": 0, "time": [0, 2, 6, 8, 20, 22], "longer": 0, "compar": 0, "option": [0, 3, 6, 8, 9, 10, 12, 13, 14, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "make": [0, 9, 12, 15, 16, 19, 20, 21, 22], "The": [0, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24], "run": [0, 2, 7, 9, 10, 14, 16, 17, 19, 20, 21, 23], "through": [0, 9, 14, 15, 24], "ha": [0, 6, 9, 10, 19, 24], "difficulti": [0, 9], "order": [0, 9, 14, 17, 24], "multipl": [0, 9, 24], "column": [0, 9, 10, 13, 19], "recommend": [0, 3, 9, 10, 17, 20, 21, 22], "pleas": [0, 18], "fall": [0, 2, 9, 10, 20, 21], "back": [0, 2, 9, 10, 20, 21], "anoth": [0, 5, 6, 9, 17, 24], "best": [0, 6, 15], "world": [0, 20, 22], "determin": [0, 9, 14, 16, 20, 21], "mode": [0, 9, 16], "otherwis": [0, 9, 17], "argument": [0, 2, 9, 10], "hi_res_model_nam": 0, 
"shown": [0, 9, 12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "code": [0, 6, 7, 8, 9, 10, 20, 21], "block": [0, 7], "doe": [0, 5, 9, 10, 17, 19], "structur": [0, 3, 9, 10, 12, 13, 15, 20, 21, 22, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "ensur": [0, 2, 9, 20, 21, 22, 23], "pdf_infer_table_structur": [0, 9], "fals": [0, 6, 9, 17, 24, 41, 42, 47, 49], "becaus": [0, 9, 17, 24], "computation": 0, "expens": [0, 20, 21], "we": [0, 3, 6, 9, 10, 11, 14, 15, 16, 20, 21, 22, 25], "enabl": [0, 9, 20, 21, 22, 24], "disabl": [0, 9], "than": [0, 2, 17], "skip_infer_table_typ": 0, "want": [0, 9, 10, 19, 20, 21], "skip": [0, 14], "excel": [0, 6, 9], "which": [0, 2, 3, 5, 7, 9, 10, 14, 16, 19, 20, 22, 24], "jpg": [0, 3, 9, 14, 17], "xl": [0, 9], "don": [0, 9, 11, 25], "t": [0, 6, 9, 11, 14, 16], "empti": [0, 9, 10, 19], "xml_keep_tag": [0, 9], "retain": [0, 20, 22], "simpli": [0, 10], "self": [0, 9], "strongli": 0, "suggest": 0, "so": [0, 9, 10, 20, 21], "contain": [0, 6, 8, 9, 12, 13, 17, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "follow": [0, 2, 3, 4, 5, 7, 9, 10, 14, 17, 18, 19, 20, 21, 22, 24], "intend": [0, 1], "help": [0, 6, 9, 10, 12, 13, 14, 15, 17, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "up": [0, 6, 8, 10, 22, 23], "interact": 0, "machin": [0, 6, 8, 15, 16, 17, 19], "multi": [0, 16], "platform": [0, 7, 11, 12, 13, 15, 16, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "built": 0, "both": [0, 9, 16, 20, 23], "x86_64": [0, 16], "appl": [0, 16], "silicon": [0, 16], "hardwar": [0, 16], "pull": [0, 3, 14], "should": [0, 2, 4, 6, 10, 14, 16, 20, 21, 24], "download": [0, 2, 13, 14, 16, 28, 32, 48], "correspond": [0, 4, 9, 24], "architectur": [0, 16], "e": [0, 7, 9, 10, 16, 17, 
24, 47], "g": [0, 9, 16, 24, 37, 47], "linux": [0, 16], "amd64": [0, 16], "push": [0, 16], "main": [0, 5, 9, 10, 16, 17, 35], "each": [0, 7, 8, 9, 10, 19, 20, 22, 24], "short": [0, 16, 24], "commit": [0, 16], "hash": [0, 16, 20, 21], "fbc7a69": [0, 16], "0": [0, 5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 20, 21, 24, 36], "5": [0, 9, 10, 16, 20, 21], "dev1": [0, 16], "most": [0, 9, 10, 16, 19, 20, 21, 22, 23, 42], "recent": [0, 16], "latest": [0, 16], "leverag": [0, 2, 3], "repositori": [0, 16], "quai": 0, "onc": [0, 10, 16, 20, 22], "launch": [0, 20, 23], "web": 0, "app": [0, 42, 47], "localhost": [0, 9, 10, 14, 34], "8000": 0, "p": [0, 6], "d": [0, 6, 8, 9, 20, 21, 22, 24], "rm": 0, "name": [0, 2, 7, 8, 10, 12, 14, 16, 17, 24, 27, 34, 42, 44], "port": 0, "ll": [0, 9, 12, 13, 14, 17, 20, 21, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "fork": 0, "sart": 0, "one": [0, 5, 10], "A": [0, 6, 8, 9, 10, 20, 22, 24], "jupyt": 0, "notebook": [0, 19], "server": [0, 9, 24], "guid": [0, 2, 12, 13, 16, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "mani": [0, 6, 8, 20, 22], "o": [0, 2, 7, 9, 10, 12, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "depend": [0, 2, 9, 12, 13, 14, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "requir": [0, 16, 17, 18, 19, 20, 21, 22, 23], "abil": [0, 9], "desir": 0, "hit": [0, 14, 17], "directori": [0, 10, 14, 17, 20, 21, 24], "sever": [0, 4], "unstructur": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "offer": [1, 3, 19, 21, 22], "few": [1, 9, 10, 17, 20, 23], "strategi": [1, 2, 9, 10, 12, 24], "model": [1, 6, 7, 8, 9, 10, 17, 19, 21, 23, 24], "extract": [1, 2, 3, 4, 9, 15, 19, 20, 21], "These": [1, 3, 9, 
16, 20, 21, 22, 24], "guidelin": 1, "configur": 1, "optim": [1, 2, 15, 19, 20, 22], "high": [1, 10, 20, 22], "level": [1, 6, 8, 17, 24], "librari": [1, 2, 3, 4, 6, 9, 14, 16, 17, 18, 19, 20, 21, 23], "ocr": [2, 3, 9, 17, 20, 21], "base": [2, 3, 7, 9, 10, 17, 19, 20, 21, 24], "transform": [2, 6, 8, 10, 17, 19, 20, 23], "detect": [2, 5, 6, 7, 8, 9, 20, 21, 22, 24], "complex": 2, "predict": [2, 10, 14, 19], "type": [2, 3, 4, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "basic": [2, 3, 17, 20, 21, 23], "usag": [2, 3, 15, 17, 19, 20, 21], "filenam": [2, 3, 9, 10, 12, 14, 16, 17, 20, 21, 24, 26, 30, 34, 39, 43], "hi_r": [2, 3, 9], "model_nam": [2, 10], "chipper": 2, "defin": [2, 7, 19], "infer": [2, 3, 9, 20, 21, 24], "detectron2_onnx": 2, "comput": [2, 17, 20, 22, 24], "vision": 2, "facebook": 2, "ai": [2, 20, 22], "object": [2, 8, 9, 10, 12, 14, 19, 20, 21], "segment": [2, 9, 20, 22], "algorithm": [2, 20, 22], "onnx": 2, "runtim": 2, "fastest": 2, "yolox": 2, "singl": [2, 9, 12, 16, 17, 20, 22], "stage": [2, 4, 14, 15, 19, 20, 21], "real": [2, 10, 14, 20, 22], "detector": 2, "modifi": [2, 10, 24], "yolov3": 2, "darknet53": 2, "backbon": 2, "yolox_quant": 2, "faster": [2, 9], "its": [2, 5, 24], "speed": [2, 20, 22], "closer": 2, "detectron2": [2, 3, 9, 17], "beta": 2, "version": [2, 9, 12, 16, 24], "hous": 2, "imag": [2, 3, 5, 9, 17, 20, 21, 24], "visual": [2, 6, 8, 9], "understand": [2, 4, 20, 21, 22], "vdu": 2, "unstructured_hi_res_model_nam": 2, "environ": [2, 7, 17, 19], "variabl": [2, 7, 17], "There": [2, 4, 9, 10, 20, 22], "three": [2, 6, 8, 9], "store": [2, 9, 10, 12, 13, 14, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "pdf": [2, 3, 9, 10, 12, 14, 15, 16, 17, 20, 21, 23, 24, 28, 45], "partition_pdf": [2, 3, 10, 16, 17, 20, 21], "out_yolox": 2, "unstructured_infer": [2, 17], "get_model": 2, 
"documentlayout": 2, "from_fil": 2, "detection_model": 2, "util": [2, 10, 14, 20, 21], "zoo": 2, "In": [2, 6, 9, 10, 14, 20, 21, 23], "layoutpars": [2, 17], "variou": 2, "pre": [2, 4, 9, 10, 14, 19], "train": [2, 9, 20, 21, 22], "analysi": [2, 9, 19, 20, 22], "featur": [2, 9, 14], "unstructureddetectronmodel": 2, "class": [2, 7, 10, 14, 24], "faster_rcnn_r_50_fpn_3x": 2, "pretrain": [2, 20, 23], "doclaynet": 2, "But": 2, "differ": [2, 3, 4, 9, 10, 20, 21, 22], "construct": 2, "paramet": [2, 3, 6, 8, 9, 10, 19], "light": 2, "wrapper": 2, "around": [2, 10], "detectron2layoutmodel": 2, "same": [2, 9, 10, 14, 20, 21, 24], "seamlessli": 2, "integr": [2, 7, 15, 20, 22, 23], "custom": [2, 6, 9, 20, 22, 23], "wrap": 2, "unstructuredobjectdetectionmodel": 2, "act": 2, "intermediari": 2, "between": [2, 5, 6, 8, 9], "workflow": [2, 9, 10, 14, 15, 20, 21, 22, 23], "subclass": [2, 7], "incorpor": 2, "two": [2, 6, 8, 9, 10, 20, 21, 24], "vital": 2, "method": [2, 6, 7, 9, 14, 20, 21, 24], "design": [2, 15, 19, 20, 22, 23], "pil": [2, 17], "return": [2, 6, 7, 8, 9, 10, 14, 19, 24], "layoutel": 2, "facilit": [2, 20, 22], "commun": [2, 11, 12, 13, 25], "initi": [2, 10], "essenti": [2, 9, 20, 21], "load": [2, 19], "prep": 2, "guarante": [2, 20, 21], "readi": [2, 10, 14, 19, 20, 21], "incom": 2, "task": [2, 6, 10, 14, 15, 19, 20, 22, 23], "output": [2, 6, 8, 9, 10, 12, 13, 14, 15, 19, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "specif": [2, 9, 17, 20, 22, 23, 24], "smoothli": 2, "perform": 2, "varieti": [3, 19, 20, 22, 24], "preprocess": [3, 15, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "characterist": [3, 9], "tradit": [3, 20, 23], "nlp": [3, 6, 8, 10, 20, 22], "techniqu": [3, 20, 22], "quickli": [3, 10, 20, 22], "good": [3, 14, 20, 22], "file": [3, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "identifi": [3, 9, 20, 22], "us": [3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17, 19, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "advantag": [3, 9], "gain": [3, 6, 9], "addit": [3, 9, 10, 14, 17, 20, 21], "about": [3, 6, 7, 8, 9, 10, 12, 13, 15, 19, 20, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "case": [3, 4, 5, 6, 7, 9, 21, 42, 46], "highli": [3, 9, 20, 21, 22], "sensit": [3, 9], "correct": [3, 9], "classif": [3, 9, 10, 14, 20, 22], "optic": 3, "charact": [3, 5, 6, 8, 9, 20, 22, 24], "recognit": [3, 10, 20, 21], "brick": [3, 6, 8, 9, 10, 15, 19], "tabl": [3, 5, 9, 11, 17, 24, 25], "support": [3, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 23, 24], "partition_imag": 3, "ye": [3, 9, 10], "encod": [3, 6, 7, 9], "page": [3, 5, 9, 20, 21, 24, 41, 49], "break": [3, 5, 6, 8, 9, 20, 22], "languag": [3, 6, 8, 9, 19, 24], "max": [3, 9], "live": 4, "primari": [4, 9, 20, 21, 24], "public": [4, 37], "api": [4, 7, 9, 10, 14, 15, 19, 20, 21, 23], "clean": [4, 8, 15, 20, 22], "chunk": [4, 7, 10, 19, 24], "after": [4, 5, 8, 10, 14, 16, 17, 19, 20, 21, 22], "read": [4, 9, 20, 21, 22], "how": [4, 5, 6, 7, 9, 10, 14, 15, 16, 17, 19, 20, 21, 22, 24], "remov": [4, 6, 8, 20, 22, 24], "unwant": [4, 6], "prepar": [4, 6, 10, 19], "downstream": [4, 6, 10, 15, 19, 20, 22, 23, 24], "retriev": [4, 5, 7, 23], "augment": [4, 5, 7, 23], "rag": [4, 5, 7, 20, 22, 23], "metadata": [5, 7, 9, 10, 12, 15, 20, 21, 26, 30, 34, 39, 43], "split": [5, 6, 8, 10, 17, 19, 20, 21], "subsect": 5, "combin": [5, 9, 19], "look": [5, 6, 8, 9, 10, 14, 20, 21, 24], "presenc": 5, "titl": [5, 9, 10, 19, 20, 21, 24, 49], "new": [5, 6, 9, 10, 11, 13, 14, 19, 24, 25], "creat": [5, 10, 13, 14, 16, 17, 24], "non": [5, 6, 9], "alwai": 5, "own": [5, 6], "chang": [5, 6, 8, 10, 17, 24], "occur": [5, 8], "number": [5, 7, 8, 9, 10, 24], "come": [5, 10, 14, 19, 20, 
21], "attach": [5, 9, 10, 24], "instead": [5, 9, 10, 19, 20, 21, 22], "multipage_sect": 5, "allow": [5, 6, 9, 19, 20, 21, 24], "span": 5, "length": [5, 9, 10], "exce": 5, "new_after_n_char": 5, "1500": [5, 9], "possibl": [5, 17], "lenght": 5, "narrativetext": [5, 9, 10, 14, 19, 20, 21, 24], "similarli": 5, "under": [5, 7, 9, 14, 20, 21], "combine_under_n_char": 5, "thei": [5, 6, 9, 14, 20, 21, 22], "threshold": 5, "500": [5, 12], "seri": 5, "sometim": [5, 6], "happen": [5, 6], "listitem": [5, 9, 20, 21], "turn": [5, 9, 24], "off": [5, 9, 10], "behavior": [5, 6, 8, 9, 20, 21], "show": [5, 7, 9, 10, 12, 13, 14, 20, 21], "partition_html": [5, 6], "understandingwar": 5, "org": [5, 9, 17], "background": 5, "russian": [5, 6, 8], "offens": 5, "campaign": [5, 46], "assess": 5, "august": 5, "27": 5, "2023": [5, 9, 20, 23, 48], "print": [5, 6, 7, 9, 10, 12, 13, 20, 21, 24], "n": [5, 6, 8, 9, 20, 21], "80": 5, "As": [6, 9, 10, 20, 21], "part": [6, 10, 19, 20, 21, 22], "common": [6, 8, 9, 17, 42], "prior": [6, 9], "could": [6, 10, 20, 21], "impact": [6, 10], "qualiti": 6, "user": [6, 9, 10, 14, 20, 21, 22, 24, 30, 39, 42, 43, 44], "sanit": 6, "befor": [6, 8, 9, 20, 21, 22], "send": 6, "some": [6, 7, 9, 10, 14, 17, 20, 21, 22, 24], "automat": [6, 9, 20, 21], "philadelphia": [6, 9], "eagles\u00e2": 6, "x80": 6, "x99": 6, "victori": 6, "convert": [6, 9, 10, 14, 19, 22], "eagl": [6, 9], "snippet": [6, 20, 21], "cleaner": [6, 8, 9], "core": [6, 8, 9, 20, 21, 22], "without": [6, 9, 20, 21], "instanti": 6, "expect": [6, 10], "callabl": 6, "string": [6, 7, 8, 9, 10, 12, 24], "produc": [6, 9, 10], "invok": [6, 9], "sinc": 6, "just": [6, 19], "str": [6, 9, 10, 19, 20, 21], "easili": [6, 20, 23], "citat": 6, "re": [6, 8, 9, 12, 13, 14, 17, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "remove_cit": 6, "lambda": 6, "sub": [6, 8, 24], "1": [6, 7, 8, 9, 14, 17, 24], "3": [6, 8, 14, 20, 21, 24], "geoloc": 6, "combat": 6, 
"footag": 6, "confirm": [6, 20, 21], "dvorichn": 6, "area": [6, 10], "northwest": 6, "svatov": 6, "like": [6, 9, 10, 14, 16, 20, 21, 22, 24], "byte": [6, 14], "emoji": 6, "isn": 6, "hello": [6, 10], "\u00f0": 6, "x9f": 6, "x98": 6, "charset": 6, "sourc": [6, 8, 9, 10, 15, 19, 20, 21, 22, 23, 24, 49], "bullet": [6, 8, 24], "extra": [6, 9, 12, 13, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "whitespac": [6, 8], "dash": 6, "trail": [6, 8], "punctuat": [6, 8], "lowercas": 6, "extra_whitespac": 6, "trailing_punctu": 6, "item": [6, 24, 43], "1a": 6, "risk": [6, 10, 14, 19], "factor": [6, 20, 22], "begin": [6, 8, 9, 10], "appear": [6, 9, 10, 20, 21, 24], "love": [6, 8], "mors": 6, "handl": [6, 20, 21, 23], "special": [6, 10], "u2013": 6, "xa0": 6, "newlin": 6, "ascii": [6, 8], "x88thi": 6, "containsnon": 6, "alphanumer": [6, 8], "veri": [6, 8, 10], "b": 6, "postfix": 6, "match": [6, 12], "pattern": [6, 8, 10, 15, 17, 20, 21], "ignor": 6, "ignore_cas": 6, "strip": [6, 8], "end": [6, 8, 9, 10, 24, 48], "stop": [6, 8], "prefix": [6, 10], "lead": [6, 8, 46], "summari": [6, 14], "descript": [6, 10, 12, 24], "group": [6, 7, 8, 9], "togeth": [6, 8, 9], "paragraph": [6, 8, 9], "broken": [6, 8, 9, 20, 22], "line": [6, 8, 9, 16, 17, 24], "format": [6, 8, 9, 10, 14, 19, 20, 23], "purpos": [6, 8, 9], "line_split": [6, 8], "consid": [6, 8, 10, 17], "paragraph_split": [6, 8], "big": [6, 8, 9], "brown": [6, 8, 9], "fox": [6, 8, 9, 10], "wa": [6, 8, 9, 17, 24], "walk": [6, 8, 9], "down": [6, 8, 9, 10, 20, 21, 22], "lane": [6, 8, 9], "At": [6, 8, 9, 14, 17, 20, 22, 24], "met": [6, 8, 9], "bear": [6, 8, 9, 20, 22], "para_split_r": [6, 8], "compil": [6, 8, 17], "unicod": [6, 8], "quot": [6, 8], "replac": [6, 8], "x91": [6, 8], "replace_unicode_charact": [6, 8], "x93a": [6, 8], "x94": [6, 8], "x91a": [6, 8], "x92": [6, 8], "translat": [6, 8], "helsinki": [6, 8], "mt": [6, 8], "chines": [6, 8], "arab": [6, 8], "other": [6, 8, 15, 
16, 20, 21, 24], "source_lang": [6, 8], "letter": [6, 8], "langdetect": [6, 8], "target_lang": [6, 8], "target": [6, 8, 9], "en": [6, 8, 17], "m": [6, 8, 12, 20, 21], "berlin": [6, 8], "ich": [6, 8], "bin": [6, 8], "ein": [6, 8], "\u044f": [6, 8], "\u0442\u043e\u0436\u0435": [6, 8], "\u043c\u043e\u0436\u043d\u043e": [6, 8], "\u043f\u0435\u0440\u0435\u0432\u043e\u0434\u0430\u0442\u044c": [6, 8], "\u0440\u0443\u0441\u0441\u043a\u0438\u0439": [6, 8], "\u044f\u0437\u044b\u043a": [6, 8], "ru": [6, 8], "obtain": [7, 19], "abstract": 7, "implement": [7, 20, 22], "embeddingencod": 7, "langchain": [7, 20, 22], "openai": [7, 20, 22], "hood": [7, 9], "connect": [7, 11, 15, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "piec": [7, 20, 22], "embed_docu": 7, "receiv": [7, 8, 16], "updat": [7, 24], "attribut": [7, 9, 10, 14, 20, 21], "embed_queri": 7, "queri": [7, 10, 20, 22, 34, 44], "float": 7, "vector": [7, 10, 19], "given": [7, 20, 22, 24], "num_of_dimens": 7, "properti": [7, 24], "denot": 7, "dimens": [7, 12], "via": [7, 20, 22], "is_unit_vector": 7, "unit": [7, 20, 22], "openai_api_kei": 7, "abl": [7, 20, 21, 22], "visit": 7, "com": [7, 8, 9, 10, 17, 36, 42, 47], "account": [7, 10, 14, 27, 38, 46], "emb": 7, "embedding_encod": 7, "api_kei": [7, 9, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "sentenc": [7, 20, 21, 22], "2": [7, 8, 10, 12, 13, 14, 17, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "query_embed": 7, "date": [8, 20, 22, 24, 48], "timezon": 8, "datetim": 8, "abc": 8, "def": [8, 10], "local": [8, 9, 10, 14, 15, 19, 20, 21, 25], "ba23": 8, "58b5": 8, "2236": 8, "45g2": 8, "88h2": 8, "local2": 8, "25": 8, "mapi": 8, "id": [8, 12, 14, 24, 38, 41, 42, 43, 44, 47], "32": 8, "88": 8, "5467": 8, "123": 8, "fri": 8, "26": 8, "mar": 8, "2021": [8, 20, 22], "11": [8, 10], "04": [8, 14, 48], "09": 8, "1200": 8, 
"4": [8, 10, 12, 14], "9": [8, 10, 17], "tzinfo": 8, "timedelta": 8, "second": [8, 9, 10], "43200": 8, "email": [8, 9, 20, 21, 30, 39, 42, 43], "address": [8, 20, 21, 24], "me": 8, "10": [8, 9, 10, 24, 44], "01": [8, 9], "ipv4": 8, "ipv6": 8, "ip": 8, "none": [8, 9, 17, 32], "index": [8, 9, 19, 20, 22, 24, 34], "th": [8, 20, 21], "occurr": 8, "speaker": [8, 24], "fly": 8, "am": 8, "phone": 8, "215": 8, "867": 8, "5309": 8, "raw": [9, 20, 21, 22], "decid": 9, "keep": [9, 12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "particular": [9, 20, 21], "summar": [9, 20, 21], "interest": [9, 24], "call": [9, 14, 20, 21, 22], "libmag": [9, 17, 20, 21], "appropri": [9, 10, 12, 13, 16, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "where": [9, 10, 20, 21, 23, 24], "filetyp": [9, 12, 14, 20, 21, 24], "extens": [9, 20, 21, 22, 23], "rout": [9, 20, 21], "know": [9, 11, 25], "directli": [9, 10, 16, 19], "mail": [9, 24], "partition_eml": 9, "No": 9, "markdown": 9, "offic": [9, 10, 20, 21], "plain": [9, 20, 21], "grouper": 9, "restructur": 9, "rich": 9, "word": [9, 10, 20, 22], "xml": [9, 14, 15, 20, 21], "tag": [9, 16, 24], "addition": [9, 17, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "bypass": 9, "logic": 9, "content_typ": 9, "either": [9, 14], "join": [9, 10], "example_docs_directori": 9, "el": [9, 20, 21], "15": [9, 10], "reason": 9, "unnecessari": [9, 20, 22], "program": 9, "fewer": 9, "least": [9, 10, 20, 21], "denomin": 9, "certain": [9, 16], "learn": [9, 14, 15, 19], "www": 9, "cnn": 9, "30": [9, 10], "sport": 9, "empir": 9, "state": [9, 10, 20, 22], "build": [9, 17, 20, 22], "green": 9, "spt": 9, "intl": 9, "simplest": [9, 19, 20, 21], "attempt": 9, "control": 9, "accur": [9, 20, 22, 24], "add_paragraph": 9, "style": 9, "head": [9, 20, 23], "my": [9, 10, 17, 24], "first": [9, 10, 12, 13, 14, 17, 20, 22, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "thought": 9, "bodi": 9, "normal": 9, "save": [9, 10, 20, 21], "mydoc": 9, "remot": [9, 10, 12, 24, 27, 29, 33, 37, 45], "forc": 9, "treat": 9, "mime": [9, 17], "conjunct": [9, 17], "ssl_verifi": 9, "whether": 9, "ssl": 9, "verif": 9, "githubusercont": 9, "licens": 9, "text_as_html": [9, 12, 20, 21, 24], "represent": [9, 20, 21, 22, 24], "stanlei": 9, "cup": 9, "microsoft": [9, 17, 47], "partiton_doc": 9, "libreoffic": [9, 20, 21], "footer": [9, 20, 21, 24], "per": [9, 10], "msft": 9, "header_footer_typ": [9, 12, 24], "indic": [9, 10, 24], "valid": [9, 10, 19, 24], "first_pag": [9, 24], "even_pag": 9, "present": [9, 14, 17, 20, 21], "insert": 9, "render": 9, "even": [9, 10, 20, 22], "them": [9, 16, 20, 21], "export": 9, "client": [9, 10, 42, 43, 44, 47], "outlook": [9, 17, 24, 25], "gmail": 9, "content_sourc": 9, "respect": [9, 16], "sender": [9, 24], "recipi": [9, 24], "etc": 9, "include_head": 9, "must": [9, 10, 17], "tupl": 9, "max_partit": 9, "maximum": 9, "select": [9, 10, 14, 24], "roughli": 9, "averag": [9, 10, 14], "process_attach": 9, "attachment_partition": 9, "pgp": 9, "encrypt": 9, "emit": 9, "warn": [9, 17], "book": [9, 24], "epub3": 9, "pandoc": [9, 20, 21], "system": [9, 10, 12, 15, 17, 19, 20, 21, 24], "winter": 9, "invoc": 9, "equival": 9, "10k": [9, 20, 21], "illustr": 9, "fetch": 9, "agent": [9, 44], "yourscriptnam": 9, "websit": 9, "articl": 9, "grab": [9, 20, 22], "site": [9, 24, 36, 47], "convent": 9, "activ": [9, 10, 14, 17], "html_assemble_articl": 9, "deu": 9, "german": 9, "pack": 9, "pars": [9, 14, 16, 17, 42], "swedish": 9, "swe": 9, "infer_table_structur": [9, 20, 21], "recoomend": 9, "readm": 9, "similar": [9, 10, 14, 20, 22], "rest": [9, 20, 23], "narr": [9, 10, 14, 20, 21], "contextlib": 9, "exitstack": 9, "stack": [9, 19], "enter_context": 9, "metadata_filenam": 9, "execut": [9, 12, 13, 20, 21], "token": [9, 10, 14, 17, 19, 26, 30, 32, 33, 39, 
43, 48], "authent": 9, "pdfminer": 9, "copi": 9, "protect": 9, "cannot": 9, "fail": [9, 12, 13], "issu": [9, 17, 20, 22, 24], "powerpoint": 9, "paragraph_group": 9, "group_broken_paragraph": 9, "yourself": 9, "explicitli": 9, "my_api_kei": 9, "messag": [9, 24], "rfc822": 9, "ad": [9, 10, 11, 14, 25, 42], "da": 9, "1p": 9, "api_url": 9, "5000": 9, "sheet": [9, 24], "xml_path": 9, "conjunt": 9, "restrict": 9, "factbook": 9, "packag": [10, 15, 16, 17, 24], "ingest": [10, 12, 13, 15, 19, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "dictionari": [10, 14, 19, 24], "labelstudio": 10, "upload": [10, 12, 13, 14, 19], "label": 10, "label_studio": [10, 14], "narrative_text": 10, "dump": [10, 14], "indent": [10, 14, 24], "isd": 10, "isd_csv": 10, "panda": [10, 14], "datafram": [10, 14, 19], "df": 10, "repres": [10, 20, 21, 22, 24], "prodigi": 10, "write": [10, 13, 15, 19], "prodigy_csv_data": 10, "w": [10, 14], "csv_file": 10, "argilla": 10, "dataset": [10, 19, 20, 21, 22, 23], "argilla_task": [10, 19], "text_classif": [10, 19], "token_classif": [10, 19], "text2text": [10, 19], "nltk": 10, "argilla_dataset": 10, "basepl": 10, "llm": [10, 19], "backend": [10, 19], "spreadsheet": [10, 19], "interfac": [10, 14, 19, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "elementmetadata": [10, 24], "wonder": 10, "stori": 10, "ran": 10, "chicken": 10, "coop": 10, "flew": 10, "row": [10, 19], "element_id": [10, 12, 20, 21], "ad270eefd1cc68d15f4d3e51666d4dc8": 10, "8275769fdd1804f9f2b55ad3c9b0ef1b": 10, "datasaur": 10, "text1": 10, "text2": 10, "datasaur_data": 10, "entiti": [10, 12, 19], "hi": [10, 24], "matt": 10, "start_idx": 10, "end_idx": 10, "labelbox": 10, "cloud": [10, 19, 25], "output_directori": [10, 19], "storag": [10, 11, 12, 13, 15, 19, 20, 22, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "servic": [10, 15, 19, 38], 
"config": [10, 12, 19], "dict": [10, 19], "written": [10, 12, 19], "s3": [10, 12, 13, 17, 19, 24, 25, 31], "aw": 10, "sync": 10, "url_prefix": 10, "demonstr": 10, "bucket": [10, 19], "label_box": 10, "s3_bucket_nam": 10, "s3_bucket_key_prefix": 10, "access": [10, 15, 20, 22, 24, 26, 43], "s3_url_prefix": 10, "amazonaw": 10, "local_output_directori": 10, "tmp": 10, "labelbox_config": 10, "external_id": 10, "id1": 10, "id2": 10, "raw_text": 10, "create_directori": 10, "labelbox_config_fil": 10, "upload_staged_fil": 10, "s3f": 10, "s3filesystem": 10, "listdir": 10, "filepath": 10, "upload_kei": 10, "put_fil": 10, "lpath": 10, "rpath": 10, "folder": [10, 19, 38, 43], "project": [10, 14, 19, 20, 22], "label_studio_data": [10, 14], "text_field": [10, 14], "my_text": 10, "id_field": [10, 14], "my_id": 10, "annot": [10, 14, 19], "labelstudioannot": [10, 14], "labelstudioresult": [10, 14], "posit": [10, 14], "from_nam": [10, 14], "sentiment": [10, 19], "to_nam": [10, 14], "score": [10, 14], "labelstudiopredict": [10, 14], "68": [10, 14], "misc": 10, "prodigy_data": 10, "jsonl": [10, 19], "feed": 10, "loader": [10, 19], "save_as_jsonl": 10, "fit": [10, 19, 20, 23], "attent": [10, 19], "window": [10, 12, 19], "autotoken": 10, "automodelfortokenclassif": 10, "huggingfac": 10, "hf": 10, "intern": 10, "test": [10, 12, 16, 29, 30, 37, 39], "tini": 10, "bert": 10, "from_pretrain": 10, "ner": 10, "frost": 10, "advisori": 10, "morn": 10, "strong": 10, "cold": 10, "front": 10, "later": [10, 20, 21], "week": 10, "chanc": 10, "refresh": 10, "crisp": 10, "air": 10, "pronounc": 10, "goe": 10, "were": 10, "place": [10, 19, 24], "across": [10, 20, 22], "portion": 10, "appalachian": 10, "coastal": 10, "temperatur": 10, "drop": 10, "40": 10, "far": 10, "south": 10, "florida": 10, "panhandl": 10, "And": [10, 20, 21, 23], "had": 10, "report": [10, 20, 21], "snow": 10, "season": 10, "sundai": 10, "citi": 10, "moder": 10, "dure": 10, "next": [10, 14, 20, 22], "much": 10, "east": [10, 13, 31], 
"stai": 10, "right": [10, 20, 21], "norm": 10, "blast": 10, "potenti": 10, "hazard": 10, "condit": 10, "weather": 10, "evolv": 10, "continu": [10, 24], "weekend": 10, "coupl": 10, "move": 10, "central": 10, "eastern": 10, "center": 10, "said": 10, "potent": 10, "canada": 10, "punch": 10, "chilli": 10, "heavi": 10, "rain": 10, "wind": 10, "slight": 10, "excess": 10, "rainfal": 10, "northeast": 10, "england": 10, "thursdai": 10, "york": 10, "buffalo": 10, "burlington": 10, "out": [10, 20, 21, 22], "flash": 10, "flood": 10, "confid": [10, 20, 22], "grow": 10, "region": [10, 20, 21], "experi": 10, "gusti": 10, "period": [10, 20, 22, 32], "along": [10, 14], "ahead": 10, "passag": 10, "nation": 10, "wrote": 10, "accompani": 10, "bring": 10, "inch": 10, "isol": 10, "locat": [10, 20, 22], "ensembl": 10, "forecast": 10, "median": 10, "total": 10, "wednesdai": 10, "night": 10, "half": 10, "spot": 10, "substanti": 10, "grand": 10, "rapid": 10, "enough": [10, 20, 22], "mix": 10, "fridai": 10, "especi": [10, 20, 22], "higher": 10, "terrain": 10, "north": 10, "toward": 10, "cadillac": 10, "mph": 10, "caus": 10, "tree": 10, "limb": 10, "sporad": 10, "outag": 10, "behind": 10, "coast": 10, "degre": 10, "workweek": 10, "go": [10, 14], "50": 10, "great": 10, "lake": [10, 13, 31], "explain": 10, "reinforc": 10, "shot": 10, "countri": 10, "keyword": 10, "buffer": [10, 19], "leav": [10, 20, 22], "cl": 10, "sequenc": 10, "max_input_s": 10, "size": [10, 14, 20, 22, 23], "model_max_length": 10, "split_funct": [10, 19], "space": 10, "chunk_separ": [10, 19], "concat": 10, "adjac": 10, "reconstruct": 10, "oper": [10, 19], "chunk_by_attention_window": [10, 19], "helper": [10, 19], "weaviat": [10, 20, 22], "databas": [10, 19, 21, 41], "create_unstructured_weaviate_class": 10, "class_nam": 10, "unstructured_class": 10, "unstructureddocu": 10, "8080": [10, 14], "batch": [10, 11, 12, 13, 15, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 
"been": 10, "generate_uuid5": 10, "data_object": 10, "batch_siz": 10, "tqdm": 10, "add_data_object": 10, "unstructured_class_nam": 10, "uuid": [10, 20, 21], "favorit": [11, 15, 19, 25], "effortless": [11, 15, 25], "constantli": [11, 25], "let": [11, 25], "u": [11, 13, 20, 21, 25, 31, 44], "slack": [11, 17, 25], "delta": [11, 17, 25], "azur": [11, 17, 24, 25, 42], "cognit": [11, 17], "search": [11, 17, 20, 22, 44], "record": [12, 13, 31], "filesystem": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "those": [12, 13, 20, 21], "pip": [12, 13, 17, 19, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "upstream": [12, 13, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "connector": [12, 13, 15, 17, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "ones": [12, 13, 20, 22], "conveni": [12, 13], "command": [12, 13, 14, 17], "utic": [12, 13, 29, 31, 37, 45], "dev": [12, 13, 17, 20, 21, 31, 45], "tech": [12, 13, 19, 31, 45], "fixtur": [12, 13, 29, 31, 37, 45], "small": [12, 20, 23, 45], "anonym": [12, 45], "dir": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "num": [12, 26, 27, 28, 29, 30, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49], "verbos": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "azure_search_api_kei": 12, "endpoint": [12, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "azure_search_endpoint": 12, "subprocess": [12, 13], "getenv": [12, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "popen": [12, 13], "stdout": [12, 13], "pipe": [12, 13], "error": [12, 13, 17, 20, 21, 22], "returncod": [12, 13], 
"successfulli": [12, 13], "els": [12, 13], "cli": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "mind": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "sure": [12, 20, 21], "being": [12, 24], "odata": 12, "context": [12, 20, 22], "net": [12, 30, 39], "etag": 12, "0x8dbb93e09c8f4bd": 12, "edm": 12, "collect": [12, 20, 22], "400": 12, "vectorsearchconfigur": 12, "complextyp": 12, "category_depth": [12, 24], "int32": 12, "parent_id": [12, 24], "attached_to_filenam": [12, 24], "last_modifi": [12, 24], "datetimeoffset": 12, "file_directori": [12, 24, 26, 30, 34, 39, 43], "data_sourc": [12, 26, 30, 34, 39, 43], "date_cr": 12, "date_modifi": 12, "date_process": [12, 26, 30, 34, 39, 43], "record_loc": 12, "coordin": 12, "layout_width": 12, "doubl": 12, "layout_height": 12, "page_numb": [12, 24], "link_url": [12, 24], "link_text": [12, 24], "sent_from": [12, 24], "sent_to": [12, 24], "subject": [12, 24], "emphasized_text_cont": [12, 24], "emphasized_text_tag": [12, 24], "regex_metadata": [12, 24], "detection_class_prob": [12, 24], "vectorsearch": 12, "algorithmconfigur": 12, "kind": 12, "hnsw": 12, "hnswparamet": 12, "metric": 12, "cosin": 12, "efconstruct": 12, "efsearch": 12, "rais": 13, "exist": [13, 14, 20, 22, 24], "uri": [13, 31], "deltat": [13, 31], "storage_opt": [13, 31], "aws_region": [13, 31], "aws_access_key_id": [13, 31], "aws_secret_access_kei": [13, 31], "json_data": 13, "dest": 13, "preserv": [13, 20, 22, 28, 32], "too": 14, "larg": [14, 21, 23], "repo": [14, 17, 20, 21], "sec": [14, 19], "assum": 14, "dummi": 14, "info": [14, 17], "edgar": 14, "stage_for_label_studio": [14, 19], "risk_sect": 14, "prepopul": 14, "ui": 14, "feel": 14, "free": 14, "step": [14, 17, 20, 21, 22], "append": 14, "final": [14, 20, 21], "omit": 14, "did": 14, "studio": 14, "setup": [14, 20, 21], "author": [14, 42], "labelstudio_token": 14, "project_id": 14, 
"to_dict": [14, 24], "exif": 14, "exif_data": 14, "file_util": 14, "get_jpg_metadata": 14, "get_docx_metadata": 14, "get_xlsx_metadata": 14, "tool": [14, 17, 20, 22], "get_directory_file_info": 14, "recurs": [14, 29, 33, 37, 38, 40, 41, 42, 43, 46, 47], "subdirectori": [14, 17], "file_info": 14, "value_count": 14, "dtype": 14, "int64": 14, "groupbi": 14, "mean": [14, 20, 22, 23], "files": 14, "660200e": 14, "490885e": 14, "05": 14, "228404e": 14, "06": 14, "276400e": 14, "429245e": 14, "832900e": 14, "6": 14, "113333e": 14, "02": [14, 48], "765000e": 14, "03": 14, "7": [14, 36], "135000e": 14, "advanc": [15, 20, 22, 23], "destin": 15, "track": [15, 24], "easi": [15, 19, 20, 22], "popular": 15, "ml": [15, 19], "practic": [15, 20, 22], "haven": 16, "howev": [16, 20, 22], "flag": [16, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "everi": [16, 17], "branch": [16, 17, 35, 36], "dt": 16, "bash": 16, "exec": 16, "plan": 16, "acceler": 16, "exclud": [16, 26, 30, 34, 39, 43], "dockerfil": 16, "necessari": [16, 17], "python3": 16, "partition_text": [16, 20, 21, 24], "complet": 17, "cater": [17, 20, 22], "beyond": 17, "airtabl": [17, 24, 25, 41, 42, 43, 49], "biom": [17, 25], "confluenc": [17, 24, 25], "discord": [17, 24, 25], "dropbox": [17, 24, 25], "elasticsearch": [17, 24, 25], "gc": [17, 24, 37], "github": [17, 25], "gitlab": [17, 25], "googl": [17, 24, 25], "drive": [17, 24, 25], "jira": [17, 24, 25], "notion": [17, 25], "onedr": [17, 24, 42], "reddit": [17, 25], "sharepoint": [17, 24, 25], "salesforc": [17, 25], "wikipedia": [17, 24, 25], "involv": [17, 20, 22], "anaconda": 17, "stackoverflow": 17, "pycocotool": 17, "env": 17, "yml": 17, "virtual": 17, "virtualenviron": 17, "challeng": 17, "offici": 17, "compat": 17, "altern": [17, 20, 22], "pip3": 17, "git": [17, 35, 36], "philferrier": 17, "cocoapi": 17, "egg": 17, "pythonapi": 17, "outlin": 17, "clone": 17, "ivanpp": 17, "cd": 17, "iopath": 17, 
"facebookresearch": 17, "Then": 17, "753": 17, "file_io": 17, "py": 17, "parsed_url": 17, "navig": 17, "verifi": [17, 20, 22], "root": 17, "esri": 17, "detectron2_tool": 17, "modul": 17, "conflict": 17, "describ": 17, "kmp_duplicate_lib_ok": 17, "prevent": 17, "libiomp5md": 17, "dll": 17, "link": [17, 18, 19, 24, 42], "numpi": 17, "np": 17, "img": 17, "arrai": 17, "lang": 17, "use_gpu": 17, "show_log": 17, "log_level": 17, "debug": 17, "mac": 17, "brew": 17, "One": [17, 20, 21, 22, 25], "debian": 17, "sudo": 17, "apt": 17, "y": [17, 24], "forg": 17, "libxml2": 17, "libxlst": 17, "libxslt": 17, "rust": 17, "properli": 17, "proto": 17, "tlsv1": 17, "ssf": 17, "sh": 17, "rustup": 17, "sentencepiec": 17, "earlier": 17, "while": 17, "remain": 17, "newer": [17, 20, 22], "backward": 17, "might": [17, 20, 21, 22], "deprec": 17, "futur": [17, 20, 21, 24], "releas": 17, "advis": 17, "transit": 17, "lot": [18, 20, 21, 22], "docker": 18, "develop": [19, 20, 22, 23], "framework": 19, "stage_for_argilla": 19, "stage_for_basepl": 19, "stage_for_datasaur": 19, "customis": 19, "stage_for_transform": 19, "window_s": 19, "stage_for_label_box": 19, "With": 19, "incredibli": 19, "matter": [19, 20, 23], "unstructuredfileload": 19, "document_load": 19, "state_of_the_union": 19, "checkout": 19, "gpt": [19, 20, 22], "llama": [19, 20, 22], "split_docu": 19, "separ": 19, "simpl": 19, "pathlib": 19, "llama_index": 19, "download_load": 19, "unstructuredread": 19, "load_data": 19, "10k_file": 19, "llamahub": 19, "convert_to_datafram": 19, "stage_for_prodigi": 19, "stage_csv_for_prodigi": 19, "compon": [19, 20, 21, 23], "emerg": 19, "stage_for_weavi": 19, "aim": [20, 23], "simplifi": [20, 23], "streamlin": [20, 23], "toolkit": [20, 23], "digest": [20, 22, 23], "usabl": [20, 22, 23], "softwar": [20, 23, 49], "scalabl": [20, 23], "quantiti": [20, 23], "enterpris": [20, 23], "hope": [20, 23], "late": [20, 23], "seamless": [20, 23], "classic": [20, 23], "modern": [20, 21, 23], "myriad": [20, 22, 
23], "effici": [20, 22, 23], "regardless": [20, 23], "customiz": [20, 23], "extend": [20, 23], "fine": [20, 22, 23], "tune": [20, 22, 23], "etl": [20, 23], "eager": [20, 23], "dive": [20, 23], "over": [20, 23], "minut": [20, 23], "explor": [20, 23], "natur": [20, 22], "encompass": [20, 22], "broad": [20, 22], "spectrum": [20, 22], "methodologi": [20, 22], "introduc": [20, 22], "fundament": [20, 22], "crucial": [20, 22], "often": [20, 22], "signific": [20, 22], "segreg": [20, 22], "smaller": [20, 22], "manag": [20, 22], "anomali": [20, 22], "fill": [20, 22], "miss": [20, 22], "elimin": [20, 22], "irrelev": [20, 22], "erron": [20, 22], "significantli": [20, 22], "influenc": [20, 22], "outcom": [20, 22], "subsequ": [20, 21, 22], "consist": [20, 22], "divid": [20, 22, 24], "lengthi": [20, 22], "meaning": [20, 22], "textual": [20, 22], "fix": [20, 22], "semant": [20, 22], "cluster": [20, 22], "priorit": [20, 22], "aspect": [20, 21, 22], "manner": [20, 22], "foundat": [20, 22], "groundwork": [20, 22], "proper": [20, 22], "vastli": [20, 22], "improv": [20, 22], "decompos": [20, 22], "analyz": [20, 22], "vast": [20, 22], "amount": [20, 22], "capac": [20, 22], "comprehend": [20, 22], "human": [20, 22], "art": [20, 22], "multitud": [20, 22], "domain": [20, 22], "chatgpt": [20, 22], "anthrop": [20, 22], "claud": [20, 22], "revolution": [20, 22], "landscap": [20, 22], "prowess": [20, 22], "inher": [20, 22], "suffer": [20, 22], "drawback": [20, 22], "major": [20, 22], "static": [20, 22], "frozen": [20, 22], "instanc": [20, 22, 24], "knowledg": [20, 22], "limit": [20, 21, 22], "septemb": [20, 22], "blind": [20, 22], "despit": [20, 22], "respond": [20, 22], "unwarr": [20, 22], "phenomenon": [20, 22], "known": [20, 22], "hallucin": [20, 22], "Such": [20, 22], "detriment": [20, 22], "serv": [20, 22], "critic": [20, 22], "groundbreak": [20, 22], "counteract": [20, 22], "pair": [20, 22], "underli": [20, 22], "transpar": [20, 22], "approach": [20, 22], "claim": [20, 22], "accuraci": 
[20, 22], "trust": [20, 22], "among": [20, 22], "moreov": [20, 22], "cost": [20, 22], "effect": [20, 22], "solut": [20, 22], "financi": [20, 22], "burden": [20, 22], "finetun": [20, 22], "situat": [20, 22], "suffici": [20, 22], "reduct": [20, 22], "resourc": [20, 22], "consumpt": [20, 22], "particularli": [20, 22], "benefici": [20, 22], "organ": [20, 22], "lack": [20, 22], "deploi": [20, 22], "scratch": [20, 22], "acquir": [20, 22], "super": [20, 22], "ve": [20, 21, 22], "artifact": [20, 22], "unneccesari": [20, 22], "found": [20, 22], "consum": [20, 22, 46], "haystack": [20, 22], "funcion": [20, 22], "numer": [20, 22], "coher": [20, 22], "hug": [20, 22], "face": [20, 22], "pinecon": [20, 22], "milvu": [20, 22], "chromadd": [20, 22], "prompt": [20, 22], "blog": [20, 22], "concis": [20, 21], "swiftli": [20, 21], "sdk": [20, 21], "immedi": [20, 21], "vari": [20, 21], "poppler": [20, 21], "opt": [20, 21], "congratul": [20, 21], "success": [20, 21], "cover": [20, 21], "cut": [20, 21], "chase": [20, 21], "goal": [20, 21], "categor": [20, 21], "associ": [20, 21, 24], "cell": [20, 21], "observ": [20, 21], "figurecapt": [20, 21], "uncategorizedtext": [20, 21], "formula": [20, 21], "figur": [20, 21], "notic": [20, 21], "suitabl": [20, 21], "text_typ": [20, 21], "sentence_count": [20, 21], "100": [20, 21, 24], "isinst": [20, 21], "rel": [20, 21, 24], "would": [20, 21], "model1": [20, 21], "publaynet": [20, 21], "38": [20, 21], "scientif": [20, 21], "prima": [20, 21], "scan": [20, 21], "magazin": [20, 21], "newspap": [20, 21], "17": [20, 21], "20th": [20, 21], "centuri": [20, 21], "tablebank": [20, 21], "18": [20, 21], "busi": [20, 21], "hjdataset": [20, 21], "31": [20, 21], "histori": [20, 21], "japanes": [20, 21], "thead": [20, 21], "tr": [20, 21], "td": [20, 21], "convert_to_dict": [20, 21], "seen": [20, 21], "elements_to_json": [20, 21], "elements_from_json": [20, 21], "sha": [20, 21], "256": [20, 21], "determinist": [20, 21], "downsid": [20, 21], "collis": [20, 21], 
"unique_element_id": [20, 21], "conclud": [20, 21], "input_filenam": [20, 21], "output_filenam": [20, 21], "concept": 21, "product": 22, "uniqu": 23, "filter": 24, "last": 24, "xy": 24, "further": 24, "hierarchi": 24, "parent": 24, "resid": 24, "overal": 24, "depth": 24, "partition": 24, "processor": 24, "nativ": 24, "reflect": 24, "h1": 24, "h2": 24, "h3": 24, "probabl": 24, "emphas": 24, "bold": 24, "ital": 24, "origin": 24, "num_charact": 24, "max_charact": 24, "add_chunking_strategi": 24, "is_continu": 24, "previou": 24, "due": 24, "usual": 24, "corner": 24, "top": 24, "left": 24, "proceed": 24, "counter": 24, "clockwis": 24, "pixel": 24, "increas": 24, "downward": 24, "direct": 24, "typic": 24, "pixelspac": 24, "orient": 24, "width": 24, "height": 24, "convert_coordinates_to_new_system": 24, "in_plac": 24, "alter": 24, "relativecoordinatesystem": 24, "200": 24, "coordinate_system": 24, "850": 24, "1100": 24, "term": 24, "page_nam": 24, "even_onli": 24, "favor": 24, "rfc": 24, "822": 24, "spec": 24, "sent": [24, 43], "ever": 24, "view": 24, "fsspec": 24, "protocol": 24, "channel": [24, 32, 48], "pname": [24, 42], "speak": 24, "person": [26, 43], "airtable_personal_access_token": [26, 43], "reprocess": [26, 43], "partitionconfig": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "readconfig": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "runner": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "__name__": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "__main__": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "read_config": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "partition_config": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 
44, 45, 46, 47, 48, 49], "output_dir": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "num_process": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "personal_access_token": 26, "partition_by_api": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "unstructured_api_kei": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "abf": 27, "container1": 27, "azureunstructured1": 27, "remote_url": [27, 29, 33, 37, 45], "account_nam": 27, "oa_pdf": 28, "07": 28, "sbaa031": 28, "073": 28, "pmc7234218": 28, "preserve_download": [28, 32], "box_app_config": 29, "box_app_config_path": 29, "atlassian": [30, 39], "12345678": [30, 32, 39, 48], "abcde1234abde1234abcde1234": [30, 39], "metadata_exclud": [30, 34, 39], "user_email": [30, 39, 43], "api_token": [30, 39], "delta_t": 31, "table_uri": 31, "discord_token": 32, "download_dir": 32, "dropbox_token": 33, "9200": 34, "movi": 34, "ethnic": 34, "director": 34, "plot": 34, "index_nam": 34, "jq_queri": 34, "git_branch": [35, 36], "docsi": 36, "gdrive": 38, "google_dr": 38, "drive_id": 38, "popul": [38, 41], "WITH": 38, "OR": 38, "service_account_kei": 38, "input_path": 40, "comma": 41, "delimit": 41, "page_id": 41, "OF": 41, "database_id": 41, "cred": [42, 43, 47], "secret": [42, 44, 47], "login": 42, "microsoftonlin": 42, "tenant": [42, 43, 47], "tenant_id": 42, "princip": 42, "client_id": [42, 43, 44, 47], "client_cr": [42, 43, 47], "authority_url": 42, "user_pnam": 42, "ms_client_id": 43, "ms_client_cr": 43, "ms_tenant_id": 43, "ms_user_email": 43, "inbox": 43, "outlook_fold": 43, "subreddit": 44, "machinelearn": 44, "fetcher": 44, "subreddit_nam": 44, "client_secret": 44, "user_ag": 44, "search_queri": 44, "num_post": 44, "usernam": 46, "salesforce_usernam": 46, "salesforce_consumer_kei": 46, "privat": 46, "salesforce_private_key_path": 
46, "emailmessag": 46, "consumer_kei": 46, "private_key_path": 46, "contoso": 47, "admin": 47, "share": 47, "files_onli": 47, "01t01": 48, "00": 48, "08": 48, "start_dat": 48, "end_dat": 48, "page_titl": 49, "auto_suggest": 49}, "objects": {}, "objtypes": {}, "objnames": {}, "titleterms": {"unstructur": [0, 15, 17], "api": [0, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "support": 0, "file": 0, "type": [0, 24], "paramet": 0, "coordin": [0, 24], "encod": 0, "ocr": 0, "languag": [0, 20, 22], "output": 0, "format": 0, "page": 0, "break": 0, "strategi": [0, 3], "beta": 0, "version": [0, 17], "hi_r": 0, "chipper": 0, "model": [0, 2, 20, 22], "tabl": [0, 13, 20, 21, 31], "extract": [0, 8, 14, 24], "pdf": 0, "other": 0, "filetyp": [0, 17], "xml": [0, 17], "tag": 0, "us": [0, 2, 16, 20, 23], "local": [0, 12, 13, 17, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "docker": [0, 16], "imag": [0, 16], "develop": 0, "best": 1, "practic": 1, "non": 2, "default": 2, "bring": 2, "your": [2, 16], "own": [2, 16], "brick": 4, "chunk": [5, 20, 22], "chunk_by_titl": 5, "clean": 6, "bytes_string_to_str": 6, "clean_bullet": 6, "clean_dash": 6, "clean_extra_whitespac": 6, "clean_non_ascii_char": 6, "clean_ordered_bullet": 6, "clean_postfix": 6, "clean_prefix": 6, "clean_trailing_punctu": 6, "group_broken_paragraph": [6, 8], "remove_punctu": [6, 8], "replace_unicode_quot": [6, 8], "translate_text": [6, 8], "embed": [7, 20, 22], "baseembeddingencod": 7, "openaiembeddingencod": 7, "extract_datetimetz": 8, "extract_email_address": 8, "extract_ip_address": 8, "extract_ip_address_nam": 8, "extract_mapi_id": 8, "extract_ordered_bullet": 8, "extract_text_aft": 8, "extract_text_befor": 8, "extract_us_phone_numb": 8, "partit": [9, 20, 21], "partition_csv": 9, "partition_doc": 9, "partition_docx": 9, "partition_email": 9, "partition_epub": 9, "partition_html": 9, "partition_imag": 9, 
"partition_md": 9, "partition_msg": 9, "partition_multiple_via_api": 9, "partition_odt": 9, "partition_org": 9, "partition_pdf": 9, "partition_ppt": 9, "partition_pptx": 9, "partition_rst": 9, "partition_rtf": 9, "partition_text": 9, "partition_tsv": 9, "partition_via_api": 9, "partition_xlsx": 9, "partition_xml": 9, "stage": 10, "convert_to_csv": 10, "convert_to_datafram": 10, "convert_to_dict": 10, "dict_to_el": 10, "stage_csv_for_prodigi": 10, "stage_for_argilla": 10, "stage_for_basepl": 10, "stage_for_datasaur": 10, "stage_for_label_box": 10, "stage_for_label_studio": 10, "stage_for_prodigi": 10, "stage_for_transform": 10, "stage_for_weavi": 10, "destin": 11, "connector": [11, 24, 25], "azur": [12, 27], "cognit": 12, "search": 12, "run": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "sampl": 12, "index": 12, "schema": 12, "delta": [13, 31], "exampl": 14, "sentiment": 14, "analysi": 14, "label": [14, 19], "labelstudio": 14, "metadata": [14, 24], "from": 14, "document": [14, 15, 20, 21, 24], "explor": 14, "sourc": [14, 25], "core": 15, "librari": 15, "instal": [16, 17, 18, 20, 21], "prerequisit": 16, "pull": 16, "build": 16, "interact": 16, "python": 16, "insid": 16, "contain": 16, "full": 17, "conda": 17, "window": 17, "set": 17, "up": [17, 20, 21], "infer": 17, "paddleocr": 17, "log": 17, "extra": 17, "depend": 17, "detect": 17, "html": 17, "huggingfac": 17, "note": 17, "older": 17, "integr": 19, "argilla": 19, "basepl": 19, "datasaur": 19, "hug": 19, "face": 19, "labelbox": 19, "studio": 19, "langchain": 19, "llamaindex": 19, "panda": 19, "prodigi": 19, "weaviat": 19, "introduct": [20, 23], "overview": [20, 23], "product": [20, 23], "offer": [20, 23], "kei": [20, 22, 23], "featur": [20, 23], "common": [20, 23, 24], "case": [20, 23], "quickstart": [20, 23], "tutori": [20, 23], "concept": [20, 22], "data": [20, 22, 24], "ingest": [20, 22], "preprocess": [20, 22], "text": [20, 22], "vector": [20, 22], 
"databas": [20, 22], "token": [20, 22], "larg": [20, 22], "llm": [20, 22], "retriev": [20, 22], "augment": [20, 22], "gener": [20, 22], "get": [20, 21], "start": [20, 21], "quick": [20, 21], "valid": [20, 21], "element": [20, 21], "convert": [20, 21], "dictionari": [20, 21], "json": [20, 21], "uniqu": [20, 21], "id": [20, 21], "wrap": [20, 21], "all": [20, 21], "field": 24, "addit": 24, "email": 24, "microsoft": 24, "excel": 24, "word": 24, "via": [24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "record": 24, "locat": 24, "advanc": 24, "option": 24, "regex": 24, "airtabl": 26, "biom": 28, "box": 29, "confluenc": 30, "discord": 32, "dropbox": 33, "elasticsearch": 34, "github": 35, "gitlab": 36, "googl": [37, 38], "cloud": 37, "storag": 37, "drive": [38, 42], "jira": 39, "notion": 41, "One": 42, "outlook": 43, "reddit": 44, "s3": 45, "salesforc": 46, "sharepoint": 47, "slack": 48, "wikipedia": 49}, "envversion": {"sphinx.domains.c": 2, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 8, "sphinx.domains.index": 1, "sphinx.domains.javascript": 2, "sphinx.domains.math": 2, "sphinx.domains.python": 3, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 57}, "alltitles": {"Unstructured API": [[0, "unstructured-api"]], "Supported File Types": [[0, "supported-file-types"]], "Parameters": [[0, "parameters"]], "Coordinates": [[0, "coordinates"], [24, "coordinates"]], "Encoding": [[0, "encoding"]], "OCR Languages": [[0, "ocr-languages"]], "Output Format": [[0, "output-format"]], "Page Break": [[0, "page-break"]], "Strategies": [[0, "strategies"], [3, "strategies"]], "Beta Version: hi_res Strategy with Chipper Model": [[0, "beta-version-hi-res-strategy-with-chipper-model"]], "Table Extraction": [[0, "table-extraction"]], "PDF Table Extraction": [[0, "pdf-table-extraction"]], "Table Extraction for other filetypes": [[0, "table-extraction-for-other-filetypes"]], "XML Tags": [[0, 
"xml-tags"]], "Using the API Locally": [[0, "using-the-api-locally"]], "Using Docker Images": [[0, "using-docker-images"]], "Developing with the API Locally": [[0, "developing-with-the-api-locally"]], "Best Practices": [[1, "best-practices"]], "Models": [[2, "models"]], "Using a Non-Default Model": [[2, "using-a-non-default-model"]], "Bring Your Own Models": [[2, "bring-your-own-models"]], "Bricks": [[4, "bricks"]], "Chunking": [[5, "chunking"]], "chunk_by_title": [[5, "chunk-by-title"]], "Cleaning": [[6, "cleaning"]], "bytes_string_to_string": [[6, "bytes-string-to-string"]], "clean": [[6, "clean"]], "clean_bullets": [[6, "clean-bullets"]], "clean_dashes": [[6, "clean-dashes"]], "clean_extra_whitespace": [[6, "clean-extra-whitespace"]], "clean_non_ascii_chars": [[6, "clean-non-ascii-chars"]], "clean_ordered_bullets": [[6, "clean-ordered-bullets"]], "clean_postfix": [[6, "clean-postfix"]], "clean_prefix": [[6, "clean-prefix"]], "clean_trailing_punctuation": [[6, "clean-trailing-punctuation"]], "group_broken_paragraphs": [[6, "group-broken-paragraphs"], [8, "group-broken-paragraphs"]], "remove_punctuation": [[6, "remove-punctuation"], [8, "remove-punctuation"]], "replace_unicode_quotes": [[6, "replace-unicode-quotes"], [8, "replace-unicode-quotes"]], "translate_text": [[6, "translate-text"], [8, "translate-text"]], "Embedding": [[7, "embedding"]], "BaseEmbeddingEncoder": [[7, "baseembeddingencoder"]], "OpenAIEmbeddingEncoder": [[7, "openaiembeddingencoder"]], "Extracting": [[8, "extracting"]], "extract_datetimetz": [[8, "extract-datetimetz"]], "extract_email_address": [[8, "extract-email-address"]], "extract_ip_address": [[8, "extract-ip-address"]], "extract_ip_address_name": [[8, "extract-ip-address-name"]], "extract_mapi_id": [[8, "extract-mapi-id"]], "extract_ordered_bullets": [[8, "extract-ordered-bullets"]], "extract_text_after": [[8, "extract-text-after"]], "extract_text_before": [[8, "extract-text-before"]], "extract_us_phone_number": [[8, 
"extract-us-phone-number"]], "Partitioning": [[9, "partitioning"]], "partition": [[9, "partition"]], "partition_csv": [[9, "partition-csv"]], "partition_doc": [[9, "partition-doc"]], "partition_docx": [[9, "partition-docx"]], "partition_email": [[9, "partition-email"]], "partition_epub": [[9, "partition-epub"]], "partition_html": [[9, "partition-html"]], "partition_image": [[9, "partition-image"]], "partition_md": [[9, "partition-md"]], "partition_msg": [[9, "partition-msg"]], "partition_multiple_via_api": [[9, "partition-multiple-via-api"]], "partition_odt": [[9, "partition-odt"]], "partition_org": [[9, "partition-org"]], "partition_pdf": [[9, "partition-pdf"]], "partition_ppt": [[9, "partition-ppt"]], "partition_pptx": [[9, "partition-pptx"]], "partition_rst": [[9, "partition-rst"]], "partition_rtf": [[9, "partition-rtf"]], "partition_text": [[9, "partition-text"]], "partition_tsv": [[9, "partition-tsv"]], "partition_via_api": [[9, "partition-via-api"]], "partition_xlsx": [[9, "partition-xlsx"]], "partition_xml": [[9, "partition-xml"]], "Staging": [[10, "staging"]], "convert_to_csv": [[10, "convert-to-csv"]], "convert_to_dataframe": [[10, "convert-to-dataframe"]], "convert_to_dict": [[10, "convert-to-dict"]], "dict_to_elements": [[10, "dict-to-elements"]], "stage_csv_for_prodigy": [[10, "stage-csv-for-prodigy"]], "stage_for_argilla": [[10, "stage-for-argilla"]], "stage_for_baseplate": [[10, "stage-for-baseplate"]], "stage_for_datasaur": [[10, "stage-for-datasaur"]], "stage_for_label_box": [[10, "stage-for-label-box"]], "stage_for_label_studio": [[10, "stage-for-label-studio"]], "stage_for_prodigy": [[10, "stage-for-prodigy"]], "stage_for_transformers": [[10, "stage-for-transformers"]], "stage_for_weaviate": [[10, "stage-for-weaviate"]], "Destination Connectors": [[11, "destination-connectors"]], "Azure Cognitive Search": [[12, "azure-cognitive-search"]], "Run Locally": [[12, "run-locally"], [13, "run-locally"], [26, "run-locally"], [27, "run-locally"], [28, 
"run-locally"], [29, "run-locally"], [30, "run-locally"], [31, "run-locally"], [32, "run-locally"], [33, "run-locally"], [34, "run-locally"], [35, "run-locally"], [36, "run-locally"], [37, "run-locally"], [38, "run-locally"], [39, "run-locally"], [40, "run-locally"], [41, "run-locally"], [42, "run-locally"], [43, "run-locally"], [44, "run-locally"], [45, "run-locally"], [46, "run-locally"], [47, "run-locally"], [48, "run-locally"], [49, "run-locally"]], "Sample Index Schema": [[12, "sample-index-schema"]], "Delta Table": [[13, "delta-table"], [31, "delta-table"]], "Examples": [[14, "examples"]], "Sentiment Analysis Labeling in LabelStudio": [[14, "sentiment-analysis-labeling-in-labelstudio"]], "Extracting Metadata from Documents": [[14, "extracting-metadata-from-documents"]], "Exploring Source Documents": [[14, "exploring-source-documents"]], "Unstructured Core Library": [[15, "unstructured-core-library"]], "Library Documentation": [[15, "library-documentation"]], "Docker Installation": [[16, "docker-installation"]], "Prerequisites": [[16, "prerequisites"]], "Pulling the Docker Image": [[16, "pulling-the-docker-image"]], "Using the Docker Image": [[16, "using-the-docker-image"]], "Building Your Own Docker Image": [[16, "building-your-own-docker-image"]], "Interacting with Python Inside the Container": [[16, "interacting-with-python-inside-the-container"]], "Full Installation": [[17, "full-installation"]], "Installation with conda on Windows": [[17, "installation-with-conda-on-windows"]], "Setting up unstructured for local inference": [[17, "setting-up-unstructured-for-local-inference"]], "Installing PaddleOCR": [[17, "installing-paddleocr"]], "Logging": [[17, "logging"]], "Extra Dependencies": [[17, "extra-dependencies"]], "Filetype Detection": [[17, "filetype-detection"]], "XML/HTML Dependencies": [[17, "xml-html-dependencies"]], "Huggingface Dependencies": [[17, "huggingface-dependencies"]], "Note on Older Versions": [[17, "note-on-older-versions"]], 
"Installation": [[18, "installation"]], "Integrations": [[19, "integrations"]], "Integration with Argilla": [[19, "integration-with-argilla"]], "Integration with Baseplate": [[19, "integration-with-baseplate"]], "Integration with Datasaur": [[19, "integration-with-datasaur"]], "Integration with Hugging Face": [[19, "integration-with-hugging-face"]], "Integration with Labelbox": [[19, "integration-with-labelbox"]], "Integration with Label Studio": [[19, "integration-with-label-studio"]], "Integration with LangChain": [[19, "integration-with-langchain"]], "Integration with LlamaIndex": [[19, "integration-with-llamaindex"]], "Integration with Pandas": [[19, "integration-with-pandas"]], "Integration with Prodigy": [[19, "integration-with-prodigy"]], "Integration with Weaviate": [[19, "integration-with-weaviate"]], "Introduction": [[20, "introduction"], [20, "id1"], [23, "introduction"]], "Overview": [[20, "overview"], [23, "overview"]], "Product Offerings": [[20, "product-offerings"], [23, "product-offerings"]], "Key Features": [[20, "key-features"], [23, "key-features"]], "Common Use Cases": [[20, "common-use-cases"], [23, "common-use-cases"]], "Quickstart Tutorial": [[20, "quickstart-tutorial"], [23, "quickstart-tutorial"]], "Key Concepts": [[20, "key-concepts"], [22, "key-concepts"]], "Data Ingestion": [[20, "data-ingestion"], [22, "data-ingestion"]], "Data Preprocessing": [[20, "data-preprocessing"], [22, "data-preprocessing"]], "Chunking Text for Vector Databases": [[20, "chunking-text-for-vector-databases"], [22, "chunking-text-for-vector-databases"]], "Embeddings": [[20, "embeddings"], [22, "embeddings"]], "Vector Databases": [[20, "vector-databases"], [22, "vector-databases"]], "Tokens": [[20, "tokens"], [22, "tokens"]], "Large Language Models (LLMs)": [[20, "large-language-models-llms"], [22, "large-language-models-llms"]], "Retrieval Augmented Generation": [[20, "retrieval-augmented-generation"], [22, "retrieval-augmented-generation"]], "Getting Started": 
[[20, "getting-started"], [21, "getting-started"]], "Quick Installation": [[20, "quick-installation"], [21, "quick-installation"]], "Validating Installation": [[20, "validating-installation"], [21, "validating-installation"]], "Partitioning a document": [[20, "partitioning-a-document"], [21, "partitioning-a-document"]], "Document elements": [[20, "document-elements"], [21, "document-elements"]], "Elements": [[20, "elements"], [21, "elements"]], "Tables": [[20, "tables"], [21, "tables"]], "Converting Elements to Dictionary or JSON": [[20, "converting-elements-to-dictionary-or-json"], [21, "converting-elements-to-dictionary-or-json"]], "Unique Element IDs": [[20, "unique-element-ids"], [21, "unique-element-ids"]], "Wrapping it all up": [[20, "wrapping-it-all-up"], [21, "wrapping-it-all-up"]], "Metadata": [[24, "metadata"]], "Common Metadata Fields": [[24, "common-metadata-fields"]], "Additional Metadata Fields by Document Type": [[24, "additional-metadata-fields-by-document-type"]], "Email": [[24, "email"]], "Microsoft Excel Documents": [[24, "microsoft-excel-documents"]], "Microsoft Word Documents": [[24, "microsoft-word-documents"]], "Data Connector Metadata Fields": [[24, "data-connector-metadata-fields"]], "Common Data Connector Metadata Fields": [[24, "common-data-connector-metadata-fields"]], "Additional Metadata Fields by Connector Type (via record locator)": [[24, "additional-metadata-fields-by-connector-type-via-record-locator"]], "Advanced Metadata Options": [[24, "advanced-metadata-options"]], "Extract Metadata with Regexes": [[24, "extract-metadata-with-regexes"]], "Source Connectors": [[25, "source-connectors"]], "Airtable": [[26, "airtable"]], "Run via the API": [[26, "run-via-the-api"], [27, "run-via-the-api"], [28, "run-via-the-api"], [29, "run-via-the-api"], [30, "run-via-the-api"], [31, "run-via-the-api"], [32, "run-via-the-api"], [33, "run-via-the-api"], [34, "run-via-the-api"], [35, "run-via-the-api"], [36, "run-via-the-api"], [37, 
"run-via-the-api"], [38, "run-via-the-api"], [39, "run-via-the-api"], [40, "run-via-the-api"], [41, "run-via-the-api"], [42, "run-via-the-api"], [43, "run-via-the-api"], [44, "run-via-the-api"], [45, "run-via-the-api"], [46, "run-via-the-api"], [47, "run-via-the-api"], [48, "run-via-the-api"], [49, "run-via-the-api"]], "Azure": [[27, "azure"]], "Biomed": [[28, "biomed"]], "Box": [[29, "box"]], "Confluence": [[30, "confluence"]], "Discord": [[32, "discord"]], "Dropbox": [[33, "dropbox"]], "Elasticsearch": [[34, "elasticsearch"]], "Github": [[35, "github"]], "Gitlab": [[36, "gitlab"]], "Google Cloud Storage": [[37, "google-cloud-storage"]], "Google Drive": [[38, "google-drive"]], "Jira": [[39, "jira"]], "Local": [[40, "local"]], "Notion": [[41, "notion"]], "One Drive": [[42, "one-drive"]], "Outlook": [[43, "outlook"]], "Reddit": [[44, "reddit"]], "S3": [[45, "s3"]], "Salesforce": [[46, "salesforce"]], "Sharepoint": [[47, "sharepoint"]], "Slack": [[48, "slack"]], "Wikipedia": [[49, "wikipedia"]]}, "indexentries": {}})
\ No newline at end of file
+Search.setIndex({"docnames": ["api", "best_practices", "best_practices/models", "best_practices/strategies", "bricks", "bricks/chunking", "bricks/cleaning", "bricks/embedding", "bricks/extracting", "bricks/partition", "bricks/staging", "destination_connectors", "destination_connectors/azure_cognitive_search", "destination_connectors/delta_table", "examples", "index", "installation/docker", "installation/full_installation", "installing", "integrations", "introduction", "introduction/getting_started", "introduction/key_concepts", "introduction/overview", "metadata", "source_connectors", "source_connectors/airtable", "source_connectors/azure", "source_connectors/biomed", "source_connectors/box", "source_connectors/confluence", "source_connectors/delta_table", "source_connectors/discord", "source_connectors/dropbox", "source_connectors/elasticsearch", "source_connectors/github", "source_connectors/gitlab", "source_connectors/google_cloud_storage", "source_connectors/google_drive", "source_connectors/jira", "source_connectors/local_connector", "source_connectors/notion", "source_connectors/onedrive", "source_connectors/outlook", "source_connectors/reddit", "source_connectors/s3", "source_connectors/salesforce", "source_connectors/sharepoint", "source_connectors/slack", "source_connectors/wikipedia"], "filenames": ["api.rst", "best_practices.rst", "best_practices/models.rst", "best_practices/strategies.rst", "bricks.rst", "bricks/chunking.rst", "bricks/cleaning.rst", "bricks/embedding.rst", "bricks/extracting.rst", "bricks/partition.rst", "bricks/staging.rst", "destination_connectors.rst", "destination_connectors/azure_cognitive_search.rst", "destination_connectors/delta_table.rst", "examples.rst", "index.rst", "installation/docker.rst", "installation/full_installation.rst", "installing.rst", "integrations.rst", "introduction.rst", "introduction/getting_started.rst", "introduction/key_concepts.rst", "introduction/overview.rst", "metadata.rst", "source_connectors.rst", 
"source_connectors/airtable.rst", "source_connectors/azure.rst", "source_connectors/biomed.rst", "source_connectors/box.rst", "source_connectors/confluence.rst", "source_connectors/delta_table.rst", "source_connectors/discord.rst", "source_connectors/dropbox.rst", "source_connectors/elasticsearch.rst", "source_connectors/github.rst", "source_connectors/gitlab.rst", "source_connectors/google_cloud_storage.rst", "source_connectors/google_drive.rst", "source_connectors/jira.rst", "source_connectors/local_connector.rst", "source_connectors/notion.rst", "source_connectors/onedrive.rst", "source_connectors/outlook.rst", "source_connectors/reddit.rst", "source_connectors/s3.rst", "source_connectors/salesforce.rst", "source_connectors/sharepoint.rst", "source_connectors/slack.rst", "source_connectors/wikipedia.rst"], "titles": ["Unstructured API", "Best Practices", "Models", "Strategies", "Bricks", "Chunking", "Cleaning", "Embedding", "Extracting", "Partitioning", "Staging", "Destination Connectors", "Azure Cognitive Search", "Delta Table", "Examples", "Unstructured Core Library", "Docker Installation", "Full Installation", "Installation", "Integrations", "Introduction", "Getting Started", "Key Concepts", "Overview", "Metadata", "Source Connectors", "Airtable", "Azure", "Biomed", "Box", "Confluence", "Delta Table", "Discord", "Dropbox", "Elasticsearch", "Github", "Gitlab", "Google Cloud Storage", "Google Drive", "Jira", "Local", "Notion", "One Drive", "Outlook", "Reddit", "S3", "Salesforce", "Sharepoint", "Slack", "Wikipedia"], "terms": {"try": [0, 9, 20, 21], "our": [0, 10, 11, 14, 16, 19, 20, 22, 23, 25], "host": [0, 9, 10, 15, 19, 20, 23], "It": [0, 2, 9, 10, 17, 19, 24], "": [0, 2, 6, 8, 10, 16, 17, 18, 19, 20, 22, 23, 24, 47], "freeli": 0, "avail": [0, 1, 2, 3, 9, 14, 17, 19, 20, 21, 24], "ani": [0, 2, 7, 9, 10, 12, 13, 17, 19, 20, 21, 22], "list": [0, 2, 5, 6, 7, 8, 9, 10, 12, 13, 19, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 
44, 45, 46, 47, 48, 49], "abov": [0, 9, 10, 20, 21, 24], "thi": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 17, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "i": [0, 2, 3, 5, 6, 7, 8, 9, 10, 14, 15, 17, 19, 20, 21, 22, 23, 24, 42], "easiest": [0, 9, 20, 23], "wai": [0, 2, 3, 9, 19, 20, 23], "get": [0, 6, 9, 10, 14, 17, 19, 23], "start": [0, 2, 5, 10, 14, 16, 23, 24, 42, 48], "all": [0, 3, 6, 8, 9, 10, 12, 13, 15, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "you": [0, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "need": [0, 2, 6, 7, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "an": [0, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17, 19, 20, 21, 22, 23, 24], "kei": [0, 7, 9, 10, 12, 19, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "can": [0, 2, 3, 6, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "your": [0, 3, 6, 9, 10, 11, 12, 13, 14, 15, 17, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "here": [0, 2, 6, 8, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "now": [0, 20, 21, 24], "todai": 0, "quick": 0, "exampl": [0, 2, 5, 6, 7, 8, 9, 10, 13, 15, 16, 17, 19, 20, 21, 24, 31, 32, 40], "shell": [0, 12, 13, 16, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "python": [0, 9, 12, 13, 14, 20, 21, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "curl": [0, 14, 17], "x": [0, 14], "post": [0, 14, 17, 20, 22, 24, 44], "http": [0, 5, 7, 9, 10, 12, 14, 17, 30, 34, 36, 39, 42, 47], "io": [0, 9, 16, 19, 30, 35, 39], "gener": [0, 4, 5, 7, 9, 10, 14, 19, 23], "v0": [0, 9, 17, 36], "h": [0, 14], "accept": [0, 2, 9, 10, 12, 13, 19, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "applic": [0, 6, 9, 16, 20, 22, 23, 24], "json": [0, 4, 10, 12, 14, 19, 24], "content": [0, 4, 6, 9, 10, 19, 20, 21, 24, 36], "multipart": 0, "form": [0, 19, 24], "data": [0, 4, 6, 10, 11, 12, 13, 14, 15, 16, 17, 19, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "f": [0, 9, 10, 14, 17, 20, 21], "sampl": [0, 2, 10, 13, 14, 31], "doc": [0, 2, 9, 10, 14, 16, 17, 19, 20, 21, 23, 24, 40], "famili": 0, "dai": [0, 10], "eml": [0, 8, 9, 14, 20, 21, 24], "jq": [0, 34], "c": [0, 17], "less": 0, "r": [0, 6, 8, 9, 17, 24, 44], "import": [0, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "request": [0, 9], "url": [0, 5, 9, 10, 12, 14, 17, 24, 27, 29, 30, 33, 34, 35, 36, 37, 39, 42, 45], "header": [0, 9, 20, 21, 24], "auto": [0, 3, 9, 20, 21], "file_path": 0, "path": [0, 9, 10, 14, 17, 19, 24, 28, 38, 40, 42, 46, 47], "To": [0, 2, 6, 7, 9, 12, 14, 16, 17, 19, 20, 21], "file_data": 0, "open": [0, 9, 10, 14, 17, 19, 20, 21, 23, 49], "rb": [0, 9, 14, 20, 21], "respons": [0, 9, 14, 20, 22], "close": [0, 10], "json_respons": 0, "below": [0, 6, 9, 10, 14, 16, 18, 20, 21, 24], "find": [0, 9, 12, 13, 14, 16, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "more": [0, 6, 8, 9, 10, 12, 13, 15, 19, 20, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "comprehens": [0, 17], 
"overview": [0, 1], "capabl": [0, 20, 23], "For": [0, 6, 8, 9, 10, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "detail": [0, 10, 19, 20, 22, 23, 24], "inform": [0, 1, 3, 6, 8, 9, 10, 12, 13, 15, 17, 19, 20, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "schema": [0, 10], "refer": [0, 16], "document": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 17, 19, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "note": [0, 2, 10, 12, 13, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "also": [0, 5, 6, 8, 9, 10, 14, 16, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "check": [0, 6, 8, 9, 10, 12, 13, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "section": [0, 4, 5, 6, 8, 9, 12, 14, 17, 19, 20, 21, 22, 24], "categori": [0, 20, 21, 24, 46], "plaintext": 0, "html": [0, 5, 6, 9, 14, 15, 19, 20, 21, 24], "md": [0, 9, 17], "msg": [0, 9, 17, 24], "rst": [0, 9, 17], "rtf": [0, 9, 17, 20, 21], "txt": [0, 6, 8, 9, 14, 16, 19], "jpeg": 0, "png": [0, 3, 9], "csv": [0, 4, 9, 10, 17, 19], "docx": [0, 9, 14, 17, 24], "epub": [0, 9, 10, 17, 20, 21, 24], "odt": [0, 9, 17], "ppt": [0, 9, 17, 24], "pptx": [0, 9, 14, 17], "tsv": [0, 9, 17], "xlsx": [0, 9, 14, 17, 24], "current": [0, 9, 14, 20, 21], "pipelin": [0, 2, 10, 14, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "recogn": [0, 9], "choos": [0, 3, 6, 9, 17, 20, 22], "relev": [0, 20, 22, 24], "partit": [0, 2, 3, 4, 5, 6, 7, 10, 15, 16, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "function": [0, 2, 3, 4, 5, 6, 8, 9, 10, 14, 19, 20, 
21], "process": [0, 4, 6, 9, 10, 11, 12, 13, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "when": [0, 2, 5, 6, 9, 10, 14, 20, 21, 22, 24], "element": [0, 2, 3, 4, 5, 6, 7, 9, 10, 14, 16, 19, 22, 24], "ar": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 17, 19, 20, 21, 22, 24, 25], "from": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 17, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "mai": [0, 9, 10, 19, 20, 21, 24], "bound": [0, 24], "box": [0, 17, 24, 25], "well": [0, 14], "set": [0, 2, 5, 6, 7, 8, 9, 12, 14, 20, 23, 24, 45], "true": [0, 5, 6, 8, 9, 10, 12, 17, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "add": [0, 20, 21], "field": [0, 8, 9, 12], "layout": [0, 2, 3, 9, 10, 16, 17, 20, 21, 24], "parser": [0, 2, 6, 9, 10, 16, 17, 20, 21], "paper": [0, 2, 9, 10, 16, 17, 20, 21], "specifi": [0, 2, 3, 5, 6, 8, 9, 10, 16, 19, 24], "decod": [0, 12, 13], "text": [0, 2, 3, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 17, 19, 21, 24], "input": [0, 5, 6, 8, 9, 10, 19, 40], "If": [0, 2, 5, 6, 8, 9, 10, 14, 16, 17, 20, 21, 22, 23, 24], "valu": [0, 5, 9, 10, 14, 19, 20, 22, 24], "provid": [0, 1, 2, 9, 17, 20, 22], "utf": [0, 6], "8": [0, 6, 17], "fake": [0, 9, 16, 20, 21], "power": [0, 9, 10, 15], "point": [0, 6, 8, 9, 12, 14, 17, 24], "utf_8": 0, "what": [0, 9, 19, 20, 23], "ocr_languag": [0, 9], "kwarg": [0, 3, 5, 6, 8, 9, 10, 14, 24], "see": [0, 5, 6, 7, 9, 10, 11, 14, 17, 19, 20, 21, 24, 25], "tesseract": [0, 9, 20, 21], "full": [0, 6, 9, 10, 12, 13, 18, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "instal": [0, 9, 12, 13, 14, 15, 19, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "instruct": [0, 9, 14, 15, 16, 17, 20, 21], "onli": [0, 9, 10, 16, 
17, 20, 21, 22, 24, 47], "appli": [0, 6, 9, 17, 24], "alreadi": [0, 9, 13], "english": [0, 9], "korean": [0, 9], "ocr_onli": [0, 3, 9], "eng": [0, 9], "kor": [0, 9], "By": [0, 6, 8, 9, 10, 17, 20, 21, 22], "default": [0, 3, 5, 6, 8, 9, 10, 17, 20, 21, 42], "result": [0, 2, 9, 10, 14, 15, 17, 19, 20, 22, 24], "output_format": 0, "pass": [0, 2, 6, 9, 14, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "include_page_break": [0, 9], "includ": [0, 3, 5, 6, 7, 9, 10, 14, 15, 17, 19, 20, 21, 22, 24], "pagebreak": [0, 9, 20, 21], "four": 0, "fast": [0, 2, 3, 9, 10, 12, 16, 17], "work": [0, 6, 8, 9, 17], "do": [0, 5, 6, 9, 10, 14, 17], "have": [0, 9, 10, 12, 13, 14, 17, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "embed": [0, 4, 12, 19], "On": [0, 17], "hand": [0, 20, 23], "better": 0, "choic": [0, 10, 14], "within": [0, 2, 9, 15, 20, 22, 24, 47], "achiev": [0, 20, 22], "greater": 0, "precis": 0, "Be": 0, "awar": 0, "take": [0, 6, 8, 9, 10, 17, 19, 20, 22], "20": 0, "time": [0, 2, 6, 8, 20, 22], "longer": 0, "compar": 0, "option": [0, 3, 6, 8, 9, 10, 12, 13, 14, 19, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "make": [0, 9, 12, 15, 16, 19, 20, 21, 22], "The": [0, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24], "run": [0, 2, 7, 9, 10, 14, 16, 17, 19, 20, 21, 23], "through": [0, 9, 14, 15, 24], "ha": [0, 6, 9, 10, 19, 24], "difficulti": [0, 9], "order": [0, 9, 14, 17, 24], "multipl": [0, 9, 24], "column": [0, 9, 10, 13, 19], "recommend": [0, 3, 9, 10, 17, 20, 21, 22], "pleas": [0, 18], "fall": [0, 2, 9, 10, 20, 21], "back": [0, 2, 9, 10, 20, 21], "anoth": [0, 5, 6, 9, 17, 24], "best": [0, 6, 15], "world": [0, 20, 22], "determin": [0, 9, 14, 16, 20, 21], "mode": [0, 9, 16], "otherwis": [0, 9, 17], "argument": [0, 2, 9, 10], "hi_res_model_nam": 0, 
"shown": [0, 9, 12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "code": [0, 6, 7, 8, 9, 10, 20, 21], "block": [0, 7], "doe": [0, 5, 9, 10, 17, 19], "structur": [0, 3, 9, 10, 12, 13, 15, 20, 21, 22, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "ensur": [0, 2, 9, 20, 21, 22, 23], "pdf_infer_table_structur": [0, 9], "fals": [0, 6, 9, 17, 24, 41, 42, 47, 49], "becaus": [0, 9, 17, 24], "computation": 0, "expens": [0, 20, 21], "we": [0, 3, 6, 9, 10, 11, 14, 15, 16, 20, 21, 22, 25], "enabl": [0, 9, 20, 21, 22, 24], "disabl": [0, 9], "than": [0, 2, 17], "skip_infer_table_typ": 0, "want": [0, 9, 10, 19, 20, 21], "skip": [0, 14], "excel": [0, 6, 9], "which": [0, 2, 3, 5, 7, 9, 10, 14, 16, 19, 20, 22, 24], "jpg": [0, 3, 9, 14, 17], "xl": [0, 9], "don": [0, 9, 11, 25], "t": [0, 6, 9, 11, 14, 16], "empti": [0, 9, 10, 19], "xml_keep_tag": [0, 9], "retain": [0, 20, 22], "simpli": [0, 10], "self": [0, 9], "strongli": 0, "suggest": 0, "so": [0, 9, 10, 20, 21], "contain": [0, 6, 8, 9, 12, 13, 17, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "follow": [0, 2, 3, 4, 5, 7, 9, 10, 14, 17, 18, 19, 20, 21, 22, 24], "intend": [0, 1], "help": [0, 6, 9, 10, 12, 13, 14, 15, 17, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "up": [0, 6, 8, 10, 22, 23], "interact": 0, "machin": [0, 6, 8, 15, 16, 17, 19], "multi": [0, 16], "platform": [0, 7, 11, 12, 13, 15, 16, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "built": 0, "both": [0, 9, 16, 20, 23], "x86_64": [0, 16], "appl": [0, 16], "silicon": [0, 16], "hardwar": [0, 16], "pull": [0, 3, 14], "should": [0, 2, 4, 6, 10, 14, 16, 20, 21, 24], "download": [0, 2, 13, 14, 16, 28, 32, 48], "correspond": [0, 4, 9, 24], "architectur": [0, 16], "e": [0, 7, 9, 10, 16, 17, 
24, 47], "g": [0, 9, 16, 24, 37, 47], "linux": [0, 16], "amd64": [0, 16], "push": [0, 16], "main": [0, 5, 9, 10, 16, 17, 35], "each": [0, 7, 8, 9, 10, 19, 20, 22, 24], "short": [0, 16, 24], "commit": [0, 16], "hash": [0, 16, 20, 21], "fbc7a69": [0, 16], "0": [0, 5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 20, 21, 24, 36], "5": [0, 9, 10, 16, 20, 21], "dev1": [0, 16], "most": [0, 9, 10, 16, 19, 20, 21, 22, 23, 42], "recent": [0, 16], "latest": [0, 16], "leverag": [0, 2, 3], "repositori": [0, 16], "quai": 0, "onc": [0, 10, 16, 20, 22], "launch": [0, 20, 23], "web": 0, "app": [0, 42, 47], "localhost": [0, 9, 10, 14, 34], "8000": 0, "p": [0, 6], "d": [0, 6, 8, 9, 20, 21, 22, 24], "rm": 0, "name": [0, 2, 7, 8, 10, 12, 14, 16, 17, 24, 27, 34, 42, 44], "port": 0, "ll": [0, 9, 12, 13, 14, 17, 20, 21, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "fork": 0, "sart": 0, "one": [0, 5, 10], "A": [0, 6, 8, 9, 10, 20, 22, 24], "jupyt": 0, "notebook": [0, 19], "server": [0, 9, 24], "guid": [0, 2, 12, 13, 16, 20, 21, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "mani": [0, 6, 8, 20, 22], "o": [0, 2, 7, 9, 10, 12, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "depend": [0, 2, 9, 12, 13, 14, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "requir": [0, 16, 17, 18, 19, 20, 21, 22, 23], "abil": [0, 9], "desir": 0, "hit": [0, 14, 17], "directori": [0, 10, 14, 17, 20, 21, 24], "sever": [0, 4], "unstructur": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "offer": [1, 3, 19, 21, 22], "few": [1, 9, 10, 17, 20, 23], "strategi": [1, 2, 9, 10, 12, 24], "model": [1, 6, 7, 8, 9, 10, 17, 19, 21, 23, 24], "extract": [1, 2, 3, 4, 9, 15, 19, 20, 21], "These": [1, 3, 9, 
16, 20, 21, 22, 24], "guidelin": 1, "configur": 1, "optim": [1, 2, 15, 19, 20, 22], "high": [1, 10, 20, 22], "level": [1, 6, 8, 17, 24], "librari": [1, 2, 3, 4, 6, 9, 14, 16, 17, 18, 19, 20, 21, 23], "ocr": [2, 3, 9, 17, 20, 21], "base": [2, 3, 7, 9, 10, 17, 19, 20, 21, 24], "transform": [2, 6, 8, 10, 17, 19, 20, 23], "detect": [2, 5, 6, 7, 8, 9, 20, 21, 22, 24], "complex": 2, "predict": [2, 10, 14, 19], "type": [2, 3, 4, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "basic": [2, 3, 17, 20, 21, 23], "usag": [2, 3, 15, 17, 19, 20, 21], "filenam": [2, 3, 9, 10, 12, 14, 16, 17, 20, 21, 24, 26, 30, 34, 39, 43], "hi_r": [2, 3, 9], "model_nam": [2, 10], "chipper": 2, "defin": [2, 7, 19], "infer": [2, 3, 9, 20, 21, 24], "detectron2_onnx": 2, "comput": [2, 17, 20, 22, 24], "vision": 2, "facebook": 2, "ai": [2, 20, 22], "object": [2, 8, 9, 10, 12, 14, 19, 20, 21], "segment": [2, 9, 20, 22], "algorithm": [2, 20, 22], "onnx": 2, "runtim": 2, "fastest": 2, "yolox": 2, "singl": [2, 9, 12, 16, 17, 20, 22], "stage": [2, 4, 14, 15, 19, 20, 21], "real": [2, 10, 14, 20, 22], "detector": 2, "modifi": [2, 10, 24], "yolov3": 2, "darknet53": 2, "backbon": 2, "yolox_quant": 2, "faster": [2, 9], "its": [2, 5, 24], "speed": [2, 20, 22], "closer": 2, "detectron2": [2, 3, 9, 17], "beta": 2, "version": [2, 9, 12, 16, 24], "hous": 2, "imag": [2, 3, 5, 9, 17, 20, 21, 24], "visual": [2, 6, 8, 9], "understand": [2, 4, 20, 21, 22], "vdu": 2, "unstructured_hi_res_model_nam": 2, "environ": [2, 7, 17, 19], "variabl": [2, 7, 17], "There": [2, 4, 9, 10, 20, 22], "three": [2, 6, 8, 9], "store": [2, 9, 10, 12, 13, 14, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "pdf": [2, 3, 9, 10, 12, 14, 15, 16, 17, 20, 21, 23, 24, 28, 45], "partition_pdf": [2, 3, 10, 16, 17, 20, 21], "out_yolox": 2, "unstructured_infer": [2, 17], "get_model": 2, 
"documentlayout": 2, "from_fil": 2, "detection_model": 2, "util": [2, 10, 14, 20, 21], "zoo": 2, "In": [2, 6, 9, 10, 14, 20, 21, 23], "layoutpars": [2, 17], "variou": 2, "pre": [2, 4, 9, 10, 14, 19], "train": [2, 9, 20, 21, 22], "analysi": [2, 9, 19, 20, 22], "featur": [2, 9, 14], "unstructureddetectronmodel": 2, "class": [2, 7, 10, 14, 24], "faster_rcnn_r_50_fpn_3x": 2, "pretrain": [2, 20, 23], "doclaynet": 2, "But": 2, "differ": [2, 3, 4, 9, 10, 20, 21, 22], "construct": 2, "paramet": [2, 3, 6, 8, 9, 10, 19], "light": 2, "wrapper": 2, "around": [2, 10], "detectron2layoutmodel": 2, "same": [2, 9, 10, 14, 20, 21, 24], "seamlessli": 2, "integr": [2, 7, 15, 20, 22, 23], "custom": [2, 6, 9, 20, 22, 23], "wrap": 2, "unstructuredobjectdetectionmodel": 2, "act": 2, "intermediari": 2, "between": [2, 5, 6, 8, 9], "workflow": [2, 9, 10, 14, 15, 20, 21, 22, 23], "subclass": [2, 7], "incorpor": 2, "two": [2, 6, 8, 9, 10, 20, 21, 24], "vital": 2, "method": [2, 6, 7, 9, 14, 20, 21, 24], "design": [2, 15, 19, 20, 22, 23], "pil": [2, 17], "return": [2, 6, 7, 8, 9, 10, 14, 19, 24], "layoutel": 2, "facilit": [2, 20, 22], "commun": [2, 11, 12, 13, 25], "initi": [2, 10], "essenti": [2, 9, 20, 21], "load": [2, 19], "prep": 2, "guarante": [2, 20, 21], "readi": [2, 10, 14, 19, 20, 21], "incom": 2, "task": [2, 6, 10, 14, 15, 19, 20, 22, 23], "output": [2, 6, 8, 9, 10, 12, 13, 14, 15, 19, 20, 21, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "specif": [2, 9, 17, 20, 22, 23, 24], "smoothli": 2, "perform": 2, "varieti": [3, 19, 20, 22, 24], "preprocess": [3, 15, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "characterist": [3, 9], "tradit": [3, 20, 23], "nlp": [3, 6, 8, 10, 20, 22], "techniqu": [3, 20, 22], "quickli": [3, 10, 20, 22], "good": [3, 14, 20, 22], "file": [3, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "identifi": [3, 9, 20, 22], "us": [3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17, 19, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "advantag": [3, 9], "gain": [3, 6, 9], "addit": [3, 9, 10, 14, 17, 20, 21], "about": [3, 6, 7, 8, 9, 10, 12, 13, 15, 19, 20, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "case": [3, 4, 5, 6, 7, 9, 21, 42, 46], "highli": [3, 9, 20, 21, 22], "sensit": [3, 9], "correct": [3, 9], "classif": [3, 9, 10, 14, 20, 22], "optic": 3, "charact": [3, 5, 6, 8, 9, 20, 22], "recognit": [3, 10, 20, 21], "brick": [3, 6, 8, 9, 10, 15, 19], "tabl": [3, 5, 9, 11, 17, 24, 25], "support": [3, 9, 10, 12, 13, 14, 16, 17, 19, 20, 21, 23, 24], "partition_imag": 3, "ye": [3, 9, 10], "encod": [3, 6, 7, 9], "page": [3, 5, 9, 20, 21, 24, 41, 49], "break": [3, 5, 6, 8, 9, 20, 22], "languag": [3, 6, 8, 9, 19, 24], "max": [3, 9], "live": 4, "primari": [4, 9, 20, 21, 24], "public": [4, 37], "api": [4, 7, 9, 10, 14, 15, 19, 20, 21, 23], "clean": [4, 8, 15, 20, 22], "chunk": [4, 7, 10, 19, 24], "after": [4, 5, 8, 10, 14, 16, 17, 19, 20, 21, 22], "read": [4, 9, 20, 21, 22], "how": [4, 5, 6, 7, 9, 10, 14, 15, 16, 17, 19, 20, 21, 22, 24], "remov": [4, 6, 8, 20, 22, 24], "unwant": [4, 6], "prepar": [4, 6, 10, 19], "downstream": [4, 6, 10, 15, 19, 20, 22, 23, 24], "retriev": [4, 5, 7, 23], "augment": [4, 5, 7, 23], "rag": [4, 5, 7, 20, 22, 23], "metadata": [5, 7, 9, 10, 12, 15, 20, 21, 26, 30, 34, 39, 43], "split": [5, 6, 8, 10, 17, 19, 20, 21], "subsect": 5, "combin": [5, 9, 19], "look": [5, 6, 8, 9, 10, 14, 20, 21, 24], "presenc": 5, "titl": [5, 9, 10, 19, 20, 21, 24, 49], "new": [5, 6, 9, 10, 11, 13, 14, 19, 24, 25], "creat": [5, 10, 13, 14, 16, 17, 24], "non": [5, 6, 9], "alwai": 5, "own": [5, 6], "chang": [5, 6, 8, 10, 17, 24], "occur": [5, 8], "number": [5, 7, 8, 9, 10, 24], "come": [5, 10, 14, 19, 20, 21], 
"attach": [5, 9, 10, 24], "instead": [5, 9, 10, 19, 20, 21, 22], "multipage_sect": 5, "allow": [5, 6, 9, 19, 20, 21, 24], "span": 5, "length": [5, 9, 10], "exce": 5, "new_after_n_char": 5, "1500": [5, 9], "possibl": [5, 17], "lenght": 5, "narrativetext": [5, 9, 10, 14, 19, 20, 21, 24], "similarli": 5, "under": [5, 7, 9, 14, 20, 21], "combine_under_n_char": 5, "thei": [5, 6, 9, 14, 20, 21, 22], "threshold": 5, "500": [5, 12], "seri": 5, "sometim": [5, 6], "happen": [5, 6], "listitem": [5, 9, 20, 21], "turn": [5, 9, 24], "off": [5, 9, 10], "behavior": [5, 6, 8, 9, 20, 21], "show": [5, 7, 9, 10, 12, 13, 14, 20, 21], "partition_html": [5, 6], "understandingwar": 5, "org": [5, 9, 17], "background": 5, "russian": [5, 6, 8], "offens": 5, "campaign": [5, 46], "assess": 5, "august": 5, "27": 5, "2023": [5, 9, 20, 23, 48], "print": [5, 6, 7, 9, 10, 12, 13, 20, 21, 24], "n": [5, 6, 8, 9, 20, 21], "80": 5, "As": [6, 9, 10, 20, 21], "part": [6, 10, 19, 20, 21, 22], "common": [6, 8, 9, 17, 42], "prior": [6, 9], "could": [6, 10, 20, 21], "impact": [6, 10], "qualiti": 6, "user": [6, 9, 10, 14, 20, 21, 22, 24, 30, 39, 42, 43, 44], "sanit": 6, "befor": [6, 8, 9, 20, 21, 22], "send": 6, "some": [6, 7, 9, 10, 14, 17, 20, 21, 22, 24], "automat": [6, 9, 20, 21], "philadelphia": [6, 9], "eagles\u00e2": 6, "x80": 6, "x99": 6, "victori": 6, "convert": [6, 9, 10, 14, 19, 22], "eagl": [6, 9], "snippet": [6, 20, 21], "cleaner": [6, 8, 9], "core": [6, 8, 9, 20, 21, 22], "without": [6, 9, 20, 21], "instanti": 6, "expect": [6, 10], "callabl": 6, "string": [6, 7, 8, 9, 10, 12, 24], "produc": [6, 9, 10], "invok": [6, 9], "sinc": 6, "just": [6, 19], "str": [6, 9, 10, 19, 20, 21], "easili": [6, 20, 23], "citat": 6, "re": [6, 8, 9, 12, 13, 14, 17, 20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "remove_cit": 6, "lambda": 6, "sub": [6, 8, 24], "1": [6, 7, 8, 9, 14, 17, 24], "3": [6, 8, 14, 20, 21, 24], "geoloc": 6, "combat": 6, 
"footag": 6, "confirm": [6, 20, 21], "dvorichn": 6, "area": [6, 10], "northwest": 6, "svatov": 6, "like": [6, 9, 10, 14, 16, 20, 21, 22, 24], "byte": [6, 14], "emoji": 6, "isn": 6, "hello": [6, 10], "\u00f0": 6, "x9f": 6, "x98": 6, "charset": 6, "sourc": [6, 8, 9, 10, 15, 19, 20, 21, 22, 23, 24, 49], "bullet": [6, 8, 24], "extra": [6, 9, 12, 13, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "whitespac": [6, 8], "dash": 6, "trail": [6, 8], "punctuat": [6, 8], "lowercas": 6, "extra_whitespac": 6, "trailing_punctu": 6, "item": [6, 24, 43], "1a": 6, "risk": [6, 10, 14, 19], "factor": [6, 20, 22], "begin": [6, 8, 9, 10], "appear": [6, 9, 10, 20, 21, 24], "love": [6, 8], "mors": 6, "handl": [6, 20, 21, 23], "special": [6, 10], "u2013": 6, "xa0": 6, "newlin": 6, "ascii": [6, 8], "x88thi": 6, "containsnon": 6, "alphanumer": [6, 8], "veri": [6, 8, 10], "b": 6, "postfix": 6, "match": [6, 12], "pattern": [6, 8, 10, 15, 17, 20, 21], "ignor": 6, "ignore_cas": 6, "strip": [6, 8], "end": [6, 8, 9, 10, 24, 48], "stop": [6, 8], "prefix": [6, 10], "lead": [6, 8, 46], "summari": [6, 14], "descript": [6, 10, 12, 24], "group": [6, 7, 8, 9], "togeth": [6, 8, 9], "paragraph": [6, 8, 9], "broken": [6, 8, 9, 20, 22], "line": [6, 8, 9, 16, 17, 24], "format": [6, 8, 9, 10, 14, 19, 20, 23], "purpos": [6, 8, 9], "line_split": [6, 8], "consid": [6, 8, 10, 17], "paragraph_split": [6, 8], "big": [6, 8, 9], "brown": [6, 8, 9], "fox": [6, 8, 9, 10], "wa": [6, 8, 9, 17, 24], "walk": [6, 8, 9], "down": [6, 8, 9, 10, 20, 21, 22], "lane": [6, 8, 9], "At": [6, 8, 9, 14, 17, 20, 22, 24], "met": [6, 8, 9], "bear": [6, 8, 9, 20, 22], "para_split_r": [6, 8], "compil": [6, 8, 17], "unicod": [6, 8], "quot": [6, 8], "replac": [6, 8], "x91": [6, 8], "replace_unicode_charact": [6, 8], "x93a": [6, 8], "x94": [6, 8], "x91a": [6, 8], "x92": [6, 8], "translat": [6, 8], "helsinki": [6, 8], "mt": [6, 8], "chines": [6, 8], "arab": [6, 8], "other": [6, 8, 15, 
16, 20, 21, 24], "source_lang": [6, 8], "letter": [6, 8], "langdetect": [6, 8], "target_lang": [6, 8], "target": [6, 8, 9], "en": [6, 8, 17], "m": [6, 8, 12, 20, 21], "berlin": [6, 8], "ich": [6, 8], "bin": [6, 8], "ein": [6, 8], "\u044f": [6, 8], "\u0442\u043e\u0436\u0435": [6, 8], "\u043c\u043e\u0436\u043d\u043e": [6, 8], "\u043f\u0435\u0440\u0435\u0432\u043e\u0434\u0430\u0442\u044c": [6, 8], "\u0440\u0443\u0441\u0441\u043a\u0438\u0439": [6, 8], "\u044f\u0437\u044b\u043a": [6, 8], "ru": [6, 8], "obtain": [7, 19], "abstract": 7, "implement": [7, 20, 22], "embeddingencod": 7, "langchain": [7, 20, 22], "openai": [7, 20, 22], "hood": [7, 9], "connect": [7, 11, 15, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "piec": [7, 20, 22], "embed_docu": 7, "receiv": [7, 8, 16], "updat": [7, 24], "attribut": [7, 9, 10, 14, 20, 21], "embed_queri": 7, "queri": [7, 10, 20, 22, 34, 44], "float": 7, "vector": [7, 10, 19], "given": [7, 20, 22, 24], "num_of_dimens": 7, "properti": [7, 24], "denot": 7, "dimens": [7, 12], "via": [7, 20, 22], "is_unit_vector": 7, "unit": [7, 20, 22], "openai_api_kei": 7, "abl": [7, 20, 21, 22], "visit": 7, "com": [7, 8, 9, 10, 17, 36, 42, 47], "account": [7, 10, 14, 27, 38, 46], "emb": 7, "embedding_encod": 7, "api_kei": [7, 9, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "sentenc": [7, 20, 21, 22], "2": [7, 8, 10, 12, 13, 14, 17, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "query_embed": 7, "date": [8, 20, 22, 24, 48], "timezon": 8, "datetim": 8, "abc": 8, "def": [8, 10], "local": [8, 9, 10, 14, 15, 19, 20, 21, 25], "ba23": 8, "58b5": 8, "2236": 8, "45g2": 8, "88h2": 8, "local2": 8, "25": 8, "mapi": 8, "id": [8, 12, 14, 24, 38, 41, 42, 43, 44, 47], "32": 8, "88": 8, "5467": 8, "123": 8, "fri": 8, "26": 8, "mar": 8, "2021": [8, 20, 22], "11": [8, 10], "04": [8, 14, 48], "09": 8, "1200": 8, 
"4": [8, 10, 12, 14], "9": [8, 10, 17], "tzinfo": 8, "timedelta": 8, "second": [8, 9, 10], "43200": 8, "email": [8, 9, 20, 21, 30, 39, 42, 43], "address": [8, 20, 21, 24], "me": 8, "10": [8, 9, 10, 24, 44], "01": [8, 9], "ipv4": 8, "ipv6": 8, "ip": 8, "none": [8, 9, 17, 32], "index": [8, 9, 19, 20, 22, 24, 34], "th": [8, 20, 21], "occurr": 8, "speaker": [8, 24], "fly": 8, "am": 8, "phone": 8, "215": 8, "867": 8, "5309": 8, "raw": [9, 20, 21, 22], "decid": 9, "keep": [9, 12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "particular": [9, 20, 21], "summar": [9, 20, 21], "interest": [9, 24], "call": [9, 14, 20, 21, 22], "libmag": [9, 17, 20, 21], "appropri": [9, 10, 12, 13, 16, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "where": [9, 10, 20, 21, 23, 24], "filetyp": [9, 12, 14, 20, 21, 24], "extens": [9, 20, 21, 22, 23], "rout": [9, 20, 21], "know": [9, 11, 25], "directli": [9, 10, 16, 19], "mail": [9, 24], "partition_eml": 9, "No": 9, "markdown": 9, "offic": [9, 10, 20, 21], "plain": [9, 20, 21], "grouper": 9, "restructur": 9, "rich": 9, "word": [9, 10, 20, 22], "xml": [9, 14, 15, 20, 21], "tag": [9, 16, 24], "addition": [9, 17, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "bypass": 9, "logic": 9, "content_typ": 9, "either": [9, 14], "join": [9, 10], "example_docs_directori": 9, "el": [9, 20, 21], "15": [9, 10], "reason": 9, "unnecessari": [9, 20, 22], "program": 9, "fewer": 9, "least": [9, 10, 20, 21], "denomin": 9, "certain": [9, 16], "learn": [9, 14, 15, 19], "www": 9, "cnn": 9, "30": [9, 10], "sport": 9, "empir": 9, "state": [9, 10, 20, 22], "build": [9, 17, 20, 22], "green": 9, "spt": 9, "intl": 9, "simplest": [9, 19, 20, 21], "attempt": 9, "control": 9, "accur": [9, 20, 22, 24], "add_paragraph": 9, "style": 9, "head": [9, 20, 23], "my": [9, 10, 17, 24], "first": [9, 10, 12, 13, 14, 17, 20, 22, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "thought": 9, "bodi": 9, "normal": 9, "save": [9, 10, 20, 21], "mydoc": 9, "remot": [9, 10, 12, 24, 27, 29, 33, 37, 45], "forc": 9, "treat": 9, "mime": [9, 17], "conjunct": [9, 17], "ssl_verifi": 9, "whether": 9, "ssl": 9, "verif": 9, "githubusercont": 9, "licens": 9, "text_as_html": [9, 12, 20, 21, 24], "represent": [9, 20, 21, 22, 24], "stanlei": 9, "cup": 9, "microsoft": [9, 17, 47], "partiton_doc": 9, "libreoffic": [9, 20, 21], "footer": [9, 20, 21, 24], "per": [9, 10], "msft": 9, "header_footer_typ": [9, 12, 24], "indic": [9, 10, 24], "valid": [9, 10, 19, 24], "first_pag": [9, 24], "even_pag": 9, "present": [9, 14, 17, 20, 21], "insert": 9, "render": 9, "even": [9, 10, 20, 22], "them": [9, 16, 20, 21], "export": 9, "client": [9, 10, 42, 43, 44, 47], "outlook": [9, 17, 24, 25], "gmail": 9, "content_sourc": 9, "respect": [9, 16], "sender": [9, 24], "recipi": [9, 24], "etc": 9, "include_head": 9, "must": [9, 10, 17], "tupl": 9, "max_partit": 9, "maximum": 9, "select": [9, 10, 14, 24], "roughli": 9, "averag": [9, 10, 14], "process_attach": 9, "attachment_partition": 9, "pgp": 9, "encrypt": 9, "emit": 9, "warn": [9, 17], "book": [9, 24], "epub3": 9, "pandoc": [9, 20, 21], "system": [9, 10, 12, 15, 17, 19, 20, 21, 24], "winter": 9, "invoc": 9, "equival": 9, "10k": [9, 20, 21], "illustr": 9, "fetch": 9, "agent": [9, 44], "yourscriptnam": 9, "websit": 9, "articl": 9, "grab": [9, 20, 22], "site": [9, 24, 36, 47], "convent": 9, "activ": [9, 10, 14, 17], "html_assemble_articl": 9, "deu": 9, "german": 9, "pack": 9, "pars": [9, 14, 16, 17, 42], "swedish": 9, "swe": 9, "infer_table_structur": [9, 20, 21], "recoomend": 9, "readm": 9, "similar": [9, 10, 14, 20, 22], "rest": [9, 20, 23], "narr": [9, 10, 14, 20, 21], "contextlib": 9, "exitstack": 9, "stack": [9, 19], "enter_context": 9, "metadata_filenam": 9, "execut": [9, 12, 13, 20, 21], "token": [9, 10, 14, 17, 19, 26, 30, 32, 33, 39, 
43, 48], "authent": 9, "pdfminer": 9, "copi": 9, "protect": 9, "cannot": 9, "fail": [9, 12, 13], "issu": [9, 17, 20, 22, 24], "powerpoint": 9, "paragraph_group": 9, "group_broken_paragraph": 9, "yourself": 9, "explicitli": 9, "my_api_kei": 9, "messag": [9, 24], "rfc822": 9, "ad": [9, 10, 11, 14, 25, 42], "da": 9, "1p": 9, "api_url": 9, "5000": 9, "sheet": [9, 24], "xml_path": 9, "conjunt": 9, "restrict": 9, "factbook": 9, "packag": [10, 15, 16, 17, 24], "ingest": [10, 12, 13, 15, 19, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "dictionari": [10, 14, 19, 24], "labelstudio": 10, "upload": [10, 12, 13, 14, 19], "label": 10, "label_studio": [10, 14], "narrative_text": 10, "dump": [10, 14], "indent": [10, 14, 24], "isd": 10, "isd_csv": 10, "panda": [10, 14], "datafram": [10, 14, 19], "df": 10, "repres": [10, 20, 21, 22, 24], "prodigi": 10, "write": [10, 13, 15, 19], "prodigy_csv_data": 10, "w": [10, 14], "csv_file": 10, "argilla": 10, "dataset": [10, 19, 20, 21, 22, 23], "argilla_task": [10, 19], "text_classif": [10, 19], "token_classif": [10, 19], "text2text": [10, 19], "nltk": 10, "argilla_dataset": 10, "basepl": 10, "llm": [10, 19], "backend": [10, 19], "spreadsheet": [10, 19], "interfac": [10, 14, 19, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "elementmetadata": [10, 24], "wonder": 10, "stori": 10, "ran": 10, "chicken": 10, "coop": 10, "flew": 10, "row": [10, 19], "element_id": [10, 12, 20, 21], "ad270eefd1cc68d15f4d3e51666d4dc8": 10, "8275769fdd1804f9f2b55ad3c9b0ef1b": 10, "datasaur": 10, "text1": 10, "text2": 10, "datasaur_data": 10, "entiti": [10, 12, 19], "hi": [10, 24], "matt": 10, "start_idx": 10, "end_idx": 10, "labelbox": 10, "cloud": [10, 19, 25], "output_directori": [10, 19], "storag": [10, 11, 12, 13, 15, 19, 20, 22, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "servic": [10, 15, 19, 38], 
"config": [10, 12, 19], "dict": [10, 19], "written": [10, 12, 19], "s3": [10, 12, 13, 17, 19, 24, 25, 31], "aw": 10, "sync": 10, "url_prefix": 10, "demonstr": 10, "bucket": [10, 19], "label_box": 10, "s3_bucket_nam": 10, "s3_bucket_key_prefix": 10, "access": [10, 15, 20, 22, 24, 26, 43], "s3_url_prefix": 10, "amazonaw": 10, "local_output_directori": 10, "tmp": 10, "labelbox_config": 10, "external_id": 10, "id1": 10, "id2": 10, "raw_text": 10, "create_directori": 10, "labelbox_config_fil": 10, "upload_staged_fil": 10, "s3f": 10, "s3filesystem": 10, "listdir": 10, "filepath": 10, "upload_kei": 10, "put_fil": 10, "lpath": 10, "rpath": 10, "folder": [10, 19, 38, 43], "project": [10, 14, 19, 20, 22], "label_studio_data": [10, 14], "text_field": [10, 14], "my_text": 10, "id_field": [10, 14], "my_id": 10, "annot": [10, 14, 19], "labelstudioannot": [10, 14], "labelstudioresult": [10, 14], "posit": [10, 14], "from_nam": [10, 14], "sentiment": [10, 19], "to_nam": [10, 14], "score": [10, 14], "labelstudiopredict": [10, 14], "68": [10, 14], "misc": 10, "prodigy_data": 10, "jsonl": [10, 19], "feed": 10, "loader": [10, 19], "save_as_jsonl": 10, "fit": [10, 19, 20, 23], "attent": [10, 19], "window": [10, 12, 19], "autotoken": 10, "automodelfortokenclassif": 10, "huggingfac": 10, "hf": 10, "intern": 10, "test": [10, 12, 16, 29, 30, 37, 39], "tini": 10, "bert": 10, "from_pretrain": 10, "ner": 10, "frost": 10, "advisori": 10, "morn": 10, "strong": 10, "cold": 10, "front": 10, "later": [10, 20, 21], "week": 10, "chanc": 10, "refresh": 10, "crisp": 10, "air": 10, "pronounc": 10, "goe": 10, "were": 10, "place": [10, 19, 24], "across": [10, 20, 22], "portion": 10, "appalachian": 10, "coastal": 10, "temperatur": 10, "drop": 10, "40": 10, "far": 10, "south": 10, "florida": 10, "panhandl": 10, "And": [10, 20, 21, 23], "had": 10, "report": [10, 20, 21], "snow": 10, "season": 10, "sundai": 10, "citi": 10, "moder": 10, "dure": 10, "next": [10, 14, 20, 22], "much": 10, "east": [10, 13, 31], 
"stai": 10, "right": [10, 20, 21], "norm": 10, "blast": 10, "potenti": 10, "hazard": 10, "condit": 10, "weather": 10, "evolv": 10, "continu": [10, 24], "weekend": 10, "coupl": 10, "move": 10, "central": 10, "eastern": 10, "center": 10, "said": 10, "potent": 10, "canada": 10, "punch": 10, "chilli": 10, "heavi": 10, "rain": 10, "wind": 10, "slight": 10, "excess": 10, "rainfal": 10, "northeast": 10, "england": 10, "thursdai": 10, "york": 10, "buffalo": 10, "burlington": 10, "out": [10, 20, 21, 22], "flash": 10, "flood": 10, "confid": [10, 20, 22], "grow": 10, "region": [10, 20, 21], "experi": 10, "gusti": 10, "period": [10, 20, 22, 32], "along": [10, 14], "ahead": 10, "passag": 10, "nation": 10, "wrote": 10, "accompani": 10, "bring": 10, "inch": 10, "isol": 10, "locat": [10, 20, 22], "ensembl": 10, "forecast": 10, "median": 10, "total": 10, "wednesdai": 10, "night": 10, "half": 10, "spot": 10, "substanti": 10, "grand": 10, "rapid": 10, "enough": [10, 20, 22], "mix": 10, "fridai": 10, "especi": [10, 20, 22], "higher": 10, "terrain": 10, "north": 10, "toward": 10, "cadillac": 10, "mph": 10, "caus": 10, "tree": 10, "limb": 10, "sporad": 10, "outag": 10, "behind": 10, "coast": 10, "degre": 10, "workweek": 10, "go": [10, 14], "50": 10, "great": 10, "lake": [10, 13, 31], "explain": 10, "reinforc": 10, "shot": 10, "countri": 10, "keyword": 10, "buffer": [10, 19], "leav": [10, 20, 22], "cl": 10, "sequenc": 10, "max_input_s": 10, "size": [10, 14, 20, 22, 23], "model_max_length": 10, "split_funct": [10, 19], "space": 10, "chunk_separ": [10, 19], "concat": 10, "adjac": 10, "reconstruct": 10, "oper": [10, 19], "chunk_by_attention_window": [10, 19], "helper": [10, 19], "weaviat": [10, 20, 22], "databas": [10, 19, 21, 41], "create_unstructured_weaviate_class": 10, "class_nam": 10, "unstructured_class": 10, "unstructureddocu": 10, "8080": [10, 14], "batch": [10, 11, 12, 13, 15, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 
"been": 10, "generate_uuid5": 10, "data_object": 10, "batch_siz": 10, "tqdm": 10, "add_data_object": 10, "unstructured_class_nam": 10, "uuid": [10, 20, 21], "favorit": [11, 15, 19, 25], "effortless": [11, 15, 25], "constantli": [11, 25], "let": [11, 25], "u": [11, 13, 20, 21, 25, 31, 44], "slack": [11, 17, 25], "delta": [11, 17, 25], "azur": [11, 17, 24, 25, 42], "cognit": [11, 17], "search": [11, 17, 20, 22, 44], "record": [12, 13, 31], "filesystem": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "those": [12, 13, 20, 21], "pip": [12, 13, 17, 19, 20, 21, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "upstream": [12, 13, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "connector": [12, 13, 15, 17, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "ones": [12, 13, 20, 22], "conveni": [12, 13], "command": [12, 13, 14, 17], "utic": [12, 13, 29, 31, 37, 45], "dev": [12, 13, 17, 20, 21, 31, 45], "tech": [12, 13, 19, 31, 45], "fixtur": [12, 13, 29, 31, 37, 45], "small": [12, 20, 23, 45], "anonym": [12, 45], "dir": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "num": [12, 26, 27, 28, 29, 30, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49], "verbos": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "azure_search_api_kei": 12, "endpoint": [12, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "azure_search_endpoint": 12, "subprocess": [12, 13], "getenv": [12, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "popen": [12, 13], "stdout": [12, 13], "pipe": [12, 13], "error": [12, 13, 17, 20, 21, 22], "returncod": [12, 13], 
"successfulli": [12, 13], "els": [12, 13], "cli": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "mind": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "sure": [12, 20, 21], "being": [12, 24], "odata": 12, "context": [12, 20, 22], "net": [12, 30, 39], "etag": 12, "0x8dbb93e09c8f4bd": 12, "edm": 12, "collect": [12, 20, 22], "400": 12, "vectorsearchconfigur": 12, "complextyp": 12, "category_depth": [12, 24], "int32": 12, "parent_id": [12, 24], "attached_to_filenam": [12, 24], "last_modifi": [12, 24], "datetimeoffset": 12, "file_directori": [12, 24, 26, 30, 34, 39, 43], "data_sourc": [12, 26, 30, 34, 39, 43], "date_cr": 12, "date_modifi": 12, "date_process": [12, 26, 30, 34, 39, 43], "record_loc": 12, "coordin": 12, "layout_width": 12, "doubl": 12, "layout_height": 12, "page_numb": [12, 24], "link_url": [12, 24], "link_text": [12, 24], "sent_from": [12, 24], "sent_to": [12, 24], "subject": [12, 24], "emphasized_text_cont": [12, 24], "emphasized_text_tag": [12, 24], "regex_metadata": [12, 24], "detection_class_prob": [12, 24], "vectorsearch": 12, "algorithmconfigur": 12, "kind": 12, "hnsw": 12, "hnswparamet": 12, "metric": 12, "cosin": 12, "efconstruct": 12, "efsearch": 12, "rais": 13, "exist": [13, 14, 20, 22, 24], "uri": [13, 31], "deltat": [13, 31], "storage_opt": [13, 31], "aws_region": [13, 31], "aws_access_key_id": [13, 31], "aws_secret_access_kei": [13, 31], "json_data": 13, "dest": 13, "preserv": [13, 20, 22, 28, 32], "too": 14, "larg": [14, 21, 23], "repo": [14, 17, 20, 21], "sec": [14, 19], "assum": 14, "dummi": 14, "info": [14, 17], "edgar": 14, "stage_for_label_studio": [14, 19], "risk_sect": 14, "prepopul": 14, "ui": 14, "feel": 14, "free": 14, "step": [14, 17, 20, 21, 22], "append": 14, "final": [14, 20, 21], "omit": 14, "did": 14, "studio": 14, "setup": [14, 20, 21], "author": [14, 42], "labelstudio_token": 14, "project_id": 14, 
"to_dict": [14, 24], "exif": 14, "exif_data": 14, "file_util": 14, "get_jpg_metadata": 14, "get_docx_metadata": 14, "get_xlsx_metadata": 14, "tool": [14, 17, 20, 22], "get_directory_file_info": 14, "recurs": [14, 29, 33, 37, 38, 40, 41, 42, 43, 46, 47], "subdirectori": [14, 17], "file_info": 14, "value_count": 14, "dtype": 14, "int64": 14, "groupbi": 14, "mean": [14, 20, 22, 23], "files": 14, "660200e": 14, "490885e": 14, "05": 14, "228404e": 14, "06": 14, "276400e": 14, "429245e": 14, "832900e": 14, "6": 14, "113333e": 14, "02": [14, 48], "765000e": 14, "03": 14, "7": [14, 36], "135000e": 14, "advanc": [15, 20, 22, 23], "destin": 15, "track": [15, 24], "easi": [15, 19, 20, 22], "popular": 15, "ml": [15, 19], "practic": [15, 20, 22], "haven": 16, "howev": [16, 20, 22], "flag": [16, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "everi": [16, 17], "branch": [16, 17, 35, 36], "dt": 16, "bash": 16, "exec": 16, "plan": 16, "acceler": 16, "exclud": [16, 26, 30, 34, 39, 43], "dockerfil": 16, "necessari": [16, 17], "python3": 16, "partition_text": [16, 20, 21, 24], "complet": 17, "cater": [17, 20, 22], "beyond": 17, "airtabl": [17, 24, 25, 41, 42, 43, 49], "biom": [17, 25], "confluenc": [17, 24, 25], "discord": [17, 24, 25], "dropbox": [17, 24, 25], "elasticsearch": [17, 24, 25], "gc": [17, 24, 37], "github": [17, 25], "gitlab": [17, 25], "googl": [17, 24, 25], "drive": [17, 24, 25], "jira": [17, 24, 25], "notion": [17, 25], "onedr": [17, 24, 42], "reddit": [17, 25], "sharepoint": [17, 24, 25], "salesforc": [17, 25], "wikipedia": [17, 24, 25], "involv": [17, 20, 22], "anaconda": 17, "stackoverflow": 17, "pycocotool": 17, "env": 17, "yml": 17, "virtual": 17, "virtualenviron": 17, "challeng": 17, "offici": 17, "compat": 17, "altern": [17, 20, 22], "pip3": 17, "git": [17, 35, 36], "philferrier": 17, "cocoapi": 17, "egg": 17, "pythonapi": 17, "outlin": 17, "clone": 17, "ivanpp": 17, "cd": 17, "iopath": 17, 
"facebookresearch": 17, "Then": 17, "753": 17, "file_io": 17, "py": 17, "parsed_url": 17, "navig": 17, "verifi": [17, 20, 22], "root": 17, "esri": 17, "detectron2_tool": 17, "modul": 17, "conflict": 17, "describ": 17, "kmp_duplicate_lib_ok": 17, "prevent": 17, "libiomp5md": 17, "dll": 17, "link": [17, 18, 19, 24, 42], "numpi": 17, "np": 17, "img": 17, "arrai": 17, "lang": 17, "use_gpu": 17, "show_log": 17, "log_level": 17, "debug": 17, "mac": 17, "brew": 17, "One": [17, 20, 21, 22, 25], "debian": 17, "sudo": 17, "apt": 17, "y": [17, 24], "forg": 17, "libxml2": 17, "libxlst": 17, "libxslt": 17, "rust": 17, "properli": 17, "proto": 17, "tlsv1": 17, "ssf": 17, "sh": 17, "rustup": 17, "sentencepiec": 17, "earlier": 17, "while": 17, "remain": 17, "newer": [17, 20, 22], "backward": 17, "might": [17, 20, 21, 22], "deprec": 17, "futur": [17, 20, 21, 24], "releas": 17, "advis": 17, "transit": 17, "lot": [18, 20, 21, 22], "docker": 18, "develop": [19, 20, 22, 23], "framework": 19, "stage_for_argilla": 19, "stage_for_basepl": 19, "stage_for_datasaur": 19, "customis": 19, "stage_for_transform": 19, "window_s": 19, "stage_for_label_box": 19, "With": 19, "incredibli": 19, "matter": [19, 20, 23], "unstructuredfileload": 19, "document_load": 19, "state_of_the_union": 19, "checkout": 19, "gpt": [19, 20, 22], "llama": [19, 20, 22], "split_docu": 19, "separ": 19, "simpl": 19, "pathlib": 19, "llama_index": 19, "download_load": 19, "unstructuredread": 19, "load_data": 19, "10k_file": 19, "llamahub": 19, "convert_to_datafram": 19, "stage_for_prodigi": 19, "stage_csv_for_prodigi": 19, "compon": [19, 20, 21, 23], "emerg": 19, "stage_for_weavi": 19, "aim": [20, 23], "simplifi": [20, 23], "streamlin": [20, 23], "toolkit": [20, 23], "digest": [20, 22, 23], "usabl": [20, 22, 23], "softwar": [20, 23, 49], "scalabl": [20, 23], "quantiti": [20, 23], "enterpris": [20, 23], "hope": [20, 23], "late": [20, 23], "seamless": [20, 23], "classic": [20, 23], "modern": [20, 21, 23], "myriad": [20, 22, 
23], "effici": [20, 22, 23], "regardless": [20, 23], "customiz": [20, 23], "extend": [20, 23], "fine": [20, 22, 23], "tune": [20, 22, 23], "etl": [20, 23], "eager": [20, 23], "dive": [20, 23], "over": [20, 23], "minut": [20, 23], "explor": [20, 23], "natur": [20, 22], "encompass": [20, 22], "broad": [20, 22], "spectrum": [20, 22], "methodologi": [20, 22], "introduc": [20, 22], "fundament": [20, 22], "crucial": [20, 22], "often": [20, 22], "signific": [20, 22], "segreg": [20, 22], "smaller": [20, 22], "manag": [20, 22], "anomali": [20, 22], "fill": [20, 22], "miss": [20, 22], "elimin": [20, 22], "irrelev": [20, 22], "erron": [20, 22], "significantli": [20, 22], "influenc": [20, 22], "outcom": [20, 22], "subsequ": [20, 21, 22], "consist": [20, 22], "divid": [20, 22, 24], "lengthi": [20, 22], "meaning": [20, 22], "textual": [20, 22], "fix": [20, 22], "semant": [20, 22], "cluster": [20, 22], "priorit": [20, 22], "aspect": [20, 21, 22], "manner": [20, 22], "foundat": [20, 22], "groundwork": [20, 22], "proper": [20, 22], "vastli": [20, 22], "improv": [20, 22], "decompos": [20, 22], "analyz": [20, 22], "vast": [20, 22], "amount": [20, 22], "capac": [20, 22], "comprehend": [20, 22], "human": [20, 22], "art": [20, 22], "multitud": [20, 22], "domain": [20, 22], "chatgpt": [20, 22], "anthrop": [20, 22], "claud": [20, 22], "revolution": [20, 22], "landscap": [20, 22], "prowess": [20, 22], "inher": [20, 22], "suffer": [20, 22], "drawback": [20, 22], "major": [20, 22], "static": [20, 22], "frozen": [20, 22], "instanc": [20, 22, 24], "knowledg": [20, 22], "limit": [20, 21, 22], "septemb": [20, 22], "blind": [20, 22], "despit": [20, 22], "respond": [20, 22], "unwarr": [20, 22], "phenomenon": [20, 22], "known": [20, 22], "hallucin": [20, 22], "Such": [20, 22], "detriment": [20, 22], "serv": [20, 22], "critic": [20, 22], "groundbreak": [20, 22], "counteract": [20, 22], "pair": [20, 22], "underli": [20, 22], "transpar": [20, 22], "approach": [20, 22], "claim": [20, 22], "accuraci": 
[20, 22], "trust": [20, 22], "among": [20, 22], "moreov": [20, 22], "cost": [20, 22], "effect": [20, 22], "solut": [20, 22], "financi": [20, 22], "burden": [20, 22], "finetun": [20, 22], "situat": [20, 22], "suffici": [20, 22], "reduct": [20, 22], "resourc": [20, 22], "consumpt": [20, 22], "particularli": [20, 22], "benefici": [20, 22], "organ": [20, 22], "lack": [20, 22], "deploi": [20, 22], "scratch": [20, 22], "acquir": [20, 22], "super": [20, 22], "ve": [20, 21, 22], "artifact": [20, 22], "unneccesari": [20, 22], "found": [20, 22], "consum": [20, 22, 46], "haystack": [20, 22], "funcion": [20, 22], "numer": [20, 22], "coher": [20, 22], "hug": [20, 22], "face": [20, 22], "pinecon": [20, 22], "milvu": [20, 22], "chromadd": [20, 22], "prompt": [20, 22], "blog": [20, 22], "concis": [20, 21], "swiftli": [20, 21], "sdk": [20, 21], "immedi": [20, 21], "vari": [20, 21], "poppler": [20, 21], "opt": [20, 21], "congratul": [20, 21], "success": [20, 21], "cover": [20, 21], "cut": [20, 21], "chase": [20, 21], "goal": [20, 21], "categor": [20, 21], "associ": [20, 21, 24], "cell": [20, 21], "observ": [20, 21], "figurecapt": [20, 21], "uncategorizedtext": [20, 21], "formula": [20, 21], "figur": [20, 21], "notic": [20, 21], "suitabl": [20, 21], "text_typ": [20, 21], "sentence_count": [20, 21], "100": [20, 21, 24], "isinst": [20, 21], "rel": [20, 21, 24], "would": [20, 21], "model1": [20, 21], "publaynet": [20, 21], "38": [20, 21], "scientif": [20, 21], "prima": [20, 21], "scan": [20, 21], "magazin": [20, 21], "newspap": [20, 21], "17": [20, 21], "20th": [20, 21], "centuri": [20, 21], "tablebank": [20, 21], "18": [20, 21], "busi": [20, 21], "hjdataset": [20, 21], "31": [20, 21], "histori": [20, 21], "japanes": [20, 21], "thead": [20, 21], "tr": [20, 21], "td": [20, 21], "convert_to_dict": [20, 21], "seen": [20, 21], "elements_to_json": [20, 21], "elements_from_json": [20, 21], "sha": [20, 21], "256": [20, 21], "determinist": [20, 21], "downsid": [20, 21], "collis": [20, 21], 
"unique_element_id": [20, 21], "conclud": [20, 21], "input_filenam": [20, 21], "output_filenam": [20, 21], "concept": 21, "product": 22, "uniqu": 23, "filter": 24, "last": 24, "xy": 24, "further": 24, "hierarchi": 24, "parent": 24, "resid": 24, "overal": 24, "depth": 24, "partition": 24, "processor": 24, "nativ": 24, "reflect": 24, "h1": 24, "h2": 24, "h3": 24, "probabl": 24, "emphas": 24, "bold": 24, "ital": 24, "origin": 24, "is_continu": 24, "previou": 24, "due": 24, "max_charact": 24, "usual": 24, "corner": 24, "top": 24, "left": 24, "proceed": 24, "counter": 24, "clockwis": 24, "pixel": 24, "increas": 24, "downward": 24, "direct": 24, "typic": 24, "pixelspac": 24, "orient": 24, "width": 24, "height": 24, "convert_coordinates_to_new_system": 24, "in_plac": 24, "alter": 24, "relativecoordinatesystem": 24, "200": 24, "coordinate_system": 24, "850": 24, "1100": 24, "term": 24, "page_nam": 24, "even_onli": 24, "favor": 24, "rfc": 24, "822": 24, "spec": 24, "sent": [24, 43], "ever": 24, "view": 24, "fsspec": 24, "protocol": 24, "channel": [24, 32, 48], "pname": [24, 42], "speak": 24, "person": [26, 43], "airtable_personal_access_token": [26, 43], "reprocess": [26, 43], "partitionconfig": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "readconfig": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "runner": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "__name__": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "__main__": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "read_config": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "partition_config": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "output_dir": [26, 27, 
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "num_process": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "personal_access_token": 26, "partition_by_api": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "unstructured_api_kei": [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "abf": 27, "container1": 27, "azureunstructured1": 27, "remote_url": [27, 29, 33, 37, 45], "account_nam": 27, "oa_pdf": 28, "07": 28, "sbaa031": 28, "073": 28, "pmc7234218": 28, "preserve_download": [28, 32], "box_app_config": 29, "box_app_config_path": 29, "atlassian": [30, 39], "12345678": [30, 32, 39, 48], "abcde1234abde1234abcde1234": [30, 39], "metadata_exclud": [30, 34, 39], "user_email": [30, 39, 43], "api_token": [30, 39], "delta_t": 31, "table_uri": 31, "discord_token": 32, "download_dir": 32, "dropbox_token": 33, "9200": 34, "movi": 34, "ethnic": 34, "director": 34, "plot": 34, "index_nam": 34, "jq_queri": 34, "git_branch": [35, 36], "docsi": 36, "gdrive": 38, "google_dr": 38, "drive_id": 38, "popul": [38, 41], "WITH": 38, "OR": 38, "service_account_kei": 38, "input_path": 40, "comma": 41, "delimit": 41, "page_id": 41, "OF": 41, "database_id": 41, "cred": [42, 43, 47], "secret": [42, 44, 47], "login": 42, "microsoftonlin": 42, "tenant": [42, 43, 47], "tenant_id": 42, "princip": 42, "client_id": [42, 43, 44, 47], "client_cr": [42, 43, 47], "authority_url": 42, "user_pnam": 42, "ms_client_id": 43, "ms_client_cr": 43, "ms_tenant_id": 43, "ms_user_email": 43, "inbox": 43, "outlook_fold": 43, "subreddit": 44, "machinelearn": 44, "fetcher": 44, "subreddit_nam": 44, "client_secret": 44, "user_ag": 44, "search_queri": 44, "num_post": 44, "usernam": 46, "salesforce_usernam": 46, "salesforce_consumer_kei": 46, "privat": 46, "salesforce_private_key_path": 46, "emailmessag": 46, "consumer_kei": 46, 
"private_key_path": 46, "contoso": 47, "admin": 47, "share": 47, "files_onli": 47, "01t01": 48, "00": 48, "08": 48, "start_dat": 48, "end_dat": 48, "page_titl": 49, "auto_suggest": 49}, "objects": {}, "objtypes": {}, "objnames": {}, "titleterms": {"unstructur": [0, 15, 17], "api": [0, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "support": 0, "file": 0, "type": [0, 24], "paramet": 0, "coordin": [0, 24], "encod": 0, "ocr": 0, "languag": [0, 20, 22], "output": 0, "format": 0, "page": 0, "break": 0, "strategi": [0, 3], "beta": 0, "version": [0, 17], "hi_r": 0, "chipper": 0, "model": [0, 2, 20, 22], "tabl": [0, 13, 20, 21, 31], "extract": [0, 8, 14, 24], "pdf": 0, "other": 0, "filetyp": [0, 17], "xml": [0, 17], "tag": 0, "us": [0, 2, 16, 20, 23], "local": [0, 12, 13, 17, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "docker": [0, 16], "imag": [0, 16], "develop": 0, "best": 1, "practic": 1, "non": 2, "default": 2, "bring": 2, "your": [2, 16], "own": [2, 16], "brick": 4, "chunk": [5, 20, 22], "chunk_by_titl": 5, "clean": 6, "bytes_string_to_str": 6, "clean_bullet": 6, "clean_dash": 6, "clean_extra_whitespac": 6, "clean_non_ascii_char": 6, "clean_ordered_bullet": 6, "clean_postfix": 6, "clean_prefix": 6, "clean_trailing_punctu": 6, "group_broken_paragraph": [6, 8], "remove_punctu": [6, 8], "replace_unicode_quot": [6, 8], "translate_text": [6, 8], "embed": [7, 20, 22], "baseembeddingencod": 7, "openaiembeddingencod": 7, "extract_datetimetz": 8, "extract_email_address": 8, "extract_ip_address": 8, "extract_ip_address_nam": 8, "extract_mapi_id": 8, "extract_ordered_bullet": 8, "extract_text_aft": 8, "extract_text_befor": 8, "extract_us_phone_numb": 8, "partit": [9, 20, 21], "partition_csv": 9, "partition_doc": 9, "partition_docx": 9, "partition_email": 9, "partition_epub": 9, "partition_html": 9, "partition_imag": 9, "partition_md": 9, "partition_msg": 9, 
"partition_multiple_via_api": 9, "partition_odt": 9, "partition_org": 9, "partition_pdf": 9, "partition_ppt": 9, "partition_pptx": 9, "partition_rst": 9, "partition_rtf": 9, "partition_text": 9, "partition_tsv": 9, "partition_via_api": 9, "partition_xlsx": 9, "partition_xml": 9, "stage": 10, "convert_to_csv": 10, "convert_to_datafram": 10, "convert_to_dict": 10, "dict_to_el": 10, "stage_csv_for_prodigi": 10, "stage_for_argilla": 10, "stage_for_basepl": 10, "stage_for_datasaur": 10, "stage_for_label_box": 10, "stage_for_label_studio": 10, "stage_for_prodigi": 10, "stage_for_transform": 10, "stage_for_weavi": 10, "destin": 11, "connector": [11, 24, 25], "azur": [12, 27], "cognit": 12, "search": 12, "run": [12, 13, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "sampl": 12, "index": 12, "schema": 12, "delta": [13, 31], "exampl": 14, "sentiment": 14, "analysi": 14, "label": [14, 19], "labelstudio": 14, "metadata": [14, 24], "from": 14, "document": [14, 15, 20, 21, 24], "explor": 14, "sourc": [14, 25], "core": 15, "librari": 15, "instal": [16, 17, 18, 20, 21], "prerequisit": 16, "pull": 16, "build": 16, "interact": 16, "python": 16, "insid": 16, "contain": 16, "full": 17, "conda": 17, "window": 17, "set": 17, "up": [17, 20, 21], "infer": 17, "paddleocr": 17, "log": 17, "extra": 17, "depend": 17, "detect": 17, "html": 17, "huggingfac": 17, "note": 17, "older": 17, "integr": 19, "argilla": 19, "basepl": 19, "datasaur": 19, "hug": 19, "face": 19, "labelbox": 19, "studio": 19, "langchain": 19, "llamaindex": 19, "panda": 19, "prodigi": 19, "weaviat": 19, "introduct": [20, 23], "overview": [20, 23], "product": [20, 23], "offer": [20, 23], "kei": [20, 22, 23], "featur": [20, 23], "common": [20, 23, 24], "case": [20, 23], "quickstart": [20, 23], "tutori": [20, 23], "concept": [20, 22], "data": [20, 22, 24], "ingest": [20, 22], "preprocess": [20, 22], "text": [20, 22], "vector": [20, 22], "databas": [20, 22], "token": [20, 22], 
"larg": [20, 22], "llm": [20, 22], "retriev": [20, 22], "augment": [20, 22], "gener": [20, 22], "get": [20, 21], "start": [20, 21], "quick": [20, 21], "valid": [20, 21], "element": [20, 21], "convert": [20, 21], "dictionari": [20, 21], "json": [20, 21], "uniqu": [20, 21], "id": [20, 21], "wrap": [20, 21], "all": [20, 21], "field": 24, "addit": 24, "email": 24, "microsoft": 24, "excel": 24, "word": 24, "via": [24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "record": 24, "locat": 24, "advanc": 24, "option": 24, "regex": 24, "airtabl": 26, "biom": 28, "box": 29, "confluenc": 30, "discord": 32, "dropbox": 33, "elasticsearch": 34, "github": 35, "gitlab": 36, "googl": [37, 38], "cloud": 37, "storag": 37, "drive": [38, 42], "jira": 39, "notion": 41, "One": 42, "outlook": 43, "reddit": 44, "s3": 45, "salesforc": 46, "sharepoint": 47, "slack": 48, "wikipedia": 49}, "envversion": {"sphinx.domains.c": 2, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 8, "sphinx.domains.index": 1, "sphinx.domains.javascript": 2, "sphinx.domains.math": 2, "sphinx.domains.python": 3, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 57}, "alltitles": {"Unstructured API": [[0, "unstructured-api"]], "Supported File Types": [[0, "supported-file-types"]], "Parameters": [[0, "parameters"]], "Coordinates": [[0, "coordinates"], [24, "coordinates"]], "Encoding": [[0, "encoding"]], "OCR Languages": [[0, "ocr-languages"]], "Output Format": [[0, "output-format"]], "Page Break": [[0, "page-break"]], "Strategies": [[0, "strategies"], [3, "strategies"]], "Beta Version: hi_res Strategy with Chipper Model": [[0, "beta-version-hi-res-strategy-with-chipper-model"]], "Table Extraction": [[0, "table-extraction"]], "PDF Table Extraction": [[0, "pdf-table-extraction"]], "Table Extraction for other filetypes": [[0, "table-extraction-for-other-filetypes"]], "XML Tags": [[0, "xml-tags"]], "Using the API Locally": [[0, 
"using-the-api-locally"]], "Using Docker Images": [[0, "using-docker-images"]], "Developing with the API Locally": [[0, "developing-with-the-api-locally"]], "Best Practices": [[1, "best-practices"]], "Models": [[2, "models"]], "Using a Non-Default Model": [[2, "using-a-non-default-model"]], "Bring Your Own Models": [[2, "bring-your-own-models"]], "Bricks": [[4, "bricks"]], "Chunking": [[5, "chunking"]], "chunk_by_title": [[5, "chunk-by-title"]], "Cleaning": [[6, "cleaning"]], "bytes_string_to_string": [[6, "bytes-string-to-string"]], "clean": [[6, "clean"]], "clean_bullets": [[6, "clean-bullets"]], "clean_dashes": [[6, "clean-dashes"]], "clean_extra_whitespace": [[6, "clean-extra-whitespace"]], "clean_non_ascii_chars": [[6, "clean-non-ascii-chars"]], "clean_ordered_bullets": [[6, "clean-ordered-bullets"]], "clean_postfix": [[6, "clean-postfix"]], "clean_prefix": [[6, "clean-prefix"]], "clean_trailing_punctuation": [[6, "clean-trailing-punctuation"]], "group_broken_paragraphs": [[6, "group-broken-paragraphs"], [8, "group-broken-paragraphs"]], "remove_punctuation": [[6, "remove-punctuation"], [8, "remove-punctuation"]], "replace_unicode_quotes": [[6, "replace-unicode-quotes"], [8, "replace-unicode-quotes"]], "translate_text": [[6, "translate-text"], [8, "translate-text"]], "Embedding": [[7, "embedding"]], "BaseEmbeddingEncoder": [[7, "baseembeddingencoder"]], "OpenAIEmbeddingEncoder": [[7, "openaiembeddingencoder"]], "Extracting": [[8, "extracting"]], "extract_datetimetz": [[8, "extract-datetimetz"]], "extract_email_address": [[8, "extract-email-address"]], "extract_ip_address": [[8, "extract-ip-address"]], "extract_ip_address_name": [[8, "extract-ip-address-name"]], "extract_mapi_id": [[8, "extract-mapi-id"]], "extract_ordered_bullets": [[8, "extract-ordered-bullets"]], "extract_text_after": [[8, "extract-text-after"]], "extract_text_before": [[8, "extract-text-before"]], "extract_us_phone_number": [[8, "extract-us-phone-number"]], "Partitioning": [[9, 
"partitioning"]], "partition": [[9, "partition"]], "partition_csv": [[9, "partition-csv"]], "partition_doc": [[9, "partition-doc"]], "partition_docx": [[9, "partition-docx"]], "partition_email": [[9, "partition-email"]], "partition_epub": [[9, "partition-epub"]], "partition_html": [[9, "partition-html"]], "partition_image": [[9, "partition-image"]], "partition_md": [[9, "partition-md"]], "partition_msg": [[9, "partition-msg"]], "partition_multiple_via_api": [[9, "partition-multiple-via-api"]], "partition_odt": [[9, "partition-odt"]], "partition_org": [[9, "partition-org"]], "partition_pdf": [[9, "partition-pdf"]], "partition_ppt": [[9, "partition-ppt"]], "partition_pptx": [[9, "partition-pptx"]], "partition_rst": [[9, "partition-rst"]], "partition_rtf": [[9, "partition-rtf"]], "partition_text": [[9, "partition-text"]], "partition_tsv": [[9, "partition-tsv"]], "partition_via_api": [[9, "partition-via-api"]], "partition_xlsx": [[9, "partition-xlsx"]], "partition_xml": [[9, "partition-xml"]], "Staging": [[10, "staging"]], "convert_to_csv": [[10, "convert-to-csv"]], "convert_to_dataframe": [[10, "convert-to-dataframe"]], "convert_to_dict": [[10, "convert-to-dict"]], "dict_to_elements": [[10, "dict-to-elements"]], "stage_csv_for_prodigy": [[10, "stage-csv-for-prodigy"]], "stage_for_argilla": [[10, "stage-for-argilla"]], "stage_for_baseplate": [[10, "stage-for-baseplate"]], "stage_for_datasaur": [[10, "stage-for-datasaur"]], "stage_for_label_box": [[10, "stage-for-label-box"]], "stage_for_label_studio": [[10, "stage-for-label-studio"]], "stage_for_prodigy": [[10, "stage-for-prodigy"]], "stage_for_transformers": [[10, "stage-for-transformers"]], "stage_for_weaviate": [[10, "stage-for-weaviate"]], "Destination Connectors": [[11, "destination-connectors"]], "Azure Cognitive Search": [[12, "azure-cognitive-search"]], "Run Locally": [[12, "run-locally"], [13, "run-locally"], [26, "run-locally"], [27, "run-locally"], [28, "run-locally"], [29, "run-locally"], [30, 
"run-locally"], [31, "run-locally"], [32, "run-locally"], [33, "run-locally"], [34, "run-locally"], [35, "run-locally"], [36, "run-locally"], [37, "run-locally"], [38, "run-locally"], [39, "run-locally"], [40, "run-locally"], [41, "run-locally"], [42, "run-locally"], [43, "run-locally"], [44, "run-locally"], [45, "run-locally"], [46, "run-locally"], [47, "run-locally"], [48, "run-locally"], [49, "run-locally"]], "Sample Index Schema": [[12, "sample-index-schema"]], "Delta Table": [[13, "delta-table"], [31, "delta-table"]], "Examples": [[14, "examples"]], "Sentiment Analysis Labeling in LabelStudio": [[14, "sentiment-analysis-labeling-in-labelstudio"]], "Extracting Metadata from Documents": [[14, "extracting-metadata-from-documents"]], "Exploring Source Documents": [[14, "exploring-source-documents"]], "Unstructured Core Library": [[15, "unstructured-core-library"]], "Library Documentation": [[15, "library-documentation"]], "Docker Installation": [[16, "docker-installation"]], "Prerequisites": [[16, "prerequisites"]], "Pulling the Docker Image": [[16, "pulling-the-docker-image"]], "Using the Docker Image": [[16, "using-the-docker-image"]], "Building Your Own Docker Image": [[16, "building-your-own-docker-image"]], "Interacting with Python Inside the Container": [[16, "interacting-with-python-inside-the-container"]], "Full Installation": [[17, "full-installation"]], "Installation with conda on Windows": [[17, "installation-with-conda-on-windows"]], "Setting up unstructured for local inference": [[17, "setting-up-unstructured-for-local-inference"]], "Installing PaddleOCR": [[17, "installing-paddleocr"]], "Logging": [[17, "logging"]], "Extra Dependencies": [[17, "extra-dependencies"]], "Filetype Detection": [[17, "filetype-detection"]], "XML/HTML Dependencies": [[17, "xml-html-dependencies"]], "Huggingface Dependencies": [[17, "huggingface-dependencies"]], "Note on Older Versions": [[17, "note-on-older-versions"]], "Installation": [[18, "installation"]], 
"Integrations": [[19, "integrations"]], "Integration with Argilla": [[19, "integration-with-argilla"]], "Integration with Baseplate": [[19, "integration-with-baseplate"]], "Integration with Datasaur": [[19, "integration-with-datasaur"]], "Integration with Hugging Face": [[19, "integration-with-hugging-face"]], "Integration with Labelbox": [[19, "integration-with-labelbox"]], "Integration with Label Studio": [[19, "integration-with-label-studio"]], "Integration with LangChain": [[19, "integration-with-langchain"]], "Integration with LlamaIndex": [[19, "integration-with-llamaindex"]], "Integration with Pandas": [[19, "integration-with-pandas"]], "Integration with Prodigy": [[19, "integration-with-prodigy"]], "Integration with Weaviate": [[19, "integration-with-weaviate"]], "Introduction": [[20, "introduction"], [20, "id1"], [23, "introduction"]], "Overview": [[20, "overview"], [23, "overview"]], "Product Offerings": [[20, "product-offerings"], [23, "product-offerings"]], "Key Features": [[20, "key-features"], [23, "key-features"]], "Common Use Cases": [[20, "common-use-cases"], [23, "common-use-cases"]], "Quickstart Tutorial": [[20, "quickstart-tutorial"], [23, "quickstart-tutorial"]], "Key Concepts": [[20, "key-concepts"], [22, "key-concepts"]], "Data Ingestion": [[20, "data-ingestion"], [22, "data-ingestion"]], "Data Preprocessing": [[20, "data-preprocessing"], [22, "data-preprocessing"]], "Chunking Text for Vector Databases": [[20, "chunking-text-for-vector-databases"], [22, "chunking-text-for-vector-databases"]], "Embeddings": [[20, "embeddings"], [22, "embeddings"]], "Vector Databases": [[20, "vector-databases"], [22, "vector-databases"]], "Tokens": [[20, "tokens"], [22, "tokens"]], "Large Language Models (LLMs)": [[20, "large-language-models-llms"], [22, "large-language-models-llms"]], "Retrieval Augmented Generation": [[20, "retrieval-augmented-generation"], [22, "retrieval-augmented-generation"]], "Getting Started": [[20, "getting-started"], [21, 
"getting-started"]], "Quick Installation": [[20, "quick-installation"], [21, "quick-installation"]], "Validating Installation": [[20, "validating-installation"], [21, "validating-installation"]], "Partitioning a document": [[20, "partitioning-a-document"], [21, "partitioning-a-document"]], "Document elements": [[20, "document-elements"], [21, "document-elements"]], "Elements": [[20, "elements"], [21, "elements"]], "Tables": [[20, "tables"], [21, "tables"]], "Converting Elements to Dictionary or JSON": [[20, "converting-elements-to-dictionary-or-json"], [21, "converting-elements-to-dictionary-or-json"]], "Unique Element IDs": [[20, "unique-element-ids"], [21, "unique-element-ids"]], "Wrapping it all up": [[20, "wrapping-it-all-up"], [21, "wrapping-it-all-up"]], "Metadata": [[24, "metadata"]], "Common Metadata Fields": [[24, "common-metadata-fields"]], "Additional Metadata Fields by Document Type": [[24, "additional-metadata-fields-by-document-type"]], "Email": [[24, "email"]], "Microsoft Excel Documents": [[24, "microsoft-excel-documents"]], "Microsoft Word Documents": [[24, "microsoft-word-documents"]], "Data Connector Metadata Fields": [[24, "data-connector-metadata-fields"]], "Common Data Connector Metadata Fields": [[24, "common-data-connector-metadata-fields"]], "Additional Metadata Fields by Connector Type (via record locator)": [[24, "additional-metadata-fields-by-connector-type-via-record-locator"]], "Advanced Metadata Options": [[24, "advanced-metadata-options"]], "Extract Metadata with Regexes": [[24, "extract-metadata-with-regexes"]], "Source Connectors": [[25, "source-connectors"]], "Airtable": [[26, "airtable"]], "Run via the API": [[26, "run-via-the-api"], [27, "run-via-the-api"], [28, "run-via-the-api"], [29, "run-via-the-api"], [30, "run-via-the-api"], [31, "run-via-the-api"], [32, "run-via-the-api"], [33, "run-via-the-api"], [34, "run-via-the-api"], [35, "run-via-the-api"], [36, "run-via-the-api"], [37, "run-via-the-api"], [38, "run-via-the-api"], 
[39, "run-via-the-api"], [40, "run-via-the-api"], [41, "run-via-the-api"], [42, "run-via-the-api"], [43, "run-via-the-api"], [44, "run-via-the-api"], [45, "run-via-the-api"], [46, "run-via-the-api"], [47, "run-via-the-api"], [48, "run-via-the-api"], [49, "run-via-the-api"]], "Azure": [[27, "azure"]], "Biomed": [[28, "biomed"]], "Box": [[29, "box"]], "Confluence": [[30, "confluence"]], "Discord": [[32, "discord"]], "Dropbox": [[33, "dropbox"]], "Elasticsearch": [[34, "elasticsearch"]], "Github": [[35, "github"]], "Gitlab": [[36, "gitlab"]], "Google Cloud Storage": [[37, "google-cloud-storage"]], "Google Drive": [[38, "google-drive"]], "Jira": [[39, "jira"]], "Local": [[40, "local"]], "Notion": [[41, "notion"]], "One Drive": [[42, "one-drive"]], "Outlook": [[43, "outlook"]], "Reddit": [[44, "reddit"]], "S3": [[45, "s3"]], "Salesforce": [[46, "salesforce"]], "Sharepoint": [[47, "sharepoint"]], "Slack": [[48, "slack"]], "Wikipedia": [[49, "wikipedia"]]}, "indexentries": {}})
\ No newline at end of file
diff --git a/source_connectors.html b/source_connectors.html
index 359bf5695d..5a04db6b64 100644
--- a/source_connectors.html
+++ b/source_connectors.html
@@ -7,7 +7,7 @@
- Source Connectors - Unstructured 0.10.19 documentation
+ Source Connectors - Unstructured 0.10.20 documentation
diff --git a/source_connectors/airtable.html b/source_connectors/airtable.html
index 7c52b8190d..30db287624 100644
--- a/source_connectors/airtable.html
+++ b/source_connectors/airtable.html
@@ -7,7 +7,7 @@
- Airtable - Unstructured 0.10.19 documentation
+ Airtable - Unstructured 0.10.20 documentation
diff --git a/source_connectors/azure.html b/source_connectors/azure.html
index 9785f516e2..b7d9d62712 100644
--- a/source_connectors/azure.html
+++ b/source_connectors/azure.html
@@ -7,7 +7,7 @@
- Azure - Unstructured 0.10.19 documentation
+ Azure - Unstructured 0.10.20 documentation
diff --git a/source_connectors/biomed.html b/source_connectors/biomed.html
index da51e7bd1f..f367f0df98 100644
--- a/source_connectors/biomed.html
+++ b/source_connectors/biomed.html
@@ -7,7 +7,7 @@
- Biomed - Unstructured 0.10.19 documentation
+ Biomed - Unstructured 0.10.20 documentation
diff --git a/source_connectors/box.html b/source_connectors/box.html
index 70da419c43..63b6d3be2e 100644
--- a/source_connectors/box.html
+++ b/source_connectors/box.html
@@ -7,7 +7,7 @@
- Box - Unstructured 0.10.19 documentation
+ Box - Unstructured 0.10.20 documentation
diff --git a/source_connectors/confluence.html b/source_connectors/confluence.html
index 7af665273a..58a6070a2b 100644
--- a/source_connectors/confluence.html
+++ b/source_connectors/confluence.html
@@ -7,7 +7,7 @@
- Confluence - Unstructured 0.10.19 documentation
+ Confluence - Unstructured 0.10.20 documentation
diff --git a/source_connectors/delta_table.html b/source_connectors/delta_table.html
index 07545cf0ce..3689148e01 100644
--- a/source_connectors/delta_table.html
+++ b/source_connectors/delta_table.html
@@ -7,7 +7,7 @@
- Delta Table - Unstructured 0.10.19 documentation
+ Delta Table - Unstructured 0.10.20 documentation
diff --git a/source_connectors/discord.html b/source_connectors/discord.html
index b4fb4b4cb7..340e50fb99 100644
--- a/source_connectors/discord.html
+++ b/source_connectors/discord.html
@@ -7,7 +7,7 @@
- Discord - Unstructured 0.10.19 documentation
+ Discord - Unstructured 0.10.20 documentation
diff --git a/source_connectors/dropbox.html b/source_connectors/dropbox.html
index 4ef93f64af..2650674023 100644
--- a/source_connectors/dropbox.html
+++ b/source_connectors/dropbox.html
@@ -7,7 +7,7 @@
- Dropbox - Unstructured 0.10.19 documentation
+ Dropbox - Unstructured 0.10.20 documentation
diff --git a/source_connectors/elasticsearch.html b/source_connectors/elasticsearch.html
index 93bec00a5c..212109778a 100644
--- a/source_connectors/elasticsearch.html
+++ b/source_connectors/elasticsearch.html
@@ -7,7 +7,7 @@
- Elasticsearch - Unstructured 0.10.19 documentation
+ Elasticsearch - Unstructured 0.10.20 documentation
diff --git a/source_connectors/github.html b/source_connectors/github.html
index 53249e9616..964b33e0ae 100644
--- a/source_connectors/github.html
+++ b/source_connectors/github.html
@@ -7,7 +7,7 @@
- Github - Unstructured 0.10.19 documentation
+ Github - Unstructured 0.10.20 documentation
diff --git a/source_connectors/gitlab.html b/source_connectors/gitlab.html
index b70b14c73a..e65b340ce3 100644
--- a/source_connectors/gitlab.html
+++ b/source_connectors/gitlab.html
@@ -7,7 +7,7 @@
- Gitlab - Unstructured 0.10.19 documentation
+ Gitlab - Unstructured 0.10.20 documentation
diff --git a/source_connectors/google_cloud_storage.html b/source_connectors/google_cloud_storage.html
index d0bd78e8bb..8ea331b343 100644
--- a/source_connectors/google_cloud_storage.html
+++ b/source_connectors/google_cloud_storage.html
@@ -7,7 +7,7 @@
- Google Cloud Storage - Unstructured 0.10.19 documentation
+ Google Cloud Storage - Unstructured 0.10.20 documentation
diff --git a/source_connectors/google_drive.html b/source_connectors/google_drive.html
index 604fc38227..115776cad5 100644
--- a/source_connectors/google_drive.html
+++ b/source_connectors/google_drive.html
@@ -7,7 +7,7 @@
- Google Drive - Unstructured 0.10.19 documentation
+ Google Drive - Unstructured 0.10.20 documentation
diff --git a/source_connectors/jira.html b/source_connectors/jira.html
index c452cfc015..038de0c499 100644
--- a/source_connectors/jira.html
+++ b/source_connectors/jira.html
@@ -7,7 +7,7 @@
- Jira - Unstructured 0.10.19 documentation
+ Jira - Unstructured 0.10.20 documentation
diff --git a/source_connectors/local_connector.html b/source_connectors/local_connector.html
index 073a08e1df..1b217b017e 100644
--- a/source_connectors/local_connector.html
+++ b/source_connectors/local_connector.html
@@ -7,7 +7,7 @@
- Local - Unstructured 0.10.19 documentation
+ Local - Unstructured 0.10.20 documentation
diff --git a/source_connectors/notion.html b/source_connectors/notion.html
index 8f57a8ec92..b38c974b7a 100644
--- a/source_connectors/notion.html
+++ b/source_connectors/notion.html
@@ -7,7 +7,7 @@
- Notion - Unstructured 0.10.19 documentation
+ Notion - Unstructured 0.10.20 documentation
diff --git a/source_connectors/onedrive.html b/source_connectors/onedrive.html
index d80070a718..e94dc0662e 100644
--- a/source_connectors/onedrive.html
+++ b/source_connectors/onedrive.html
@@ -7,7 +7,7 @@
- One Drive - Unstructured 0.10.19 documentation
+ One Drive - Unstructured 0.10.20 documentation
diff --git a/source_connectors/outlook.html b/source_connectors/outlook.html
index 494d884df7..78e0cafce4 100644
--- a/source_connectors/outlook.html
+++ b/source_connectors/outlook.html
@@ -7,7 +7,7 @@
- Outlook - Unstructured 0.10.19 documentation
+ Outlook - Unstructured 0.10.20 documentation
diff --git a/source_connectors/reddit.html b/source_connectors/reddit.html
index 547e02761e..8faa94ecf1 100644
--- a/source_connectors/reddit.html
+++ b/source_connectors/reddit.html
@@ -7,7 +7,7 @@
- Reddit - Unstructured 0.10.19 documentation
+ Reddit - Unstructured 0.10.20 documentation
diff --git a/source_connectors/s3.html b/source_connectors/s3.html
index d908e0741c..30316d72d6 100644
--- a/source_connectors/s3.html
+++ b/source_connectors/s3.html
@@ -7,7 +7,7 @@
- S3 - Unstructured 0.10.19 documentation
+ S3 - Unstructured 0.10.20 documentation
diff --git a/source_connectors/salesforce.html b/source_connectors/salesforce.html
index e3e258907c..e10a77fd25 100644
--- a/source_connectors/salesforce.html
+++ b/source_connectors/salesforce.html
@@ -7,7 +7,7 @@
- Salesforce - Unstructured 0.10.19 documentation
+ Salesforce - Unstructured 0.10.20 documentation
diff --git a/source_connectors/sharepoint.html b/source_connectors/sharepoint.html
index 02cb128304..b7d04e84a4 100644
--- a/source_connectors/sharepoint.html
+++ b/source_connectors/sharepoint.html
@@ -7,7 +7,7 @@
- Sharepoint - Unstructured 0.10.19 documentation
+ Sharepoint - Unstructured 0.10.20 documentation
diff --git a/source_connectors/slack.html b/source_connectors/slack.html
index 7fea01769f..dc86247cc6 100644
--- a/source_connectors/slack.html
+++ b/source_connectors/slack.html
@@ -7,7 +7,7 @@
- Slack - Unstructured 0.10.19 documentation
+ Slack - Unstructured 0.10.20 documentation
diff --git a/source_connectors/wikipedia.html b/source_connectors/wikipedia.html
index 53c224c553..ff91a80833 100644
--- a/source_connectors/wikipedia.html
+++ b/source_connectors/wikipedia.html
@@ -7,7 +7,7 @@
- Wikipedia - Unstructured 0.10.19 documentation
+ Wikipedia - Unstructured 0.10.20 documentation