How to process a subset of files (#185)
Paul-Cornell authored Aug 26, 2024
1 parent c1fe030 commit 545c68c
Showing 2 changed files with 82 additions and 0 deletions.
81 changes: 81 additions & 0 deletions api-reference/how-to/filter-files.mdx
@@ -0,0 +1,81 @@
---
title: Process a subset of files
---

<Note>
The following information applies only to the [Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli) and the [Unstructured Ingest Python library](/ingestion/overview#unstructured-ingest-python-library).

The Unstructured SDKs for Python and JavaScript/TypeScript and the Unstructured open-source library do not support this functionality.
</Note>

## Task

You want to process only files with specified extensions, only files at or below a specified size, or both.

## Approach

For the Ingest CLI, use the following command options. For the Ingest Python library, use the following parameters for the `FiltererConfig` object.

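- Use `--file-glob` (CLI) or `file_glob` (Python) to specify the glob patterns, such as `*.pdf`, that files must match to be processed. In the CLI, separate multiple patterns with commas. In Python, pass a list of strings.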
- Use `--max-file-size` (CLI) or `max_file_size` (Python) to specify the maximum size of files to process, in bytes.
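
To illustrate how the two filters combine, here is a minimal sketch in plain Python. It is not the Ingest library's implementation. The function name `passes_filters` is hypothetical, and the sketch assumes that patterns are matched against the file name only and that matching is case-sensitive.

```python
from fnmatch import fnmatchcase


def passes_filters(name, size_bytes, file_glob=None, max_file_size=None):
    """Return True if a file would pass both filters (illustrative only)."""
    # Assumption: the file name must match at least one glob pattern.
    if file_glob and not any(fnmatchcase(name, pattern) for pattern in file_glob):
        return False
    # A file "at or below" the size limit passes, so the comparison is inclusive.
    if max_file_size is not None and size_bytes > max_file_size:
        return False
    return True


patterns = ["*.pdf", "*.eml"]
print(passes_filters("report.pdf", 50_000, patterns, 100_000))
print(passes_filters("report.pdf", 150_000, patterns, 100_000))
print(passes_filters("notes.txt", 10, patterns, 100_000))
```

In this sketch, a file must satisfy every filter that is set, and a filter that is left unset places no restriction.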

## To run this example

The following example processes only `.pdf` and `.eml` files that are 100 KB (100,000 bytes) or smaller. To run this example, you need a directory
that contains a mix of files, including at least one `.pdf` file and one `.eml` file, at least one of which is 100 KB or smaller.
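
Before you run the example, you might want to confirm that your input directory meets these requirements. The following standard-library sketch counts the candidate files. The function name `summarize_candidates` is hypothetical, and the sketch assumes that subdirectories should be searched and that extensions are compared case-sensitively.

```python
from pathlib import Path


def summarize_candidates(input_dir, extensions=(".pdf", ".eml"), max_bytes=100_000):
    """Count files per extension and how many of them are within the size limit."""
    counts = {ext: 0 for ext in extensions}
    within_limit = 0
    # Assumption: walk the directory tree recursively.
    for path in Path(input_dir).rglob("*"):
        if not path.is_file():
            continue
        if path.suffix in counts:
            counts[path.suffix] += 1
            if path.stat().st_size <= max_bytes:
                within_limit += 1
    return counts, within_limit
```

For example, `summarize_candidates(os.getenv("LOCAL_FILE_INPUT_DIR"))` (after `import os`) returns the per-extension counts and the number of matching files that are 100,000 bytes or smaller. If that number is zero, the example has nothing to process.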

## Code

<CodeGroup>
```bash CLI Ingest v2
unstructured-ingest \
  local \
    --input-path "$LOCAL_FILE_INPUT_DIR" \
    --output-dir "$LOCAL_FILE_OUTPUT_DIR" \
    --file-glob "*.pdf,*.eml" \
    --max-file-size 100000 \
    --partition-by-api \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
    --api-key "$UNSTRUCTURED_API_KEY"
```

```python Python Ingest v2
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.filter import FiltererConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Read source files from the local input directory.
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Process only .pdf and .eml files that are 100,000 bytes or smaller.
        filterer_config=FiltererConfig(
            file_glob=["*.pdf", "*.eml"],
            max_file_size=100000
        ),
        # Partition the filtered files by calling the Unstructured API.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "unique_element_ids": True,
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        # Write the results to the local output directory.
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```
</CodeGroup>

1 change: 1 addition & 0 deletions mint.json
@@ -356,6 +356,7 @@
"api-reference/how-to/choose-partitioning-strategy",
"api-reference/how-to/choose-hi-res-model",
"api-reference/how-to/get-elements",
"api-reference/how-to/filter-files",
"api-reference/how-to/embedding",
"api-reference/how-to/parse-simple-pdf-html",
"api-reference/how-to/change-partitioning-strategy",