Add system dependencies in README
baptiste-pasquier committed Mar 27, 2024
1 parent 7269e90 commit 5c385d0
Showing 1 changed file with 23 additions and 4 deletions.
27 changes: 23 additions & 4 deletions README.md
@@ -44,10 +44,10 @@ For all options, we can choose to treat tables as text or images.
**Common parameters**:

- `ingest.clear_database` : Whether to clear the database before ingesting new data.
-- `ingest.partition_pdf_func` : Parameters for Unstructured `partition_pdf` function.
-- `ingest.chunking_func` : Parameters for Unstructured chunking function.
-- `ingest.metadata_keys` : Unstructured metadata to use.
-- `ingest.table_format` : How to extract table with Unstructured (`text`, `html` or `image`).
+- `ingest.partition_pdf_func` : Parameters for *Unstructured* `partition_pdf` function.
+- `ingest.chunking_func` : Parameters for *Unstructured* chunking function.
+- `ingest.metadata_keys` : *Unstructured* metadata to use.
+- `ingest.table_format` : How to extract table with *Unstructured* (`text`, `html` or `image`).
- `ingest.image_min_size` : Minimum relative size for images to be considered.
- `ingest.table_min_size` : Minimum relative size for tables to be considered.
- `ingest.export_extracted` : Whether to export extracted elements in local folder.
@@ -133,6 +133,25 @@ To set up the project, ensure you have Python version between 3.10 and 3.11. The
poetry install
```

*Unstructured* requires the following system dependencies:

- *poppler-utils* : Needed for *pdf2image*.
- *tesseract-ocr* : Needed for image and PDF processing.

Installation on Linux (Debian/Ubuntu):

```bash
sudo apt update
sudo apt install -y poppler-utils tesseract-ocr
```

Installation on macOS:

```bash
brew update
brew install poppler tesseract
```
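
To confirm that the system dependencies are visible to Python after installation, a check along the following lines may help. This is a minimal sketch and not part of the project: the helper `missing_system_dependencies` is hypothetical, and it assumes the packages above expose the `pdftoppm` (poppler) and `tesseract` executables on `PATH`.

```python
import shutil

# Assumption: each system package is detected through one executable it installs.
# pdftoppm ships with poppler / poppler-utils, tesseract with tesseract / tesseract-ocr.
REQUIRED_TOOLS = {
    "pdftoppm": "poppler-utils",
    "tesseract": "tesseract-ocr",
}


def missing_system_dependencies(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for exe, pkg in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = missing_system_dependencies()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All system dependencies found.")
```

An empty list means both executables were found. Any names in the list are the packages to install with `apt` or `brew` as shown above.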

Before running the application, you need to set up the environment variables.
Copy the `template.env` file to a new file named `.env` and fill in the necessary API keys and endpoints:
