PDF -> Markdown CLI utility

A convenience CLI wrapper around the Marker API, which is currently best-in-class for PDF->Markdown conversion. See the /examples folder for conversion examples. Note that GitHub's Markdown renderer doesn't fully support inline math equations, so view the examples locally for a proper comparison.

Features

Supported inputs: PDF, Word (.doc, .docx), PowerPoint (.ppt, .pptx), Images (.png, .jpg, .jpeg, .webp, .gif, .tiff)
Supported outputs: Markdown and JSON
Automatically splits large PDFs into smaller files for processing, which can speed up processing by up to 10x
Can OCR almost any language supported by modern LLMs
Automatic file name cleaning and organization
Optional --llm flag can sometimes improve accuracy, especially for tables and inline math
Excellent support for inline math equations
Stores request status in a local cache file, so you can resume interrupted conversions
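
To illustrate how a local status cache can make conversions resumable, here is a minimal sketch. It is not the tool's actual implementation: the cache layout (a JSON object mapping file names to a "status" field) and the helper names load_cache, save_cache and pending_files are assumptions made for this example.

```python
import json
from pathlib import Path


def load_cache(cache_path: Path) -> dict:
    """Return cached request states, or an empty dict if no cache file exists."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    return {}


def save_cache(cache_path: Path, cache: dict) -> None:
    """Write to a temporary file first, then swap it in, so an interrupted
    write does not leave a half-written cache behind."""
    tmp = cache_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    tmp.replace(cache_path)


def pending_files(files: list[str], cache: dict) -> list[str]:
    """Files whose cached status is anything other than 'done' (hypothetical status value)."""
    return [f for f in files if cache.get(f, {}).get("status") != "done"]
```

On a rerun, only the entries returned by pending_files would need to be submitted again.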

Installation & Usage

Clone and navigate to the repository:

git clone <repository-url>
cd marker_pdf_to_md

Install dependencies:

pip install -r requirements.txt

Set up your API key:
- Get your API key from datalab.to
- Set the environment variable MARKER_PDF_KEY:
```
export MARKER_PDF_KEY=your_api_key_here
```
- For permanent setup, add to your shell configuration file (e.g., .bashrc or .zshrc):
```
echo 'export MARKER_PDF_KEY=your_api_key_here' >> ~/.zshrc
source ~/.zshrc
```
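
For reference, a script could read the key from the environment roughly as follows. This is a sketch only: the function name get_api_key and the error message are assumptions, and the real CLI may handle a missing key differently.

```python
import os


def get_api_key(env_var: str = "MARKER_PDF_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Hypothetical error message; the real tool's wording may differ.
        raise RuntimeError(f"{env_var} is not set; export it before running the CLI")
    return key
```

Failing early keeps a missing key from surfacing later as a less obvious authentication error.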

Usage

python marker_cli.py input.pdf # single file
python marker_cli.py input_dir/ # directory of files
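
As a rough sketch of how a file-or-directory argument could be expanded into a list of inputs, the snippet below filters by the supported extensions listed under Features. The function name collect_inputs is hypothetical, and the non-recursive directory scan is an assumption; the actual tool may descend into subfolders.

```python
from pathlib import Path

# Extensions taken from the "Supported inputs" list in this README.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx",
    ".png", ".jpg", ".jpeg", ".webp", ".gif", ".tiff",
}


def collect_inputs(path: Path) -> list[Path]:
    """Return the supported files for a single file or a directory (top level only)."""
    if path.is_file():
        return [path] if path.suffix.lower() in SUPPORTED_EXTENSIONS else []
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```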

Available Options

--strip
- Remove the existing OCR layer and redo OCR on the document
- Useful for files with poor quality existing OCR
--force
- Force OCR on every page
- Ignores existing PDF text
- Slower but more accurate for problematic PDFs
--llm
- Enable LLM enhancement for better accuracy
- Improves forms, tables, inline math, and layout recognition
- Note: Doubles the per-request cost
--max
- Enable all OCR enhancements: ignores existing OCR and uses LLM for all text, equations, and tables
- Likewise doubles the per-request cost
--noimg
- Disable image extraction
- When used with --llm, converts images to text descriptions
--json
- Output in JSON format instead of Markdown
--pages
- Add page delimiters to output
- Helps maintain document structure
--no-chunk
- Disable PDF chunking (processes entire PDF as one file)
- Useful for small PDFs or when you want to ensure document coherence
- Note: May be slower for large files
-cs, --chunk-size PAGES
- Set custom chunk size in pages (default: 25)
- Larger chunks mean fewer API requests but slower individual processing
- Example: -cs 50 processes 50 pages per chunk
--outdir PATH
- Default: converted/<filename>/<timestamp>/
--langs LANGUAGES
- Comma-separated list of languages to use for OCR, useful for mixed language documents
- Example: "English,French"
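
To make the effect of --chunk-size concrete, the sketch below computes the page ranges a PDF would be split into. It assumes 0-indexed, inclusive page ranges and a hypothetical helper name chunk_page_ranges; the tool's own splitter may represent chunks differently.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 25) -> list[tuple[int, int]]:
    """Split a document into (first_page, last_page) ranges, 0-indexed and inclusive."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


# A 60-page PDF with the default chunk size yields three chunks,
# while -cs 50 would yield two.
print(chunk_page_ranges(60))      # [(0, 24), (25, 49), (50, 59)]
print(chunk_page_ranges(60, 50))  # [(0, 49), (50, 59)]
```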

More Usage Examples

Process with specific languages:

python marker_cli.py document.pdf --langs "English,French"

Maximum quality conversion:

python marker_cli.py document.pdf --max

JSON output with image extraction disabled:

python marker_cli.py document.pdf --json --noimg

Process large PDF without chunking:

python marker_cli.py document.pdf --no-chunk

Process with custom chunk size of 50 pages:

python marker_cli.py document.pdf --chunk-size 50
# or
python marker_cli.py document.pdf -cs 50

Output Structure

Converted files are organized as follows. The timestamped subfolders are created to avoid overwriting previous conversions:

converted/
└── document_name/
    └── YY-MM-DD_HH-MM/
        ├── document.md
        └── images/
            └── extracted_images...
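
The layout above can be reproduced with a few lines of Python. This is a sketch under assumptions: the helper name output_dir is hypothetical, and the folder name is taken from the input file's stem without the file name cleaning the tool applies.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def output_dir(
    input_file: Path,
    base: Path = Path("converted"),
    now: Optional[datetime] = None,
) -> Path:
    """Build converted/<document_name>/<YY-MM-DD_HH-MM>/ for an input file."""
    now = now or datetime.now()
    return base / input_file.stem / now.strftime("%y-%m-%d_%H-%M")


# Example with a fixed timestamp so the result is reproducible.
print(output_dir(Path("report.pdf"), now=datetime(2025, 1, 2, 3, 4)).as_posix())
# converted/report/25-01-02_03-04
```

Because the timestamp has minute resolution, two runs on the same file within the same minute would map to the same folder in this sketch.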

Troubleshooting

If output quality is poor, try enabling --force to ignore existing OCR inside the PDF
Ensure correct language settings with --langs
Failed conversions should show detailed error messages. Open an issue on the repo if you think the error is caused by the tool.

License

MIT License
