A convenience CLI wrapper around the Marker API, which is currently best-in-class for PDF-to-Markdown conversion. See the /examples folder for conversion examples. Note that GitHub's Markdown renderer doesn't fully support inline math equations, so view the examples locally for a proper comparison.
- Supported inputs: PDF, Word (.doc, .docx), PowerPoint (.ppt, .pptx), Images (.png, .jpg, .jpeg, .webp, .gif, .tiff)
- Supported outputs: Markdown and JSON
- Automatically splits large PDFs into smaller files for processing, which speeds up processing by up to 10x
- Can OCR almost any language that's supported by modern LLMs
- Automatic file name cleaning and organization
- Optional --llm flag can sometimes improve accuracy, especially for tables and inline math
- Excellent support for inline math equations
- Stores request status in a local cache file, so you can resume interrupted conversions
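To illustrate the resume behaviour mentioned in the last feature, here is a minimal sketch of how a status cache could work. The cache layout, the file name, and the function names (`load_cache`, `save_cache`, `pending_chunks`) are assumptions made for illustration; the tool's actual cache format is not documented here and may differ.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached status mapping, or an empty dict if no cache exists yet."""
    cache_file = Path(path)
    if not cache_file.exists():
        return {}
    return json.loads(cache_file.read_text(encoding="utf-8"))


def save_cache(path, cache):
    """Write the status mapping to disk so a later run can pick it up."""
    Path(path).write_text(json.dumps(cache, indent=2), encoding="utf-8")


def pending_chunks(chunk_ids, cache):
    """Return the chunks that still need processing (anything not marked complete)."""
    return [chunk for chunk in chunk_ids if cache.get(chunk) != "complete"]
```

On a resumed run, only the identifiers returned by `pending_chunks` would be resubmitted; chunks already marked complete are skipped.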
- Clone and navigate to the repository:
git clone <repository-url>
cd marker_pdf_to_md
- Install dependencies:
pip install -r requirements.txt
- Set up your API key:
  - Get your API key from datalab.to
  - Set the environment variable `MARKER_PDF_KEY`: `export MARKER_PDF_KEY=your_api_key_here`
  - For permanent setup, add it to your shell configuration file (e.g., `.bashrc` or `.zshrc`): `echo 'export MARKER_PDF_KEY=your_api_key_here' >> ~/.zshrc`, then `source ~/.zshrc`
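A script can check that the key is available before any conversion starts. The sketch below reads `MARKER_PDF_KEY` from the environment and fails with a readable message if it is missing; the helper name `load_api_key` is hypothetical and not part of the CLI.

```python
import os

ENV_VAR = "MARKER_PDF_KEY"  # variable name taken from the setup steps


def load_api_key(environ=None):
    """Return the API key, or raise a readable error if it is not set."""
    env = os.environ if environ is None else environ
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{ENV_VAR} is not set. Run: export {ENV_VAR}=your_api_key_here"
        )
    return key
```

Passing a plain dict as `environ` makes the check easy to exercise without touching the real environment.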
python marker_cli.py input.pdf # single file
python marker_cli.py input_dir/ # directory of files
- `--strip`
  - Remove and redo OCR on the document
  - Useful for files with poor quality existing OCR
- `--force`
  - Force OCR on every page
  - Ignores existing PDF text
  - Slower but more accurate for problematic PDFs
- `--llm`
  - Enable LLM enhancement for better accuracy
  - Improves forms, tables, inline math, and layout recognition
  - Note: Doubles the per-request cost
- `--max`
  - Enable all OCR enhancements: ignores existing OCR and uses LLM for all text, equations, and tables
  - Likewise doubles the per-request cost
- `--noimg`
  - Disable image extraction
  - When used with `--llm`, converts images to text descriptions
- `--json`
  - Output in JSON format instead of Markdown
- `--pages`
  - Add page delimiters to output
  - Helps maintain document structure
- `--no-chunk`
  - Disable PDF chunking (processes entire PDF as one file)
  - Useful for small PDFs or when you want to ensure document coherence
  - Note: May be slower for large files
- `-cs`, `--chunk-size PAGES`
  - Set custom chunk size in pages (default: 25)
  - Larger chunks mean fewer API requests but slower individual processing
  - Example: `-cs 50` processes 50 pages per chunk
- `--outdir PATH`
  - Default: `converted/<filename>/<timestamp>/`
- `--langs LANGUAGES`
  - Comma-separated list of languages to use for OCR, useful for mixed language documents
  - Example: "English,French"
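To make the chunk-size trade-off concrete, the following sketch computes the page ranges that a given chunk size would produce. It only illustrates the arithmetic; the function name `chunk_page_ranges` is hypothetical, and the way the tool actually splits PDFs internally is an assumption.

```python
def chunk_page_ranges(total_pages, chunk_size=25):
    """Split pages 1..total_pages into inclusive (start, end) ranges.

    The default of 25 mirrors the documented default chunk size.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


# A 120-page PDF yields 5 chunks at the default size, but only 3 with a size of 50.
default_chunks = chunk_page_ranges(120)
larger_chunks = chunk_page_ranges(120, chunk_size=50)
```

Fewer, larger chunks mean fewer API requests, at the cost of each request taking longer.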
Process with specific languages:
python marker_cli.py document.pdf --langs "English,French"
Maximum quality conversion:
python marker_cli.py document.pdf --max
JSON output with image extraction disabled:
python marker_cli.py document.pdf --json --noimg
Process large PDF without chunking:
python marker_cli.py document.pdf --no-chunk
Process with custom chunk size of 50 pages:
python marker_cli.py document.pdf --chunk-size 50
# or
python marker_cli.py document.pdf -cs 50
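When conversions are driven from another Python script, the same invocations can be assembled programmatically. This is a minimal sketch that uses only flags documented above; the helper `build_command` and its keyword names are hypothetical, and it assumes `marker_cli.py` is in the current working directory.

```python
def build_command(input_path, *, llm=False, force=False, json_output=False,
                  noimg=False, chunk_size=None, langs=None):
    """Assemble an argument list for marker_cli.py from the documented flags."""
    cmd = ["python", "marker_cli.py", str(input_path)]
    if llm:
        cmd.append("--llm")
    if force:
        cmd.append("--force")
    if json_output:
        cmd.append("--json")
    if noimg:
        cmd.append("--noimg")
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    if langs:
        # The CLI expects a single comma-separated value, e.g. "English,French".
        cmd += ["--langs", ",".join(langs)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; using a list rather than a single string avoids shell quoting issues with file names that contain spaces.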
Converted files are organized as follows. Timestamped subfolders are created to avoid overwriting previous conversions:
    converted/
    └── document_name/
        └── YY-MM-DD_HH-MM/
            ├── document.md
            └── images/
                └── extracted_images...
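The layout above can be reproduced in a few lines, which is handy when a script needs to locate the most recent output. This sketch assumes the folder name is the input file's stem and that the timestamp follows the `YY-MM-DD_HH-MM` pattern shown in the tree; the tool's file name cleaning may alter the folder name, and `default_output_dir` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path


def default_output_dir(input_path, now=None, root="converted"):
    """Build converted/<filename>/<timestamp>/ for a given input file."""
    moment = datetime.now() if now is None else now
    stamp = moment.strftime("%y-%m-%d_%H-%M")  # two-digit year, per the tree
    return Path(root) / Path(input_path).stem / stamp
```

Because the timestamp has minute resolution, two conversions of the same file started within the same minute would map to the same folder under this scheme.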
- If output quality is poor, try enabling `--force` to ignore existing OCR inside the PDF
- Ensure correct language settings with `--langs`
- Failed conversions should show detailed error messages; open an issue on the repo if you think the error is caused by the tool
MIT License