Python Functions for parsing PDF Files with Tesseract OCR

Parsing PDF Files Using Python: A Guide with Tesseract OCR

Parse text from PDF files using Python Functions. This guide covers how to convert PDF pages into images, preprocess those images to correct distortions such as skew, and extract text with Tesseract OCR.

The packaged code repository uses several libraries, including cv2, pytesseract, and pdf2image, to extract and process text from PDF attachments.
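The three stages described above (PDF pages to images, skew correction, OCR) can be sketched roughly as follows. This is a minimal illustration, not the packaged repository's code: the function names `profile_sharpness`, `deskew`, and `extract_text_from_pdf` are hypothetical, and the skew correction shown here is a simple projection-profile search, which may differ from the method the package uses. It assumes the Tesseract and Poppler binaries are installed alongside the Python libraries.

```python
from typing import List, Sequence


def profile_sharpness(row_sums: Sequence[float]) -> float:
    """Score a row-sum profile; well-aligned text lines give sharp peaks."""
    return float(sum((b - a) ** 2 for a, b in zip(row_sums, row_sums[1:])))


def deskew(image, max_angle: float = 5.0, step: float = 0.5):
    """Rotate a BGR page image by the candidate angle with the sharpest profile."""
    import cv2  # imported lazily so the pure helper above works without OpenCV

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu threshold, inverted so ink is non-zero on a black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    h, w = binary.shape
    center = (w / 2.0, h / 2.0)

    # Try small angles first so that ties (e.g. a blank page) favor no rotation.
    steps = int(round(2 * max_angle / step))
    candidates = sorted((-max_angle + i * step for i in range(steps + 1)), key=abs)

    best_angle, best_score = 0.0, -1.0
    for angle in candidates:
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(binary, matrix, (w, h), flags=cv2.INTER_NEAREST)
        score = profile_sharpness(rotated.sum(axis=1).tolist())
        if score > best_score:
            best_angle, best_score = angle, score

    # Apply the winning rotation to the original color image.
    matrix = cv2.getRotationMatrix2D(center, best_angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)


def extract_text_from_pdf(pdf_path: str, dpi: int = 300) -> List[str]:
    """Return the OCR text of each page in the PDF, in page order."""
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    texts = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        # pdf2image yields PIL images (RGB); OpenCV works in BGR channel order.
        bgr = cv2.cvtColor(np.array(page.convert("RGB")), cv2.COLOR_RGB2BGR)
        straightened = deskew(bgr)
        rgb = cv2.cvtColor(straightened, cv2.COLOR_BGR2RGB)
        texts.append(pytesseract.image_to_string(rgb))
    return texts
```

Calling `extract_text_from_pdf("scan.pdf")` (with a hypothetical file name) would return one string per page. The search range of plus or minus 5 degrees and the 300 dpi rendering are illustrative defaults and may need tuning for heavily skewed or low-resolution scans.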

Upload Package to Your Enrollment

The first step is uploading your package to the Foundry Marketplace:

  1. Download the project's .zip file from this repository
  2. Access your enrollment's marketplace at:
    {enrollment-url}/workspace/marketplace
    
  3. In the marketplace interface, initiate the upload process:
    • Select or create a store in your preferred project folder
    • Click the "Upload to Store" button
    • Select your downloaded .zip file

[Image: Marketplace Interface]
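Before starting the upload, it can help to confirm the marketplace address and check that the downloaded file is an intact archive. The snippet below is a small sketch of both checks; the helper names and the `https://foundry.example.com` host are hypothetical placeholders, and the only detail taken from the steps above is the `/workspace/marketplace` path.

```python
import zipfile
from pathlib import Path


def marketplace_url(enrollment_url: str) -> str:
    """Build the marketplace address from the enrollment's base URL."""
    return enrollment_url.rstrip("/") + "/workspace/marketplace"


def is_uploadable_package(zip_path: str) -> bool:
    """Return True if the path ends in .zip and is a readable ZIP archive."""
    path = Path(zip_path)
    return path.suffix.lower() == ".zip" and zipfile.is_zipfile(path)


# Placeholder host; substitute your own enrollment URL.
print(marketplace_url("https://foundry.example.com/"))
```

This only verifies that the file is a well-formed ZIP; whether the marketplace accepts its contents is still determined during upload and installation.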

Install the Package

After upload, you'll need to install the package in your environment. For detailed instructions, see the official Palantir documentation.

The installation process has four main stages:

  1. General Setup

    • Configure package name
    • Select installation location
  2. Input Configuration

    • Configure any required inputs. If no inputs are needed, proceed to the next step
    • Check project documentation for specific input requirements
  3. Content Review

    • Review the resources to be installed, such as Developer Console, the Ontology, and Functions
  4. Validation

    • The system checks for configuration errors
    • Resolve any flagged issues
    • Initiate installation