Parse text from PDF files using Python Functions. This guide covers how to convert PDF pages into images, preprocess those images to correct distortions such as skew, and extract text using OCR with Tesseract.
The packaged code repository uses several libraries, including cv2 (OpenCV), pytesseract, and pdf2image, to extract and process text from PDF attachments.
The first step is uploading your package to the Foundry Marketplace:
- Download the project's .zip file from this repository
- Access your enrollment's marketplace at: {enrollment-url}/workspace/marketplace
- In the marketplace interface, initiate the upload process:
  - Select or create a store in your preferred project folder
  - Click the "Upload to Store" button
  - Select your downloaded .zip file
After upload, you'll need to install the package in your environment. For detailed instructions, see the official Palantir documentation.
The installation process has four main stages:
- General Setup
  - Configure the package name
  - Select the installation location
- Input Configuration
  - Configure any required inputs. If no inputs are needed, proceed to the next step
  - Check the project documentation for specific input requirements
- Content Review
  - Review the resources to be installed, such as Developer Console, the Ontology, and Functions
- Validation
  - The system checks for any configuration errors
  - Resolve any flagged issues
  - Initiate installation