Use Mozilla's PDF.js to extract OCR text #59

Draft
wants to merge 6 commits into
base: master

@figadore figadore commented May 4, 2024

Intended to address #21

I'm currently unable to effectively test changes to the code

@figadore figadore changed the title Pdfjs Use Mozilla's PDF.js to extract OCR text May 4, 2024
onmessage = async evt => {
const buffer = Uint8Array.from(decodedPlugin, c => c.charCodeAt(0))
await plugin.default(Promise.resolve(buffer))
onmessage = async path => {
Owner

Log it to make sure, but this should probably stay async evt =>. This call is triggered when you call worker.run() in pdf-manager.ts, so you should receive an event object of this shape:

{
  data: {
    path: string,
    name: string
  }
}
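
A minimal sketch of reading that payload defensively inside the worker. The `ExtractRequest` type and the `readRequest` helper are hypothetical names used for illustration, not code from this PR; only the `{ data: { path, name } }` shape comes from the comment above.

```typescript
// Hypothetical name for the payload described above.
interface ExtractRequest {
  path: string
  name: string
}

// Accepts anything shaped like a MessageEvent and returns the validated payload.
// Throws if the handler was given a bare path instead of an event object.
function readRequest(evt: { data: unknown }): ExtractRequest {
  const data = evt.data as Partial<ExtractRequest> | null
  if (!data || typeof data.path !== 'string' || typeof data.name !== 'string') {
    throw new TypeError('Expected evt.data to be { path: string, name: string }')
  }
  return { path: data.path, name: data.name }
}

// In the worker this would be wired up roughly as:
// onmessage = async evt => { const { path, name } = readRequest(evt) /* ... */ }
```

Failing loudly here makes it obvious when the handler signature and the caller in pdf-manager.ts disagree.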

Author

updated

@@ -47,8 +47,9 @@
"@apollo/utils.createhash": "^3.0.0",
"mammoth": "^1.6.0",
"p-queue": "^7.4.1",
"pdfjs-dist": "^4.2.67",
Owner

You probably don't need to add a dependency on pdf.js: it's already bundled in Obsidian, and you should be able to use it with something like this:

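const arrayBuffer = await app.vault.readBinary(file);
// pdfjsLib is the copy of pdf.js bundled with Obsidian; it is untyped, hence the ts-ignore
// @ts-ignore
const document = await window.pdfjsLib.getDocument(arrayBuffer).promise;
// pdf.js page numbers are 1-based
for (let i = 1; i <= document.numPages; i++) {
  const page = await document.getPage(i);
  // etc.
}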

But! The bundled version might cause issues so it's still worth trying with an external dependency 👍
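
As a rough sketch of what the `// etc.` part could look like: pdf.js pages expose `getTextContent()`, which resolves to a list of items whose `str` field holds the text. The interface names and the `extractPdfText` helper below are hypothetical; they only narrow the untyped document object to the members used here. Whether the version bundled in Obsidian behaves the same way is an assumption that would need checking.

```typescript
// Hypothetical structural types covering only the pdf.js members used below.
interface TextItemLike {
  str?: string // marked-content items have no `str`
}
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>
}
interface PdfDocumentLike {
  numPages: number
  getPage(pageNumber: number): Promise<PageLike>
}

// Concatenates the text of every page, one line per page.
async function extractPdfText(doc: PdfDocumentLike): Promise<string> {
  const pages: string[] = []
  // pdf.js page numbers start at 1, not 0
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const content = await page.getTextContent()
    pages.push(content.items.map(item => item.str ?? '').join(' '))
  }
  return pages.join('\n')
}
```

Taking the document as a parameter keeps the helper independent of where pdf.js comes from (bundled or external), so the same code can be tried with both.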

Author

@figadore figadore May 5, 2024

I'll try the bundled version first to reduce the number of variables. I'm having a hard enough time just getting a basic development/iteration workflow going.

Author

figadore commented May 5, 2024

@scambier I made a few changes and I think I got pdf.js working on a small number of files. I added a 10-second delay, but with a large number of files it still crashes within a few seconds, which makes me wonder if the problem is somehow related to the queueing (or maybe I just added the delay incorrectly).

Edit: I've either added the delay incorrectly or am not rebuilding/reloading correctly, since a dozen PDFs still process within a few seconds... I'll keep working.

Author

figadore commented May 6, 2024

Thanks @scambier for the help and guidance so far. The latest place I'm stuck at seems to be related to how omnisearch and text-extractor work together. At least that's my best guess so far.

My workflow to test my changes has been to quit Obsidian, load a bunch of PDFs into the vault directory, and re-open Obsidian. Once open, since Omnisearch has "PDFs content indexing" enabled, the developer console lists a whole bunch of entries like

Omnisearch - 0:39:280 - Generating IndexedDocument from attachments/Y2OOMETPKX3ZMEMJJYX5MHOHTIO3JVSP.pdf

If there are a large enough number of PDFs, Obsidian crashes. Otherwise it successfully extracts all the text from all the new PDFs. None of the debug messages that I placed in the pdf-worker.ts or pdf-manager.ts show up though, like it's somehow shortcutting the queue and calling the extraction library directly. When I right click on a file and choose "Text Extractor" -> "Extract Text to clipboard", however, then my debug log messages show up (and the pdf-worker script throws an exception about Uncaught ReferenceError: obsidian is not defined for the argument in the closure, but I'm guessing that's a separate issue)

Any thoughts? Am I on the right track in thinking Omnisearch may not be using the pdf queue mechanism?

Owner

scambier commented May 6, 2024

Omnisearch will get a list of all indexable files, and asynchronously convert them to IndexedDocuments.

This conversion is done in 3 different ways:

The queue management happens in Text Extractor: extractText() calls the manager, which uses the queue.

I'm quite confident it works as intended. I had several problems when spawning too many web workers to process the files: CPU usage went through the roof (hence this small trick to leave the CPU some room to breathe).
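
To illustrate the idea of capping how many extractions run at once, here is a minimal hand-rolled limiter. It is a stand-in for illustration only: Text Extractor lists p-queue in its dependencies, and this sketch does not reproduce its actual queue code. `createLimiter` is a hypothetical name.

```typescript
// Returns a function that runs async tasks with at most `concurrency` in flight.
function createLimiter(concurrency: number) {
  let active = 0
  const waiting: Array<() => void> = []

  // Start the next waiting task if a slot is free.
  const next = () => {
    if (active < concurrency && waiting.length > 0) {
      active++
      const start = waiting.shift()!
      start()
    }
  }

  return function run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      waiting.push(() => {
        // Free the slot before settling so queued tasks keep flowing.
        const done = () => {
          active--
          next()
        }
        // Promise.resolve().then(task) also catches synchronous throws from task.
        Promise.resolve()
          .then(task)
          .then(
            value => {
              done()
              resolve(value)
            },
            error => {
              done()
              reject(error)
            }
          )
      })
      next()
    })
  }
}
```

With a limit of 2, queuing a dozen PDFs would start two extractions immediately and begin each remaining one only when an earlier one settles.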

Author

figadore commented May 8, 2024

Progress report:

Gah, I wasted so many hours testing changes to the text-extractor repo without any of them showing up. I finally realized it was because of my original attempt at solving this by modifying the omnisearch plugin, back when I cloned #290. Those changes meant that nothing I tested in text-extractor had any effect on Omnisearch indexing/extracting (so it was sort of skipping the queue mechanism, just not for the reasons I guessed).

I also somehow missed that pdf-worker.ts is a "web worker", which has its own set of rules for sharing state, loading libraries, etc. I'm not able to use window.pdfjsLib or import { loadPdfJs } from 'obsidian'. I guess my next step is to figure out how to load either the bundled pdf.js lib in the worker, or load it as an external dependency. My assumption is that this is a rollup thing.

Owner

scambier commented May 8, 2024

mmmh I think PDF.js uses its own web worker(s), so you should be able to remove pdf-worker.ts and instead directly call PDF.js here.

Because yeah, web workers only take serializable data as input/output so they're kinda difficult to work with for anything that requires external dependencies :/
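
A small sketch of that constraint: `postMessage` copies its argument with the structured clone algorithm, so plain data and typed arrays get through while functions do not. This assumes a runtime that provides the global `structuredClone` (recent browsers and Electron, Node 17+); `canCrossWorkerBoundary` is a hypothetical helper name.

```typescript
// Returns true if `value` could be sent to or from a web worker via postMessage.
function canCrossWorkerBoundary(value: unknown): boolean {
  try {
    // structuredClone applies the same copying rules as postMessage.
    structuredClone(value)
    return true
  } catch {
    // Functions and other non-cloneable values raise a DataCloneError.
    return false
  }
}

// Plain data such as a file path and name is fine.
const plainRequest = { path: 'attachments/example.pdf', name: 'example.pdf' }
// Binary buffers are fine too.
const bytes = new Uint8Array([1, 2, 3])
// An object carrying a function (for example, a loaded library) is not.
const withFunction = { load: () => 'pdfjs' }

console.log(canCrossWorkerBoundary(plainRequest))
console.log(canCrossWorkerBoundary(bytes))
console.log(canCrossWorkerBoundary(withFunction))
```

This is why a worker can be handed a path or a byte buffer, but not an already-loaded library object from the main thread.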

2 participants