-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/add office files support #52
Feature/add office files support #52
Conversation
lib/rollup.config.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The
onwarn
function was added due to an issue encountered during the execution ofpnpm run build
. The problem arose from a Rollup warning stating that something had been rewritten to undefined. - The
rollup-plugin-polyfill-node
dependency was added to fixMissing global variable names
warning
Thanks for that PR :)
If that's included in the lib, I guess there's no reason to not take advantage of it 👍 |
lib/src/office/office-manager.ts
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a renamed copy of pdf/pdf-manager.ts
@@ -8,16 +8,18 @@ const cpuCount = Platform.isMobileApp ? 1 : require('os').cpus().length | |||
|
|||
const ocrBackgroundProcesses = cpuCount == 2 ? 1 : 2 | |||
|
|||
const officeBackgroundProcesses = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should the number of processes be counted? I have temporarily set a dummy value here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can leave it at 1.
It's a configurable value because initially I tried to balance the number of background workers between the OCR, PDF, and total number of available threads. The goal was to maximize the use of resources to extract files as quickly as possible.
It should probably be refactored, but in the meantime, 1
will be good enough.
Unfortunately this is not the case. I suggested using intermediate formats from which we can get markdown (with additional dependencies), the thing is that all the js libraries that convert office files directly to markdown do the same under the hood. At least the ones I could find. |
If you can manage to keep the URLs that's fine, but honestly I don't even know if they are handled correctly when extracting the text from PDFs either 🤷♂️ |
I'll try working with markdown in another PR then, if you don't mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a closer look at the rollup stuff, but other than that it looks good. Thanks again :)
lib/src/office/office-worker.ts
Outdated
onmessage = async evt => { | ||
try { | ||
let text = '' | ||
if (evt.data.extension == 'docx') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use triple equality where applicable (I think I missed a few myself, but it's cleaner with === :p)
lib/src/index.ts
Outdated
@@ -39,13 +42,19 @@ function isFileImage(path: string): boolean { | |||
) | |||
} | |||
|
|||
function isFileOffice(path: string): boolean { | |||
return ( | |||
path.endsWith('.docx') || path.endsWith('xlsx') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be .xlsx
with a dot, to be consistent
Added support for docx and xlsx files. This PR addresses #10.
Documents are now parsed as plain text, but such an approach results in a loss of hyperlinks. To solve this problem we could consider parsing these files to a markdown format instead, @scambier, what do you think?
Parsing to markdown could be achieved by parsing docx to html (the mammoth lib I've added supports this) and then converting to markdown (this requires another external dependency, but could be useful if we plan to support html).
As for xlsx files, we could convert them to a csv format (the sheetjs lib I'm using gets the plain text from its csv function anyway) and then convert them to md.