The sidebar doesn't load when a pdf has one page only #203
Comments
Does this happen only for single-page PDFs, or for multi-page PDFs as well? |
Please provide publicly accessible test cases if possible. |
https://home.ttic.edu/~avrim/book.pdf This is the textbook that I am using. After some experimentation I realized that the issue is that the library you are using to make the PDF annotatable requires intense preprocessing (around 3-4 minutes for initial setup), and until the entire PDF has been preprocessed, neither the annotation sidebar nor the annotations show up. This makes sense, since the PDF is the core and the annotations build on it, but it becomes a really big issue when that setup time is required every time the open window or tab is changed. I tried the same document with nb1 and found that, because each page was rendered as a single image and the annotations were block-like (thus losing fine control), the rendering of each page was initiated only as and when required, making the process faster. I'm probably missing a lot of details since I don't know the code thoroughly, but I'd be happy to help out! |
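The "render each page only as and when required" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not NB1's or NB2's actual code: `makeLazyPageRenderer` and the `renderPage` callback are hypothetical names, and `renderPage` stands in for whatever really produces a page (an image in NB1, HTML in NB2).

```javascript
// Hypothetical helper: wraps a per-page render function so that each page is
// rendered the first time it is requested and reused afterwards.
function makeLazyPageRenderer(renderPage) {
  const rendered = new Map(); // pageNumber -> rendered output

  return function getPage(pageNumber) {
    if (!rendered.has(pageNumber)) {
      // First request for this page: do the expensive work now.
      rendered.set(pageNumber, renderPage(pageNumber));
    }
    return rendered.get(pageNumber);
  };
}
```

With this shape, opening a long document only pays for the pages the reader actually visits, which matches the faster start-up observed with nb1.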
Also, would it be possible to make the code of nb1 publicly available? I wasn't able to find it in the haystack repositories. |
NB1 is at https://github.com/nbproject/nbproject |
The right solution to the problem that you've identified is for NB to "process" the pdf (which nowadays means converting it to html for in-browser rendering) on the server once, and store it there, and deliver that HTML directly to the client at time of use, instead of the current approach of shipping the pdf to each client for processing at the time of use. There should be an issue for this but I can't find it; if it really isn't there we should add it @JumanaFM . |
Exactly, preprocessing is something that I think is happening on the client side, and if possible it should happen all at once during the PDF upload. It will probably save a lot of resources. On another note, I checked for this same issue with Mozilla's built-in PDF viewer and hypothes.is's PDF annotator as well (both being open source), but neither of them seems to have it. Any idea how they manage, and whether the same source code can be used? Mozilla's viewer doesn't have the ability to annotate and highlight, but other plugins based on it work pretty smoothly too. Also, thanks for the nb1 link! |
So far as I know every platform and browser has converged on use of the
pdf.js library to render pdf to html. I presume hypothesis is using
that library to do what I described. It's a simple matter of
programming to add this functionality to NB; we just haven't had the
resources for it.
|
Ahh, got it. But I briefly looked at the nb2 source code and you also use pdf.js. Sorry in advance if it's a basic question. |
Right; we use the same canonical library as everyone else. But we're running it every time on the client, when instead it really ought to be run once on the server. |
If you are looking to contribute this would be a very nice issue to work on. |
I'm planning to modify nb a bit for my own requirements and I really need to be able to work with big files for this. I'd love to contribute! |
I'd love for you to contribute back anything you think could be helpful to others. In particular this prerendering of large pdfs would be of great general benefit. NB1 did this (it rendered into images instead of html, but same idea). I take it you've already found the client and server code. We're active on the repo discussion and happy to help out if you need help understanding or finding specific things. |
Yup, I have already set up nb2 on my laptop, but my system kept crashing because of the local hosting. I think that, for some reason, nb2 does both pre-rendering and client-side rendering, as my local nb took twice as long as the hosted nb. Just a guess though. I'll start with figuring out how nb1 rendered images so that I can use that here. |
I recommend against the nb1 approach.
In NB1 we rendered PDFs to images, which loses all information about the
flow of text. That's why you can only highlight rectangles in NB1---the
lines don't exist at that point. For most applications, preserving the
text flow by rendering to HTML is far superior.
One special case is image annotation---there you *do* want to be able to
highlight and annotate specific rectangular parts of the image, since
there is no flowing text. You can do that in NB1 since any embedded
images also get flattened onto the pdf image. In contrast, right now in
NB2 you can only annotate all or none of the image. Fixing that is
also on the todo list.
You may be wondering why NB2 is *less powerful* than NB1 on these
issues; the answer is that NB2 is a dramatic improvement over NB1 in a
huge number of other directions, while what I've discussed are the
(only) two key sacrifices we made to get there.
In particular, the NB1 code is a complete nightmare. You won't find
anything of use there.
|
Totally agreed, especially the points about pdf to images. I initially wanted to use nb1 when that was the only option available, but then I realized that without fine control over the text, the context of the related question would be lost on the readers. This wouldn't have been a very big issue, and it was easy to work around, but it made me postpone my project. Nb2 did initially feel like more of a frontend modification at the cost of speed, but as I went deeper, I realized that a lot of features were added, making it more user-friendly. But if I shouldn't even refer to the nb1 code, where is a good place to start? |
Are you asking specifically about how to tackle server side rendering in nb2? |
Yes. Maybe some resource or something I can look into or something that already implements this well. |
This is how it's done on NB currently https://github.com/haystack/nb/blob/7f0e24a07db0b5de1f54c5d4f20114a14d994f73/public/nb_viewer.html |
At present, nb_viewer fetches the target pdf from the nb server, then uses the pdf.js library to convert it to html that nb can annotate. We should instead be using the same pdf.js library on the server, to convert the pdf to html there once, then save the resulting html in a suitable cache directory so that the html can be served on request. |
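The convert-once-then-serve-from-cache idea could look roughly like the sketch below. Everything named here is an assumption for illustration: `getRenderedHtml`, the cache directory layout, and especially `convert`, which stands in for a server-side wrapper around pdf.js that this thread does not specify. The real pdf.js API is promise-based, so an actual implementation would be asynchronous; the sketch is synchronous only to keep the caching logic easy to read.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: returns the HTML for a PDF, converting it only when no
// cached copy exists. `convert` takes the PDF bytes (a Buffer) and returns an
// HTML string; it is a placeholder for the real pdf.js-based conversion.
function getRenderedHtml(pdfPath, cacheDir, convert) {
  const pdfBytes = fs.readFileSync(pdfPath);

  // Key the cache on the PDF's content, so identical uploads share one entry.
  const key = crypto.createHash("sha256").update(pdfBytes).digest("hex");
  const cachedPath = path.join(cacheDir, `${key}.html`);

  if (fs.existsSync(cachedPath)) {
    // Cache hit: serve the stored HTML without touching the converter.
    return fs.readFileSync(cachedPath, "utf8");
  }

  // Cache miss: convert once and store the result for later requests.
  const html = convert(pdfBytes);
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(cachedPath, html, "utf8");
  return html;
}
```

The original PDF is only read here, never modified or removed, so it remains available if the HTML ever needs to be regenerated.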
really helpful, thanks! |
Why not just save the generated html file on the server, deleting the original pdf?
|
What I'm thinking is that once the professor uploads the file, the server converts it to an html file and saves that file for all later use. |
It seems that converting pdfs to html documents doesn't always work out, and most of the files have their own specific fonts, without which the file gets corrupted. |
We definitely don't want to *delete* the pdf, because there will be
some pdfs whose renderings will get better as newer versions of the
pdfjs library are released. But we should indeed be saving the
generated html file on the server.
Note that running pdfjs on the server to do the conversion should be
easy; it's a js library and our nodejs server is js based. I don't
know if pdfjs offers any warnings when it has trouble converting; if so
we should get those delivered back to the person who uploaded the pdf.
|
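One way to keep both points above in view (renderings may improve with newer pdfjs releases, and any conversion warnings should reach the uploader) is to store a small metadata record next to each cached HTML file. The sketch below is hypothetical throughout: `renderAndRecord`, `isStale`, and the `{ html, warnings }` return shape of `convert` are invented for illustration, and whether pdfjs reports warnings at all is left open in this thread. It also does not address whether re-rendering is safe for a document that already has annotations.

```javascript
// Hypothetical wrapper around a conversion step. `convert` is assumed to
// return { html, warnings }; if it reports no warnings, an empty list is used.
function renderAndRecord(pdfBytes, convert, converterVersion) {
  const { html, warnings = [] } = convert(pdfBytes);
  return {
    html,
    // Metadata to store alongside the cached HTML. The original PDF is kept
    // as well, so the document can be rendered again later.
    meta: { converter: "pdfjs", version: converterVersion, warnings },
  };
}

// A cached rendering counts as stale when there is no record for it, or when
// it was produced by a different converter version than the one installed.
function isStale(meta, installedVersion) {
  return !meta || meta.version !== installedVersion;
}
```

The `warnings` list is what would be shown back to the person who uploaded the pdf.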
Pdfs that cannot be converted are just as big a problem with the current system as they would be with server-side conversion---it's the same library either way. So we're no worse off doing the conversion server side. But such problematic pdfs are rare and getting rarer, because pdfjs is also the library that Firefox uses to render pdfs in the browser, so it gets lots of attention. Google Chrome uses a different conversion library, pdfium, for the same purpose. We could use that library instead of pdfjs if we decided it was more robust. Pdfium would have to run in a separate process since it isn't js based, but we could easily have our server invoke it at need, using for example this python wrapper. |
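Having the Node server invoke a non-JS converter in a separate process could be sketched as below, using Node's built-in `child_process.spawnSync`. The function name `convertWithExternalTool` is hypothetical, and the command and arguments are deliberately left as parameters: this thread does not say what a pdfium-based command line would look like, so none is assumed here. The sketch only assumes a tool that reads the PDF on standard input and writes HTML to standard output.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: runs an external converter as a child process, feeding
// the PDF bytes on stdin and returning whatever it prints on stdout.
function convertWithExternalTool(command, args, pdfBytes) {
  const result = spawnSync(command, args, {
    input: pdfBytes,
    maxBuffer: 64 * 1024 * 1024, // allow large HTML output
  });

  // `error` is set when the process could not be started or was cut off.
  if (result.error) {
    throw result.error;
  }
  if (result.status !== 0) {
    throw new Error(`converter exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout.toString("utf8");
}
```

A blocking call like this would stall the server while a large pdf converts, so a real deployment would more likely use the asynchronous `spawn` or a job queue; the synchronous form is used here only to keep the sketch short.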
Riiight, that makes sense. I'll try this. |
@JumanaFM sorry for bothering you again and again, but is there any documentation for pdf.js at all? No matter where I search, I can't seem to find any documentation for the library. The official docs point to links that are incomplete, and the only documentation that exists is user contributed and doesn't make a lot of sense (https://github.com/MeiKatz/pdfjs-docs/blob/master/README.md). What did you refer to for documentation? I don't mind switching to pdfium, but if I can, I'd prefer staying close to the source code. |
Not a bother, happy to help! The best resource is the official page: https://mozilla.github.io/pdf.js/ Another resource that might be helpful is hypothesis: https://github.com/hypothesis/pdf.js-hypothes.is |
It might be worth investigating online which of pdf.js and pdfium is
considered most robust, able to handle the most pdf weirdness, and
produces the best html.
All we do is invoke it for conversion, so the coupling to nb is very
light---so it would probably be quite easy to switch, though we would
need to keep using pdfjs for the legacy documents since we rely on the
converted html being the same every time.
|
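The constraint that legacy documents must keep using pdfjs, even if a different converter becomes the default, amounts to recording the converter per document and never changing it once set. A minimal sketch, assuming a hypothetical document record with a `converter` field and a hypothetical `pickConverter` helper (neither exists in NB as far as this thread shows):

```javascript
// Hypothetical document record: { id, converter }, where `converter` is set
// the first time the document is rendered and left unchanged afterwards.
function pickConverter(doc, preferredConverter) {
  // Documents that were already converted stay on the converter that produced
  // their HTML, because annotations rely on that HTML staying the same.
  // Only documents with no recorded converter get the current preference.
  return doc.converter || preferredConverter;
}
```

For example, `pickConverter({ id: 1, converter: "pdfjs" }, "pdfium")` keeps the legacy document on pdfjs, while a newly uploaded document with no recorded converter would get the new default.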
Thanks a lot!! I found a few more random resources, but the best docs are in the examples on the official page itself. Not a lot to go by, but you can get a brief overview. |
Sure, I'll look into comparing both too.
Currently in nbclient, if a pdf has only one page, the sidebar does not load.