-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DSS-3207: added memory setting usage parameter @ dss-pades-pdfbox #177
base: develop
Are you sure you want to change the base?
Conversation
…or pdfbox and openpdf
Done. Actually, I have changed some PDFBoxUtil's methods' interface.
Done, however the mapping cannot be 1-to-1 due of different level of underlying support
I tried.. I propose (and I can do in another ad-hoc PR) to configure spotless to also format the code
Done but I am not satisfied of the results. Often the memory consumption isn't reliable and test may fail (I commented the asserts). Anyway the activity is fine, for example If I run test with small max heap value the memory only setting test fails instead the file one succeed. P.S. I run the spotless maven goal and it come out many files were without the license header. |
@bsanchezb can you have a look, please? Maybe there are too changes.. do I need to remove the license commit? |
Dear Andrea, Thank you for the PR, it looks very good. I have added a few comments directly in the code, could you please take a look and address the issues? I would also highly appreciate, if you could add a documentation about the near future, within the cookbook, just after the DSSResourcesHandler chapter. KR, |
@bsanchezb I do not see any comments, have you submitted a review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please start from using PdfMemoryUsageSetting directly in PdfBoxDocumentReader/ITextDocumentReader constructors instead of implementation-specific classes. This should simplify the code and leave conversion to the very end.
FileDocument fileDocument = (FileDocument) dssDocument; | ||
this.pdDocument = PDDocument.load(fileDocument.getFile(), passwordProtection, memoryUsageSetting); | ||
} else if (dssDocument instanceof InMemoryDocument) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need a special handling for InMemoryDocument? Is generic dssDocument.openStream() not sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PDDocument.load(stream, ..) needs twice the memory so if it is already in memory I skip such loading making it more efficient.
You can check the code of PdfBox at org.apache.pdfbox.io.ScratchFile.createBuffer(InputStream input)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
if (ITextPdfMemoryUsageSetting.Mode.FULL.equals(memoryUsageSetting.getMode())) { | ||
this.pdfReader = new PdfReader(filenameSource, passwordProtection); | ||
} else { | ||
this.pdfReader = new PdfReader(new RandomAccessFileOrArray(filenameSource), passwordProtection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not the same, as new PdfReader(filenameSource, passwordProtection) ?
Should not be
new PdfReader(new RandomAccessFileOrArray(filenameSource, true, false), passwordProtection);
used instead?
I would even go further, and simplify the code to:
if (dssDocument instanceof FileDocument && ITextPdfMemoryUsageSetting.Mode.FULL.equals(memoryUsageSetting.getMode())) {
FileDocument fileDocument = (FileDocument) dssDocument;
String filenameSource = fileDocument.getFile().getAbsolutePath();
this.pdfReader = new PdfReader(filenameSource, passwordProtection);
} else {
try (InputStream is = dssDocument.openStream()) {
this.pdfReader = new PdfReader(is, passwordProtection);
}
}
considering the implementation of RandomAccessFileOrArray class.
I also find ITextPdfMemoryUsageSetting modes FULL and PARTIAL confusing. I would prefer FILE and MEMORY to be used instead,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reply starting from the last sentences.
There are 2 different but very related topics: loading mode (full or partial) and memory setting (file, memory).
Clearly, they are related because if the loading mode is full than all pdf has to be loaded in memory.
At best of my knowledge, each library offers different level of configuration among such topics.
For OpenPdf, as a conversational reference, take a look at: https://stackoverflow.com/questions/62164906/use-of-randomaccessfileorarray
Check by yourself the code of the PdfReader constructors regards the last loading line: readPdfPartial(); for RandomAccessFileOrArray or readPdf(); for the other ones.
PdfBox models it differently and inherently uses partial mode when file usage is requested.
Regards new RandomAccessFileOrArray(filenameSource, true, false)
won't be okay because will load all pdf into memory so as a full load (instead of partial load).
Anyway, thanks to your suggestion, I notice the com.lowagie.text.Document.plainRandomAccess attribute and the same semantic into the RandomAccessFileOrArray constructor.
In case of FileDocument AND partial loading I can set that flag to true so the content will be loaded without mapping memory.
I see here an issue of expressivity of the common model and the implementation specific ones. It might be desirable to config whatever combination but I guess that wrapping it is not a solution so just 'reasonable' combination can be provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, you can leave it for now. I will take a look later in more detail
* Load options | ||
*/ | ||
public enum Mode { | ||
FULL, PARTIAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change to FILE and MEMORY modes. I do not see how "PARTIAL" is possible with RandomAccessFileOrArray implementation. Do I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have commented elsewhere regards FULL/PARTIAL and FILE/MEMORY topics. Hope that clarify.
* MemoryUsageSetting's load options | ||
*/ | ||
public enum Mode { | ||
MEMORY, FILE, MIXED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add javadocs for different mode enums.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per other comments, the mapping is pretty implementation biased. Shall I write that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file essential for the unit test provided? Not possible to achieve the same result with "big_file.pdf" document? I would prefer to avoid committing this big document to the repository, if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done but I am not satisfied of the results. Often the memory consumption isn't reliable and test may fail (I commented the asserts).
Anyway the activity is fine, for example If I run test with small max heap value the memory only setting test fails instead the file one succeed.
Moreover, I think that the unit test should be about using properly the underlying libraries.. not checking their behaviour so unit testing also them.
Can you have a look here?
Test will often fail with the "big_file.pdf". As per my previous comment, I have left this PR in draft mode due of this open topic.
I think it is the wrong approach to verify the memory consumption. I propose to check that the PDF libs are called as supposed to be and not "unit testing" also their behaviour.
Please advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is the wrong approach to verify the memory consumption. I propose to check that the PDF libs are called as supposed to be and not "unit testing" also their behaviour.
Yes, we may follow this approach instead
* @throws IOException if an exception occurs | ||
* @throws eu.europa.esig.dss.pades.exception.InvalidPasswordException if the password is not provided or | ||
* invalid for a protected document | ||
*/ | ||
public PdfBoxDocumentReader(DSSDocument dssDocument, String passwordProtection, MemoryUsageSetting memoryUsageSetting) | ||
public PdfBoxDocumentReader(DSSDocument dssDocument, String passwordProtection, | ||
MemoryUsageSetting memoryUsageSetting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about propagating generic PdfMemoryUsageSetting till PdfBoxDocumentReader/ITextDocumentReader constructors, instead of using implementation specific classes? This would help to avoid calling PdfBoxUtils.getMemoryUsageSetting on every call of PdfBoxDocumentReader constructor. Additionally, this may help to avoid creation of OpenPdf specific class ITextPdfMemoryUsageSetting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay to update PdfBoxDocumentReader/ITextDocumentReader constructors to take PdfMemoryUsageSetting instead of specific implementation class.
Instead, it will not avoid the creation of the specific OpenPdf one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, it will not avoid the creation of the specific OpenPdf one.
Should be possible to avoid, but you may start the implementation and we will see after how we can adjust the OpenPdf implementation
* @return {@link DSSDocument} PNG screenshot | ||
*/ | ||
public static DSSDocument generateScreenshot(DSSDocument pdfDocument, int page) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep default simplified methods within PdfBoxUtils class, with an in-memory configuration of PdfMemoryUsageSetting loaded by default.
@@ -60,42 +63,42 @@ void init() { | |||
|
|||
@Test | |||
void generateScreenshotTest() { | |||
DSSDocument screenshot = PdfBoxUtils.generateScreenshot(sampleDocument, 1); | |||
DSSDocument screenshot = PdfBoxUtils.generateScreenshot(sampleDocument, 1, MemoryUsageSetting.setupMainMemoryOnly()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you keep old PdfBoxUtils methods, no need to modify the corresponding unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want I create both version of each method?
I won't keep only the old PdfBoxUtils methods since that will always load into memory and it is not the deal. Am I correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would be desirable to have overloaded methods, so simple as well as advanced configuration is possible.
If it becomes too many of variables, we may also think about creation of a separate class for screenshot generation, for instance using a builder or a factory design patterns.
import java.util.List; | ||
import java.util.stream.Collectors; | ||
|
||
import org.apache.commons.lang3.tuple.Pair; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid apache lang3 lib in the unit tests. It should be possible to count memory consumption for you use case "on the fly" within #doAllSigns method. That is the only used value, if I'm not mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is commented code that I left waiting your advice that's because...
Please have a look at the other comment about 'unit testing' the pdf libraries.
Hello @roy20021, Thank you for the comment. I did not know it was not visible to you. Now you should have the review available. KR, |
@bsanchezb I have commented your observations, can you see them? |
Hi Andrea, Yes, I can see them, thank you. I will take a look on them later, currently I do not have much time. Best regards, |
Hi @roy20021 , I did a quick review of your proposals. In general I agree with your arguments. You may try to implement the proposals with constructors and methods arrangement and we will see if it has any additional room for improvement. KR, |
Follow-up of #176