-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce IBlock and ILettersBlock interfaces #864
base: master
Are you sure you want to change the base?
Conversation
@@ -0,0 +1,306 @@ | |||
namespace UglyToad.PdfPig.DocumentLayoutAnalysis.Export |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a copy of PageXmlTextExporter
with changes to make it more general. With this you pass a list of IBlocks and it generates the PageXml for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you exclude this class from the PR? This could be added back later on in a different PR if need be. Ideally, we want to avoid creating different versions of classes that do the same thing (PageXmlGeneralExporter
and PageXmlTextExporter
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will Raise a separate PR and can discuss on that
@@ -0,0 +1,306 @@ | |||
namespace UglyToad.PdfPig.DocumentLayoutAnalysis.Export |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you exclude this class from the PR? This could be added back later on in a different PR if need be. Ideally, we want to avoid creating different versions of classes that do the same thing (PageXmlGeneralExporter
and PageXmlTextExporter
)
/// <summary> | ||
/// Interface for classes with a bounding box | ||
/// </summary> | ||
public interface IBlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change IBoundingBox
to IBlock
? Let's revert back to IBoundingBox
as the intent is clearer
/// </summary> | ||
TextOrientation TextOrientation { get; } | ||
|
||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you trying to achieve here by adding the letters collection? Letters
collection should only be at Word
level.
Having a collection of Letters
at Letter
level does not make sense and is counterintuitive... Similar can be said for TextBlock
and TextLine
classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the reading orders use TextSequence
to order words. This exposes the letters so we can get that.
"Having a collection of Letters at Letter level does not make sense and is counterintuitive... ". Goal there was to make Letters behave similar to other text blocks. But your comment makes sense so I'll change the Letter
class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be able to compute a TextSequence
value at TextBlock
and TextLine
level, exposing IReadOnlyList<Letter>
is not the solution (as such, the ILettersBlock
interface does not make sense to me). You need to find another way to compute the TextSequence
value...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ILettersBlock
interface represents a collection of letters. Anyone can implement their own class that follows this contract as long as that class represents a collection of letters. E.g You can create a Paragraph
class that implements this interface. It would make sense then to expose the Letter
's that form the collection
For issue #855 there are effectively 2 types of behaviour in the reading orders. Using the TextSequence
of letters or using the BoundingBox
in order to determine reading order.
The implementation for when the BoundingBox
is used is to create an IBoundingBox
interface and use the BoundingBox
property.
For the TextSequence
there are 3 choices I tried:
- Compute the Average TextSequence in the Reading Orders. This was currently done in multiple places. To reflect the IBoundingBox changes in the next PR this would mean introducing a generic function that checks type and if type is
TextBlock
TextLine
orWord
then return calculate the average TextSequence. This option is more complex (generics can be painful) and limits theRenderingReadingOrderDetector
to only work with the above classes.UnsupervisedReadingOrderDetector
also has an option to use RendingOrder which would be limited to the classes above. ILettersBlock
with a fieldAverageTextSequence
. This works but you force every implementer ofILettersBlock
to implementAverageTextSequence
. Not necessarily useful for anything but the ReadingOrderDetectorsILettersBlock
exposeIReadOnlyList<Letter>
This felt like the best solution asRenderingReadingOrderDetector
andUnsupervisedReadingOrderDetector
code was simpler (if it has letters property then we know we can calculate the average text sequence), this interface then represents a collection of Letters and there's tons of fun stuff library users can do with that :). And it's consistent with the use of a property on the interface as we do when ordering by Bounding Boxes.
With option 3
UnsupervisedReadingOrderDetector
looks like this with generics.
public IEnumerable<TBlock> Get<TBlock>(IEnumerable<TBlock> blocks)
where TBlock : IBoundingBox
{
// Ordered by is a stable sort: if the keys of two elements are equal, the order of the elements is preserved
var ordered = blocks.OrderBy(b => GetAverageTextSequenceOrDefaultToZero(b));
//..
return ordered;
}
private double GetAverageTextSequenceOrDefaultToZero<TBlock>(TBlock block)
where TBlock : IBoundingBox
{
if (block is ILettersBlock textBlock)
{
return textBlock.Letters.Average(x => x.TextSequence);
}
return 0;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think the approach you want to take is too generic. For example
UnsupervisedReadingOrderDetector
only makes sense atTextBlock
level and should stay the same (see paper is used at the time). This does not mean we could not haveReadingOrderDetector
at other levels. I do agree with the need of aIBoundingBox
interface though.
Nothing prevents you to do the following:
public interface IReadingOrderDetector<T> where T: IBoundingBox
{
IEnumerable<T> Get(IReadOnlyList<T> textBlocks);
}
[...]
public class RenderingReadingOrderDetector : IReadingOrderDetector<TextBlock>
{
[...]
}
[...]
public class UnsupervisedReadingOrderDetector : IReadingOrderDetector<TextBlock>
{
[...]
}
One mistake (I believe) I did at the time was to set a reading order property at TextBlock
level (block.SetReadingOrder(readingOrder++);
). This property is in fact not needed and is redundant (DefaultReadingOrderDetector
doesn't even sets it).
- Also, exposing these
Letter
collection everywhere means you are allocating a lot of arrays, which is something we really want to avoid (see previous PRs about optimisation by @iamcarbon):
words.SelectMany(w => w.Letters).ToList().AsReadOnly();
All that being said, I feel we moved too far away from the original problem raised in #844. In line with what was described in the issue, I too believe the issue is with
I guess this issue is due to: XYLeaf.GetLines use Words.Groupy(x => x.BoundingBox.Bottom) is not robust if words baselines/bottoms have some tiny differences.
I hope you won't take my feedback too harshely as you've already contributed greatly to the repo, but I'm not willing to merge the PR as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues with your feedback. It's a productive discussion as several ways to implement this feature. I had some of the questions you raised when implementing so happy to discuss them with someone rather than just me :)
The issue in question here is #855. #844 will be raised on a separate PR but depends on this for some of the refactoring I performed.
Your comments:
Approach too generic
a. UnsupervisedReadingOrderDetector
->
- I added some tests at the
Word
level and it doesn't make sense. But UnsupervisedReadingOrderDetector does make sense at theTextLine
level, and would make sense if someone creates blocks that are a combination ofTextBlocks
and something else (Example is I've used it for my own custom class is a wrapper around aTextBlock
or aTable
from TabulaSharp which you wrote). So I would argue that it is generic enough to be used at any level apart from the word level (Any class that implementsIBoundingBox
) - Secondly the interface should be generic enough that someone can contribute to the library their own reading detector orders for
Word
's. For example a words in a line reading order detector. In the follow on PR I added something like that.
Nothing prevents you to do the following:
I tried that with the initial implementation. Following the argument in a) that is should be generic enough then we have to include the T parameter in the class definition otherwise we have multiple UnsupervisedReadingOrderDetector
one for each type and if a user has their own custom type things get funky... . The problem I ran into here is that this breaks compatabilty with existing code using the library as you need add the T parameter when constructing:
public class RenderingReadingOrderDetector<T> : IReadingOrderDetector<T>
{
[...]
}
public class UnsupervisedReadingOrderDetector<T>: IReadingOrderDetector<T>
{
[...]
}
var it = new RenderingReadingOrderDetector<TextBlock>()
Using a generic on the method keeps compatibilty with existing code.
c. SetReadingOrder -> I kept it in for compatibilty sake but happy to take it out.
Also, exposing these Letter collection everywhere means you are allocating a lot of arrays, which is something we really want to avoid (see previous PRs about optimisation by
I'm happy to change it to a lazy property that computes when fetched. That would remove the performance issue
public IReadOnlyList<Letter> Letters => words.SelectMany(w => w.Letters).AsReadOnly();
It can also be cached in a field after computed the first time to make it more efficent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BobLd Just wanted to follow up on above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davebrokit I'm still here, I just don't have the time short term to focus on this area of PdfPig. I'll try to make time soon to have another look at how to address the points we discsussed above
/// <summary> | ||
/// This letter as as List of Letters in order to implement ILettersBlock interface | ||
/// </summary> | ||
public IReadOnlyList<Letter> Letters => [this]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see previous comment
@BobLd are we waiting on any further changes for this PR? |
@EliotJones I'd like to leave this PR open as the point addressed here are relevant. I won't have time short term to focus on this part of the library as i'd like to focus on bug fixes and implementing missing parts of the pdf specifications |
It'll be great to get some progress on this PR. I've responded to your comments @BobLd on the changes you requested: Done:
Todo (was waiting for response to my comment before making the change):
The contentious issue is should It'll be great to get a response as this is functionality I'm using and the PR has been hanging around for 3 months 😞 I also have 2 follow on PRs that are dependent on this: One to address the line segementer issue raised in #844 and one to change the Reading order detectors that finish off this issue #855. The code has been written for them, but I haven't raised them yet as changes may need to be made following the outcome of this discussion. |
Part of #855