open parse seems missing some blocks within pdf file #40

DinoLiww · 2024-04-29T04:42:52Z

Initial Checks

I confirm that I'm on the latest version

Description

Hi there,

First of all, thanks for Open Parse; it works well most of the time.
However, when I apply it to my real-world tasks, some problems come up.
When I run openparse_quickstart.ipynb to parse the attached PDF files (they are POs, actually),
Open Parse seems to miss some blocks within the PDF files.
Please let me know how to proceed.

Thanks!
Dino

ASE.PDF

Amkor.PDF

Example Code

pdf = openparse.Pdf(basic_doc_path)
pdf.display_with_bboxes(
    parsed_basic_doc.nodes,
)

Python, open-parse & OS Version

running within colab


lngr · 2024-05-24T15:30:24Z

The default processing pipeline skips small blocks.
You need to customize the processing pipeline; in particular, adjust max_area_pct for RemoveFullPageStubs.

I'm using this pipeline with good results:

class MyIngestionPipeline(processing.IngestionPipeline):
    def __init__(self):
        self.transformations = [
            processing.RemoveTextInsideTables(),
            processing.CombineNodesSpatially(criteria="either_stub"),
            processing.CombineBullets(),
            processing.CombineHeadingsWithClosestText(),
            processing.RemoveFullPageStubs(max_area_pct=0.20),
        ]
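
To illustrate why the list of steps and their order matter, here is a minimal, self-contained sketch. It does not use openparse at all: nodes are plain strings, and `drop_short`, `merge_all`, and `run_pipeline` are hypothetical names made up for this example. It only assumes that a pipeline applies its transformations one after another.

```python
from typing import Callable, List

# A "node" here is just a string; a step maps a list of nodes to a new list.
Step = Callable[[List[str]], List[str]]


def drop_short(min_chars: int) -> Step:
    # Hypothetical step: discard nodes shorter than min_chars characters.
    return lambda nodes: [n for n in nodes if len(n) >= min_chars]


def merge_all(nodes: List[str]) -> List[str]:
    # Hypothetical step: join every node into one block.
    return [" ".join(nodes)] if nodes else []


def run_pipeline(nodes: List[str], steps: List[Step]) -> List[str]:
    # Apply each step in order; the output of one step feeds the next.
    for step in steps:
        nodes = step(nodes)
    return nodes


blocks = ["PO No.", "Ship to: Seoul", "Qty 500"]

# Filtering first loses the small "PO No." block before it can be merged.
filtered_then_merged = run_pipeline(blocks, [drop_short(7), merge_all])

# Merging first keeps the small block, because it is part of a larger one.
merged_then_filtered = run_pipeline(blocks, [merge_all, drop_short(7)])
```

In this toy version, small blocks survive only when a combining step runs before the step that filters them out, which is roughly the effect a custom pipeline is meant to achieve.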

Dinoliwww · 2024-05-25T02:58:34Z

Thank you lngr! It looks fine now.
I changed the demo code like this:

import openparse
from openparse import processing

class MyIngestionPipeline(processing.IngestionPipeline):
    def __init__(self):
        self.transformations = [
            processing.RemoveTextInsideTables(),
            processing.CombineNodesSpatially(criteria="either_stub"),
            processing.CombineBullets(),
            processing.CombineHeadingsWithClosestText(),
            processing.RemoveFullPageStubs(max_area_pct=0.20),
        ]

parser = openparse.DocumentParser(
    processing_pipeline=MyIngestionPipeline(),
    table_args={"parsing_algorithm": "pymupdf"},
)

# Raw string so the backslash in the Windows-style path is taken literally.
basic_doc_path = r".\ASE#1_KR_B.PDF"
parsed_basic_doc = parser.parse(basic_doc_path)

# display() is provided by notebook environments such as Colab.
for node in parsed_basic_doc.nodes:
    display(node)

Now it shows all the information I need to handle.

I still have 2 questions and hope to get your feedback:

1. What does max_area_pct=0.20 mean? I tried values from 0.1 to 0.9 and the result looks the same.
2. Can I only use table_args={"parsing_algorithm": "pymupdf"}? Sometimes the extracted product table information has sequence (ordering) problems.

Filimoa · 2024-05-28T20:02:37Z

@DinoLiww max_area_pct controls the maximum size of an element, expressed as a fraction of the page size. If no elements are getting dropped, it's because no element takes up more than 10% (0.1) of a page. This is mostly used to filter out pages that have massive text (like a title) that isn't helpful in a RAG pipeline.
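
As a rough illustration of the idea (not the library's actual code), the sketch below filters boxes by the fraction of the page they cover. `Box` and `drop_large_boxes` are hypothetical names invented for this example, and the real RemoveFullPageStubs step may apply additional conditions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box in page coordinates (not an openparse type)."""
    width: float
    height: float
    page_width: float
    page_height: float

    @property
    def area_pct(self) -> float:
        # Fraction of the page covered by this box, between 0 and 1.
        return (self.width * self.height) / (self.page_width * self.page_height)


def drop_large_boxes(boxes: list[Box], max_area_pct: float) -> list[Box]:
    # Keep only the boxes that cover less than max_area_pct of their page.
    return [b for b in boxes if b.area_pct < max_area_pct]


# Page size of 612 x 792 points (US Letter).
small = Box(width=200, height=50, page_width=612, page_height=792)   # about 2% of the page
title = Box(width=500, height=400, page_width=612, page_height=792)  # about 41% of the page

kept_at_20 = drop_large_boxes([small, title], max_area_pct=0.20)  # only `small` survives
kept_at_90 = drop_large_boxes([small, title], max_area_pct=0.90)  # both survive
```

With only small elements on a page, any threshold between 0.1 and 0.9 keeps the same set, which matches the observation that changing the value made no visible difference.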

As for your second point, can you elaborate?

DinoLiww added the bug label (Something isn't working) on Apr 29, 2024

lngr mentioned this issue on May 24, 2024, in: Missing parts of documents #9 (Open)
