open parse seems missing some blocks within pdf file #40

Open
1 task done
DinoLiww opened this issue Apr 29, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@DinoLiww

Initial Checks

  • I confirm that I'm on the latest version

Description

Hi there,

First of all, thanks for open-parse. It works well most of the time.
But when I bring my real-world tasks into open-parse, some problems come up.
When I run openparse_quickstart.ipynb to parse the attached PDF files (POs, actually),
open-parse seems to miss some blocks within the PDF files.
Please let me know how to proceed.

Thanks!
Dino
ase-missing-01
ASE.PDF
amkor-001
amkor-002
Amkor.PDF

Example Code

pdf = openparse.Pdf(basic_doc_path)
pdf.display_with_bboxes(
    parsed_basic_doc.nodes,
)

Python, open-parse & OS Version

running within colab
DinoLiww added the bug label Apr 29, 2024
@lngr
lngr commented May 24, 2024

The default processing pipeline skips small blocks.
You need to adjust the processing pipeline (see here): change max_area_pct for RemoveFullPageStubs.

I'm using this pipeline with good results:

class MyIngestionPipeline(processing.IngestionPipeline):
    def __init__(self):
        self.transformations = [
            processing.RemoveTextInsideTables(),
            processing.CombineNodesSpatially(criteria="either_stub"),
            processing.CombineBullets(),
            processing.CombineHeadingsWithClosestText(),
            processing.RemoveFullPageStubs(max_area_pct=0.20),
        ]

@DinoLiww

Thank you lngr! It looks fine now.
I changed the demo code like this:

import openparse
from openparse import processing

class MyIngestionPipeline(processing.IngestionPipeline):
    def __init__(self):
        self.transformations = [
            processing.RemoveTextInsideTables(),
            processing.CombineNodesSpatially(criteria="either_stub"),
            processing.CombineBullets(),
            processing.CombineHeadingsWithClosestText(),
            processing.RemoveFullPageStubs(max_area_pct=0.20),
        ]

parser = openparse.DocumentParser(
    processing_pipeline=MyIngestionPipeline(),
    table_args={"parsing_algorithm": "pymupdf"},
)

# Raw string so the backslash in the path is kept literally
basic_doc_path = r".\ASE#1_KR_B.PDF"
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    display(node)

Now it presents all of the info I need to handle.

But two questions remain, and I hope to get your feedback:

  1. What does max_area_pct=0.20 mean? I tried values from 0.1 to 0.9 and the output looks the same.
  2. Can I only use table_args={"parsing_algorithm": "pymupdf"}?
    Sometimes the product table info that gets extracted has ordering problems.

@Filimoa
Owner

Filimoa commented May 28, 2024

@DinoLiww max_area_pct controls the maximum size of an element. In this case it's based on a percentage of the page size. If you don't have elements that are getting dropped, it's because there are no elements that take up more than 10% (0.1) of a page. This is mostly used to filter out pages with massive text (like a title) that aren't helpful in a RAG pipeline.

As for your second point, can you elaborate?
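
To make the max_area_pct explanation above concrete, here is a minimal standalone sketch of an area-percentage filter. It is an illustration of the idea only, not open-parse's actual implementation: the `Box` class and the `keep_below_max_area` function are hypothetical names, and the real RemoveFullPageStubs step may apply additional conditions. It also shows why every threshold from 0.1 to 0.9 can give the same result: if the largest element covers about 5% of the page, none of those thresholds drops anything.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical element bounding box in page units (not an open-parse type)."""
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def keep_below_max_area(boxes, page_width, page_height, max_area_pct):
    """Keep only boxes that cover less than max_area_pct of the page area."""
    page_area = page_width * page_height
    return [b for b in boxes if b.area / page_area < max_area_pct]


# A 612 x 792 page with three small elements; the largest covers ~5.2% of it.
boxes = [Box(200, 40), Box(300, 60), Box(500, 50)]

for pct in (0.05, 0.1, 0.5, 0.9):
    kept = keep_below_max_area(boxes, 612, 792, pct)
    print(f"max_area_pct={pct}: kept {len(kept)} of {len(boxes)}")
```

Under these assumptions only the 0.05 threshold removes an element, so on a document made of many small blocks, changing the value between 0.1 and 0.9 would not be expected to change the output.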
