How would you tackle this problem? #851

ErikDz · 2021-01-21T20:24:31Z

ErikDz
Jan 21, 2021

Hi, I've spent a couple of weeks trying to figure out this problem, and the more I work on it, the less it seems that I'm headed in the right direction.

So here's the problem at hand:
I have exam papers as PDFs. I want to split them so that each question ends up in its own document together with its answer.
What's the big problem?
Telling the program where to place the crop so that it captures the entire question.

What have I done so far?

Converted the whole PDF to text
Tried to find the beginning of each question by searching for a number + letter + ")"
Built an array in which each element is the text of one question
Searched the PDF for the first 100 characters of each question to find its coordinates
Cropped
But that doesn't quite work...

So how would you tackle this problem?
I've attached sample documents onto this thread.
qp -> question paper
ms -> mark scheme

9608_s17_qp_32.pdf
9608_s17_ms_32.pdf

Answered by JorjMcKie


JorjMcKie · 2021-01-22T00:07:10Z

JorjMcKie
Jan 22, 2021
Maintainer

How much effort this takes depends on how many more details you know about the document; that is true for all such PDFs.
Once you know them, use the more sophisticated text extraction variants to check for these details. For example, it seemed to me that all question numbers are bold and start at a certain horizontal position.
I chose the getText("dict") output to formulate the following snippet:

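>>> import fitz  # PyMuPDF
>>> doc = fitz.open("9608_s17_qp_32.pdf")
>>> for page in doc:
    # flags=0: plain text spans only
    for b in page.getText("dict", flags=0)["blocks"]:
        for l in b["lines"]:
            for s in l["spans"]:
                # a question number is bold, near the left margin, and starts with a digit
                if "Bold" in s["font"] and s["bbox"][0] < 50 and s["text"][:1].isdecimal():
                    print("Question %s starts on page %i, left: %g, top: %g." % (s["text"], page.number, s["bbox"][0], s["bbox"][1]))
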
Question 1  starts on page 1, left: 49.6063, top: 64.4099.
Question 2  starts on page 4, left: 49.6063, top: 64.4099.
Question 3  starts on page 5, left: 49.6063, top: 519.469.
Question 4  starts on page 7, left: 49.6063, top: 64.4099.
Question 5  starts on page 9, left: 49.6063, top: 64.4099.
Question 6  starts on page 11, left: 49.6063, top: 64.4099.
>>>
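
If the positions are needed later rather than just printed, the same test could be wrapped in a small helper. This is only a sketch: find_question_starts and its left_limit parameter are hypothetical names, and the function works on plain dictionaries shaped like the getText("dict") output used above, so it can be tried without opening a PDF. The font names in the sample are made up for illustration.

```python
def find_question_starts(page_dicts, left_limit=50):
    """Collect (label, page_number, left, top) for spans that look like question numbers.

    page_dicts: iterable of (page_number, text_dict) pairs, where text_dict has
    the blocks -> lines -> spans layout used in the snippet above.
    """
    starts = []
    for page_number, text_dict in page_dicts:
        for block in text_dict["blocks"]:
            # blocks without text carry no "lines" key, so fall back to an empty list
            for line in block.get("lines", []):
                for span in line["spans"]:
                    text = span["text"]
                    left, top = span["bbox"][0], span["bbox"][1]
                    # same test as above: bold, near the left margin, starts with a digit
                    if "Bold" in span["font"] and left < left_limit and text[:1].isdecimal():
                        starts.append((text.strip(), page_number, left, top))
    return starts


# Hand-made sample data (hypothetical fonts and coordinates)
sample_page = {
    "blocks": [
        {"lines": [{"spans": [
            {"font": "Arial-BoldMT", "bbox": (49.6, 64.4, 60.0, 76.0), "text": "1 "},
            {"font": "ArialMT", "bbox": (70.0, 64.4, 300.0, 76.0), "text": "2 is not a question"},
        ]}]},
        {"type": 1},  # a block without "lines"
    ]
}

found = find_question_starts([(1, sample_page)])
```

With a real document, each pair would be (page.number, page.getText("dict", flags=0)).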

At least in this case, all six question starts have been located successfully. A question ends:

on the page where the next question starts, if that question's top is > 65 (so e.g. Q 2 ends on page 5, where Q 3 starts)
on the page before the next question's start: this applies to all other questions except the last one
on the page before the first page containing the "BLANK PAGE" text: this applies to the last question.
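
These three rules could be sketched in plain Python as follows. This is a minimal sketch under assumptions: Start, question_ranges, PAGE_TOP and the blank_page argument are hypothetical names, the start positions are copied from the output above, and the page holding the "BLANK PAGE" text has to be found separately and passed in.

```python
from typing import List, NamedTuple, Optional, Tuple


class Start(NamedTuple):
    label: str   # question number
    page: int    # page number as printed by the snippet
    top: float   # top coordinate of the question number


PAGE_TOP = 65  # a question whose top is above this value starts mid-page

# Start positions taken from the output in this thread
STARTS = [
    Start("1", 1, 64.4099),
    Start("2", 4, 64.4099),
    Start("3", 5, 519.469),
    Start("4", 7, 64.4099),
    Start("5", 9, 64.4099),
    Start("6", 11, 64.4099),
]


def question_ranges(
    starts: List[Start], blank_page: int
) -> List[Tuple[str, int, int, Optional[float]]]:
    """Return (label, first_page, last_page, cut_y) for every question.

    cut_y is the vertical position where the question ends on its last page,
    or None when the question runs to the bottom of that page.
    """
    ranges = []
    for i, cur in enumerate(starts):
        if i + 1 < len(starts):
            nxt = starts[i + 1]
            if nxt.top > PAGE_TOP:
                # next question starts mid-page: this one ends there
                last_page, cut_y = nxt.page, nxt.top
            else:
                # next question starts at the top of its page
                last_page, cut_y = nxt.page - 1, None
        else:
            # last question: ends before the first "BLANK PAGE" page
            last_page, cut_y = blank_page - 1, None
        ranges.append((cur.label, cur.page, last_page, cut_y))
    return ranges
```

For example, question_ranges(STARTS, blank_page=n) reports that Q 2 runs from page 4 to page 5 and is cut at the top of Q 3; n stands for whatever page the "BLANK PAGE" search returns for the document.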

2 replies

ErikDz Jan 22, 2021
Author

That is absolutely genius!
Thank you so very much. I was unaware of the feature to get the text as a dictionary. I learn something new every day!

Again, I'm very thankful for the answer, since this has been such a big problem for so long, and your answer is perfect. This will help so many students!

JorjMcKie Jan 22, 2021
Maintainer

Glad you like it. And I am sure you will discover more useful features going forward.
Have fun with PyMuPDF and never hesitate to ask questions!
