How would you tackle this problem? #851

Answered by JorjMcKie
ErikDz asked this question in Q&A
How much effort this takes depends on how much you know about the document's layout; that is true for all such PDFs.
Use the more detailed text extraction variants to check for those layout details. For example, it seemed to me that all question numbers are bold and start at a certain horizontal position.
I chose the getText("dict") output to formulate the following snippet:

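# doc is an already opened PyMuPDF document.
# getText is the legacy camelCase name; newer PyMuPDF versions call it get_text.
for page in doc:
    # flags=0 leaves image blocks out, so every block has a "lines" list
    for b in page.getText("dict", flags=0)["blocks"]:
        for l in b["lines"]:
            for s in l["spans"]:
                # bold span, near the left margin, text starting with a digit
                # ([:1] avoids an IndexError on an empty span text)
                if "Bold" in s["font"] and s["bbox"][0] < 50 and s["text"][:1].isdecimal():
                    # page.number is 0-based
                    print("Question %s starts on page %i, left: %g, top: %g." % (s["text"], page.number, s["bbox"][0], s["bbox"][1]))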

					
Ques…
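If you want to tune the heuristic without re-opening the PDF each time, the same filter can be written as a small function that works on the plain dictionary returned by getText("dict"). This is only a sketch: the function name find_question_starts, the max_left parameter and the sample data are made up for illustration, and the sample dictionary is a hand-built stand-in that contains just the keys the filter reads, not real output from the document in question. Unlike the snippet above, it strips surrounding whitespace from the span text before testing the first character.

```python
def find_question_starts(page_dict, max_left=50):
    """Return (text, left, top) for bold spans near the left margin that start with a digit."""
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if not text:
                    continue  # skip empty or whitespace-only spans
                left, top = span["bbox"][0], span["bbox"][1]
                if "Bold" in span["font"] and left < max_left and text[0].isdecimal():
                    hits.append((text, left, top))
    return hits


# Hypothetical stand-in for one page's "dict" output (values invented).
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"spans": [
                    {"text": "1.", "font": "Helvetica-Bold", "bbox": (36.0, 72.0, 48.0, 84.0)},
                    {"text": "What is a PDF?", "font": "Helvetica", "bbox": (54.0, 72.0, 140.0, 84.0)},
                ]},
                # starts with a digit but is not bold
                {"spans": [
                    {"text": "2020 was a leap year.", "font": "Helvetica", "bbox": (36.0, 90.0, 150.0, 102.0)},
                ]},
                # bold and starts with a digit, but too far to the right
                {"spans": [
                    {"text": "3 marks", "font": "Helvetica-Bold", "bbox": (400.0, 90.0, 440.0, 102.0)},
                ]},
            ],
        },
        {"type": 1},  # image block without "lines"
    ]
}

print(find_question_starts(sample_page))
```

With a real document you would pass page.getText("dict", flags=0) for each page instead of sample_page. Keeping the threshold as a parameter makes it easy to adjust once you have looked at the actual left coordinates of a few question numbers.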

Answer selected by ErikDz