Hi, I've spent a couple of weeks trying to figure out this problem, and the more I work on it, the less it seems that I'm headed in the right direction. So here's the problem at hand: What have I done so far?
So how would you tackle this problem?
Replies: 1 comment 2 replies
A lot of your effort depends on how many details you know about the document; that is true for all such PDFs.

>>> for page in doc:
...     for b in page.getText("dict", flags=0)["blocks"]:
...         for l in b["lines"]:
...             for s in l["spans"]:
...                 if "Bold" in s["font"] and s["bbox"][0] < 50 and s["text"][0].isdecimal():
...                     print("Question %s starts on page %i, left: %g, top: %g." % (s["text"], page.number, s["bbox"][0], s["bbox"][1]))
...
Question 1 starts on page 1, left: 49.6063, top: 64.4099.
Question 2 starts on page 4, left: 49.6063, top: 64.4099.
Question 3 starts on page 5, left: 49.6063, top: 519.469.
Question 4 starts on page 7, left: 49.6063, top: 64.4099.
Question 5 starts on page 9, left: 49.6063, top: 64.4099.
Question 6 starts on page 11, left: 49.6063, top: 64.4099.
>>>

At least in this case, all six question starts have been located successfully. The end of a question is:
Use more sophisticated text extraction variants to check the details you know about the document. For example, it seemed to me that all question numbers are bold and start at a certain horizontal position.
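One way to check such details is to tally which fonts and left edges actually occur on a page. The sketch below is only an illustration: it assumes a page dictionary shaped like the output of PyMuPDF's `page.getText("dict")` (spelled `page.get_text("dict")` in newer releases), and the function name `summarize_spans` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_spans(page_dict):
    """Count non-empty text spans by (font name, rounded left edge).

    page_dict is assumed to have the "blocks" -> "lines" -> "spans"
    layout of PyMuPDF's "dict" text extraction output.
    """
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span["text"].strip():
                    key = (span["font"], round(span["bbox"][0]))
                    counts[key] += 1
    return counts


# Hypothetical sample standing in for one page of real output.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"font": "Helvetica-Bold",
                        "bbox": (49.6063, 64.4099, 56.0, 76.0),
                        "text": "1"}]},
            {"spans": [{"font": "Helvetica",
                        "bbox": (72.0, 64.4099, 300.0, 76.0),
                        "text": "Body text of the question"}]},
            {"spans": [{"font": "Helvetica",
                        "bbox": (72.0, 80.0, 300.0, 92.0),
                        "text": "More body text"}]},
        ]},
        {"type": 1},  # image block without text lines
    ]
}

summary = summarize_spans(sample_page)
for (font, left), n in summary.most_common():
    print(f"{font:20s} left={left:4d} spans={n}")
```

A font or left edge that occurs rarely and only near the margin is a candidate marker for question numbers; whether that holds has to be confirmed on the actual document.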
I chose the getText("dict") output to formulate the snippet that prints the question starts.
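The same test (bold font, left edge below 50, text starting with a digit) can also be wrapped in a function that returns the matches instead of printing them. This is a sketch, not the original code: the name `find_question_starts`, the `max_left` parameter and the sample pages are hypothetical, and the input is assumed to be `(page_number, page_dict)` pairs with the layout of PyMuPDF's "dict" output.

```python
def find_question_starts(pages, max_left=50):
    """Return (text, page_number, left, top) for each likely question start.

    pages: iterable of (page_number, page_dict) pairs, where page_dict
    follows the "blocks" -> "lines" -> "spans" layout.
    """
    starts = []
    for page_number, page_dict in pages:
        for block in page_dict.get("blocks", []):
            for line in block.get("lines", []):  # image blocks have no lines
                for span in line.get("spans", []):
                    text = span["text"].strip()
                    left, top = span["bbox"][0], span["bbox"][1]
                    # text[:1] avoids an IndexError on empty spans.
                    if ("Bold" in span["font"] and left < max_left
                            and text[:1].isdecimal()):
                        starts.append((text, page_number, left, top))
    return starts


def _span(font, left, top, text):
    """Build a minimal span dict for the hypothetical sample below."""
    return {"font": font, "bbox": (left, top, left + 10, top + 12), "text": text}


# Hypothetical sample pages. With a real document the pairs could come from
# something like: ((page.number, page.get_text("dict", flags=0)) for page in doc)
sample_pages = [
    (1, {"blocks": [{"type": 0, "lines": [
        {"spans": [_span("Helvetica-Bold", 49.6063, 64.4099, "1")]},
        {"spans": [_span("Helvetica", 72.0, 64.4099, "Some question text")]},
    ]}]}),
    (4, {"blocks": [{"type": 0, "lines": [
        {"spans": [_span("Helvetica-Bold", 49.6063, 100.0, "Note")]},  # no digit
        {"spans": [_span("Helvetica-Bold", 200.0, 300.0, "7")]},       # too far right
        {"spans": [_span("Helvetica-Bold", 49.6063, 519.469, "2")]},
    ]}]}),
]

starts = find_question_starts(sample_pages)
for text, page_number, left, top in starts:
    print("Question %s starts on page %i, left: %g, top: %g."
          % (text, page_number, left, top))
```

Returning the matches as tuples keeps the page number and position of each start available for later steps, such as deciding where a question ends.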