This script extracts annotations (highlights, comments, etc.) from a PDF file and formats them as plain text.
The script uses colormath to identify the highlights' colors; see the wiki. The default template uses these colors to determine hierarchy and meaning.
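To illustrate the idea of mapping a highlight color to a name, here is a minimal sketch. It is not the script's actual code: it swaps colormath's color-difference computation for plain Euclidean distance in RGB so that it runs with the standard library only, and the palette and the function name `nearest_color_name` are hypothetical.

```python
# Hypothetical palette: names and RGB values (0.0 to 1.0) are illustrative only.
PALETTE = {
    "yellow": (1.0, 1.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "red": (1.0, 0.0, 0.0),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_color_name(rgb, palette=PALETTE):
    """Return the palette name whose RGB value is closest to rgb.

    Simplification: a perceptual metric (such as the Delta E measures
    that colormath provides) would generally match human judgment better.
    """
    return min(palette, key=lambda name: squared_distance(rgb, palette[name]))


if __name__ == "__main__":
    # A slightly off-yellow highlight still resolves to a palette name.
    print(nearest_color_name((0.98, 0.93, 0.10)))
```

A template could then use the resolved name, rather than the raw color value, to decide how to group or label a highlight.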
At present, the following annotations are supported:
- Highlights without an attached comment are output first, as "highlights" with just the highlighted text included.
- Highlights with an attached comment, and text annotations (not attached to any particular text/highlight) are output next, as "detailed comments".
- Underline, strikeout, and squiggly underline annotations are output last, as "Nits", with or without an attached comment. The intention of this is to easily separate formatting or grammatical corrections from more substantial comments about the content of the document.
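The grouping rules above can be sketched roughly as follows. This is an illustration, not the script's implementation: the `Annotation` class, its field names, and the functions `classify` and `group` are hypothetical; the subtype strings follow the annotation subtype names used in the PDF specification.

```python
from dataclasses import dataclass
from typing import Optional

# Markup subtypes that are treated as "nits", with or without a comment.
NIT_SUBTYPES = {"Underline", "StrikeOut", "Squiggly"}


@dataclass
class Annotation:
    """Hypothetical container for one extracted annotation."""
    subtype: str                   # e.g. "Highlight", "Text", "Underline"
    page: int
    text: Optional[str] = None     # highlighted/underlined text, if any
    comment: Optional[str] = None  # attached comment, if any


def classify(annot):
    """Return the output group for an annotation, or None if unsupported."""
    if annot.subtype in NIT_SUBTYPES:
        return "nits"
    if annot.subtype == "Highlight" and not annot.comment:
        return "highlights"
    if annot.subtype in ("Highlight", "Text"):
        return "detailed comments"
    return None


def group(annots):
    """Collect annotations into the three groups, in output order."""
    groups = {"highlights": [], "detailed comments": [], "nits": []}
    for annot in annots:
        key = classify(annot)
        if key is not None:
            groups[key].append(annot)
    return groups
```

Checking the nit subtypes first means an underline with a comment still lands in "nits", matching the "with or without an attached comment" rule.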
For each annotation, the page number is given, along with the associated (highlighted/underlined) text, if any. Additionally, if the document includes outlines (aka bookmarks), such as those generated by the hyperref package, those are also used to identify which section of the document the annotation refers to.
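As a rough illustration of how outlines can locate an annotation, the sketch below finds the last outline entry that starts on or before the annotation's page. It assumes outlines have already been reduced to a sorted list of `(start_page, title)` pairs; that representation and the name `section_for_page` are hypothetical, and the script itself may resolve positions more precisely than whole pages.

```python
import bisect


def section_for_page(outlines, page):
    """Return the title of the section containing page, or None.

    outlines: list of (start_page, title) tuples, sorted by start_page.
    """
    starts = [start for start, _title in outlines]
    # Index of the last outline entry whose start page is <= page.
    idx = bisect.bisect_right(starts, page) - 1
    return outlines[idx][1] if idx >= 0 else None


if __name__ == "__main__":
    outlines = [(1, "Introduction"), (4, "Methods"), (9, "Results")]
    print(section_for_page(outlines, 5))
```

Pages before the first outline entry have no section, so the function returns None for them.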
See the wiki for more information.
pip install pdfminer.six chardet six colormath Jinja2 pathlib
python setup.py install
pdf-highlights.py FILE.PDF [> OUTPUT]
My own setup:
- Python 3.6
- chardet (3.0.4)
- colormath (3.0.0)
- Jinja2 (2.10)
- pathlib (1.0.1)
- pdfminer.six (20170720)
- six (1.11.0)
There's a Jinja2 template you can adapt as you like. The script exposes the following data to the template:
- highlights annotations
- comments annotations
- editing annotations
- Author
- Title
See the wiki for more information.
Original author is Andrew Baumann. Thank you, Andrew!
This fork is maintained by Sascha A. Carlin.