Unicode enhancement #90
base: master
…ns (load-error-handling etc) removed, because they don't exist on Ubuntu 14.04
Horst, Thanks for your pull request. A few thoughts:
Thanks for your interest - much appreciated!
On 20.09.2015 at 20:32, Andrew Ferrier wrote:
…a-error-handling added, bugfix: plain text got an html.escape to display '<'
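To illustrate the plain-text bugfix named in the commit title above, here is a minimal sketch of escaping a text/plain body before wrapping it in HTML. The function name `plain_text_to_html` and the `<pre>` wrapper are assumptions made for illustration and are not taken from the email2pdf source; only `html.escape` is mentioned in the commit.

```python
import html


def plain_text_to_html(text):
    """Wrap a plain-text body in minimal HTML (hypothetical helper).

    html.escape turns '<', '>' and '&' into entities, so a literal '<'
    in the mail body is displayed instead of being parsed as a tag.
    """
    escaped = html.escape(text)
    return "<html><body><pre>" + escaped + "</pre></body></html>"


if __name__ == "__main__":
    print(plain_text_to_html("if a < b then reply to <someone@example.com>"))
```

Without the escape step, an HTML renderer would most likely treat `<someone@example.com>` as an unknown tag and drop it from the output.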
The topic is also mentioned here: #34. Now one of the .eml files I ran the script on also produced:
and I'm trying to work out the most efficient way to fix it. I ran eml2pdf from a bash script on a collection of files in a directory and noticed that some of the resulting PDFs had correct Unicode while others were garbled. Then I added --encoding to the command, and that changed which files came out right and which came out wrong. The contents of the directory look like this:
So digging deeper:
What a mess. Does this script read these values out? I think Thunderbird does when you print to PDF. Now I'm even wondering whether it is wkhtmltopdf, not eml2pdf, that is failing to read the charset.
Actually, I tried running the script on the .eml files individually, first grepping the charset out of each .eml file, and that still did not help. Printing to a PDF file from Thunderbird does work, and who knows what magic Thunderbird uses to find out the encoding.
https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset
There is even more to it: https://stackoverflow.com/questions/39235436/python-auto-detect-email-content-encoding
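For anyone who wants to look at these values per message part rather than grepping the file, here is a rough sketch using only the standard library `email` package. The helper name `describe_parts` and the sample message are made up for illustration; this is not what email2pdf or Thunderbird does internally.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def describe_parts(raw_bytes):
    """Return (content type, charset, transfer encoding) for each leaf part."""
    msg = email.message_from_bytes(raw_bytes)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body text of their own
        rows.append((
            part.get_content_type(),
            part.get_content_charset(),           # None if no charset is declared
            part.get("Content-Transfer-Encoding"),
        ))
    return rows


def build_sample():
    """Build a made-up message whose two parts declare different charsets."""
    msg = MIMEMultipart("alternative")
    msg.attach(MIMEText("Grüße", "plain", "iso-8859-1"))
    msg.attach(MIMEText("<p>Grüße</p>", "html", "utf-8"))
    return msg.as_bytes()


if __name__ == "__main__":
    for row in describe_parts(build_sample()):
        print(row)
```

A single .eml file can declare a different charset in every part, which may be why one global --encoding value fixes some files and breaks others.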
I give up for now and just need to urgently get the task done, so I will use Thunderbird and its GUI manually for all the files, but maybe someone will post a solution or fix one day. I don't envy @andrewferrier's task of working out all these encodings, charsets and Content-Transfer-Encodings. Apart from the Unicode issue the script seems to work perfectly.
The main part of this pull request enhances the Unicode support. The changes mainly consist of using the correct .encode() and .decode() calls. You might be surprised by the function _decodetxt(), which I have taken from my own project. The main reason for it is that payload.get_payload(decode=True) simply does not return Unicode. That's the "it's not a bug, it's a feature" thing: the maintainer of the Python email module believes it has to be handled this way for backwards compatibility. So "decode=True" is switched off, and the _decodetxt() function returns UTF-8.
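To make the problem concrete, here is a minimal sketch of the general technique. It is not the `_decodetxt()` from this pull request: it keeps `decode=True` to undo the transfer encoding and then decodes the resulting bytes with the charset the part declares. The name `decode_text_part` and the UTF-8 fallback are assumptions.

```python
from email.mime.text import MIMEText


def decode_text_part(part, fallback="utf-8"):
    """Return the body of a text part as str (hypothetical helper).

    get_payload(decode=True) only reverses base64 / quoted-printable and
    returns bytes; the charset still has to be applied afterwards.
    """
    raw = part.get_payload(decode=True)
    if raw is None:  # multipart container, nothing to decode
        return ""
    charset = part.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: keep going, mark undecodable bytes
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    part = MIMEText("Grüße", "plain", "iso-8859-1")
    print(type(part.get_payload(decode=True)))  # bytes, not str
    print(decode_text_part(part))
```

Falling back with `errors="replace"` trades exactness for robustness: a mislabelled mail still converts, but undecodable bytes show up as replacement characters.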
The other, smaller change is an "--overwrite" command line option, so that email2pdf does not stop when the output file already exists.
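As a rough idea of how such a switch can be wired up with `argparse`, here is a sketch. Only the flag name `--overwrite` comes from this pull request; `build_parser` and `may_write` are hypothetical names, and the real email2pdf code may be organised differently.

```python
import argparse
import os


def build_parser():
    """Parser with only the flag discussed here (the rest is omitted)."""
    parser = argparse.ArgumentParser(description="sketch of an --overwrite switch")
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help="replace the output file if it already exists",
    )
    return parser


def may_write(path, overwrite):
    """True if writing to path is allowed under the chosen policy."""
    return overwrite or not os.path.exists(path)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(may_write("output.pdf", args.overwrite))
```

With `action="store_true"` the default stays False, so existing invocations keep refusing to clobber an existing file.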
Last but not least, I had to remove the wkhtmltopdf options '--load-error-handling' and '--load-media-error-handling' in line 351. These do not exist on Ubuntu 14.04.