This repository has been archived by the owner on Dec 11, 2021. It is now read-only.

Unicode enhancement #90

Open
wants to merge 3 commits into
base: master


gpgmailencrypt

The main part of this pull request enhances the Unicode support. The changes mainly consist of using the correct .encode() and .decode() calls. You might be surprised by the function _decodetxt(), which I have taken from my own project. The main reason for it is that payload.get_payload(decode=True) simply does not return unicode. That's the "it's not a bug, it's a feature" thing: the maintainer of the Python email module believes he has to handle it this way for backwards compatibility. So "decode=True" is switched off, and the _decodetxt() function returns utf8 instead.

The other, smaller change is an "--overwrite" command line option, so that email2pdf does not stop working when the output file already exists.

Last but not least, I had to remove the wkhtmltopdf options '--load-error-handling' and '--load-media-error-handling' in line 351. These do not exist in Ubuntu 14.04.

@andrewferrier
Owner

Horst,

Thanks for your pull request. A few thoughts:

  • Regarding the --overwrite option: great idea, I've never needed it so hadn't added one, but I can see that would be useful to some folk besides you. Once I've cleaned up the commit a bit and added some unit tests for it, I'll merge this into email2pdf.
  • Regarding the removal of --load-error-handling and --load-media-error-handling: I don't think I'd like to remove these; they are crucial to working around issues with some types of broken emails, for example ones with broken embedded images. However, I have realised that I should be modelling those with unit tests, which I'm not (my unit test suite still passes despite your removal of these flags). I've opened issue #91 (Add unit tests to show benefit of --load-error-handling and --load-media-error-handling flags on wkhtmltopdf) to track that. I would suggest that you install the latest version directly from the wkhtmltopdf website, as per the install instructions, to avoid the error due to missing flags. It's not ideal, I know, but the version in 14.04 is very old.
  • As far as the Unicode support goes: this is the one that confuses me. You say payload.get_payload(decode=True) doesn't return Unicode, but it's been my experience that it does. The Python documentation seems to concur; it says here that it returns a string when is_multipart is False, and I think all Python strings in 3+ are Unicode. Could you please help me by being a bit more specific about what your _decodetxt() function is working around? I can't figure out from your code what you are doing. Do you have any references to the problem on the web? I think the best way to illustrate the issue would be to create a failing test which your code fixes; can you help me figure out how I would do that?

Thanks for your interest - much appreciated!

@gpgmailencrypt
Author

On 20.09.2015 at 20:32, Andrew Ferrier wrote:

Horst,

Thanks for your pull request. A few thoughts:

Regarding the |--overwrite| option: great idea, I've never needed
it so hadn't added one, but I can see that would be useful to some
folk besides you. Once I've cleaned up the commit a bit and added
some unit tests for it, I'll merge this into email2pdf.
Regarding the removal of the |--load-error-handling|: I don't
think I'd like to remove these; they are crucial to working around
issues with some types of broken emails, for example ones with
broken embedded images. However, I have realised that I should be
modelling those with unit tests, which I'm not (my unit test suite
still passes despite your removal of these flags). I've opened
issue #91 <https://github.com/andrewferrier/email2pdf/issues/91>
to represent that. I would suggest that you install the latest
version directly from the wkhtmltopdf website as per the install
instructions
<https://github.com/andrewferrier/email2pdf#debianubuntu> to avoid
that issue. It's not ideal, I know, but the version in 14.04 is
/very/ old.

Yes, I agree. I'm writing an encrypting email gateway, to which I am
currently adding encrypted PDF emails. After I made the pull request, I
installed my software on the production server (also Ubuntu 14.04).
Unfortunately, the Ubuntu version of wkhtmltopdf needs an installed X
server, which is unacceptable on a server.
So I had to switch to a newer wkhtmltopdf package, and
|--load-error-handling| reappeared.

As far as the Unicode support goes; this is the one that confuses
me. You say |payload.get_payload(decode=True)| doesn't return
Unicode, but it's been my experience that it does. The Python
documentation seems to concur; it says here
<https://docs.python.org/3/library/email.message.html#email.message.Message.get_payload>
that it returns a string when |is_multipart| is False, and I think
all Python strings in 3+ are Unicode. Could you please help me by
being a bit more specific about what your _decodetxt() function is
working around? I can't figure it out from your code what you are
doing. Do you have any references to the problem on the web? I
think the best way to illustrate the issue would be to create a
failing test which your code fixes; can you help me figure out how
I would do that?

This is a really difficult issue, and it cost me a lot of time. See
http://bugs.python.org/issue18271 for more information.
The code in _decodetxt() is more or less the original decode function,
except that it ensures that it always returns unicode.
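
To make the disagreement concrete: in Python 3, get_payload(decode=True) undoes the Content-Transfer-Encoding but hands back bytes, so the caller still has to apply the charset. The following is only a minimal sketch of that idea, not the _decodetxt() code from this pull request; the helper name decode_text_part and the latin-1 fallback are assumptions made for illustration.

```python
from email import message_from_bytes
from email.message import Message


def decode_text_part(part: Message, fallback: str = "latin-1") -> str:
    """Return the text of a non-multipart part as str (hypothetical helper)."""
    raw = part.get_payload(decode=True)  # bytes: transfer encoding undone
    if raw is None:  # multipart containers have no payload of their own
        return ""
    charset = part.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back without raising.
        return raw.decode(fallback, errors="replace")


# A small quoted-printable message declared as iso-8859-1.
RAW = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
)

msg = message_from_bytes(RAW)
payload = msg.get_payload(decode=True)  # still bytes at this point
text = decode_text_part(msg)            # str after applying the charset
```

Something along these lines could also serve as the starting point for the failing test requested above: assert on the type of payload, then on the decoded text.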

Thanks for your interest - much appreciated!



…a-error-handling added, bugfix: plain text got an html.escape to display '<'
@aktivkohle

aktivkohle commented Mar 16, 2021

The topic is also mentioned here: #34

Now one of the .eml files I ran the script on also produced:

Traceback (most recent call last):
  File "/usr/bin/email2pdf", line 733, in call_main
    (warning_pending, mostly_hide_warnings) = main(argv, syslog_handler, syserr_handler)
  File "/usr/bin/email2pdf", line 109, in main
    input_data = get_input_data(args)
  File "/usr/bin/email2pdf", line 261, in get_input_data
    data = input_handle.read()
  File "/home/user1/.virtualenvs/email2pdf_env/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 755: invalid start byte

and I'm trying to work out the most efficient way to fix it.
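
One possible way around this kind of traceback, sketched under the assumption that the failure comes from reading the file in text mode as utf-8, is to open the .eml file in binary mode and let the email package apply each part's own charset. This is not email2pdf's actual code; load_message and the sample message are made up for illustration.

```python
import email
import email.policy
import os
import tempfile

# A tiny 8bit message containing byte 0xfc (u-umlaut in windows-1252).
SAMPLE = (
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"M\xfcnchen\r\n"
)


def load_message(path):
    """Parse an .eml file from raw bytes (hypothetical helper)."""
    with open(path, "rb") as handle:  # binary mode: no implicit utf-8 decode
        return email.message_from_binary_file(
            handle, policy=email.policy.default
        )


with tempfile.TemporaryDirectory() as tmp:
    eml_path = os.path.join(tmp, "sample.eml")
    with open(eml_path, "wb") as out:
        out.write(SAMPLE)

    # Reading in text mode as utf-8 fails, as in the traceback above.
    try:
        with open(eml_path, encoding="utf-8") as handle:
            handle.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Parsing from bytes succeeds; get_content() applies the declared charset.
    body = load_message(eml_path).get_content()
```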

I ran email2pdf with a bash script on a collection of files in a directory and noticed that some of the resulting PDFs had correct unicode while others were garbled. When I added --encoding to the command, that switched around which ones were good and which were bad.

The contents of the directory looks like this:

$ ls
1.eml  2.eml  2.pdf  3.eml  4.eml  5.eml  6.eml

So digging deeper:

 $ grep -r charset .
./2.eml:	charset="iso-8859-1"
./6.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
./1.eml:	charset="iso-8859-1"
./1.eml:	charset="iso-8859-1"
./1.eml:charset=3Diso-8859-1">
./5.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
./3.eml:Content-Type: text/plain; charset=UTF-8
./4.eml:Content-Type: text/plain; charset=windows-1252; format=flowed

What a mess. Does this script read these values? I think Thunderbird does when you print to PDF. Now I'm even wondering whether it is wkhtmltopdf, not email2pdf, that is not reading the charset.

Actually, I tried running the script on the .eml files individually, first grepping the charset out of each .eml file, and even that did not help! Printing to a PDF file from Thunderbird does work, and who knows what magic Thunderbird is doing to find out the encoding.

https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset

There is even more to it:

https://stackoverflow.com/questions/39235436/python-auto-detect-email-content-encoding

$ grep -r Content-Transfer-Encoding .
./2.eml:Content-Transfer-Encoding: quoted-printable
./6.eml:Content-Transfer-Encoding: 8bit
./1.eml:Content-Transfer-Encoding: quoted-printable
./1.eml:Content-Transfer-Encoding: quoted-printable
./5.eml:Content-Transfer-Encoding: 8bit
./3.eml:Content-Transfer-Encoding: quoted-printable
./4.eml:Content-Transfer-Encoding: 8bit
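
The two grep results above are per-part properties, so a single --encoding value for the whole file cannot fit every message. As a rough sketch (not what email2pdf does, and with a made-up two-part sample), Python's email module can report both values for each text part and decode accordingly; describe_text_parts is a hypothetical name.

```python
from email import message_from_bytes

# Two text parts with different charsets and transfer encodings.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain; charset=windows-1252; format=flowed\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe\r\n"
    b"--b--\r\n"
)


def describe_text_parts(msg):
    """Yield (charset, transfer_encoding, text) for each text part."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips the multipart container and any attachments
        charset = part.get_content_charset() or "ascii"
        cte = str(part.get("Content-Transfer-Encoding", "7bit")).lower()
        raw = part.get_payload(decode=True)  # undoes quoted-printable/base64
        yield charset, cte, raw.decode(charset, errors="replace")


parts = list(describe_text_parts(message_from_bytes(RAW)))
for charset, cte, text in parts:
    print(charset, cte, text.strip())
```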

I give up for now, as I urgently need to get the task done; I will use Thunderbird and its GUI manually for all the files, but maybe one day someone will post a solution or fix. I don't envy @andrewferrier's task of working out all these encodings, charsets and Content-Transfer-Encodings.

Apart from the unicode issue the script seems to work perfectly.
