bugfix: stop ignoring first line of imported email #484

Open. skrobul wants to merge 1 commit into base: main

skrobul

@skrobul skrobul commented Nov 1, 2024

I stumbled upon a bug where almost all of the restored emails were corrupted. The affected emails had almost the same text as the originals, but the formatting was all over the place, some of the words were badly mangled, and HTML tables were broken.

Upon investigation, I noticed that the original .eml file and the copy downloaded via Google's "Download message" were practically identical, with the exception of a few headers. One of those headers was Content-Transfer-Encoding, which happened to be the very first line of each corrupted email. With that header missing from the restored copy, the quoted-printable body is presumably no longer decoded, which would explain the mangled text.

Example diff:

    $ diff docker_meetup_email_after_gyb_restore.eml original_docker_email.eml
    1,11c1
    < Authentication-Results: mx.google.com;
    <        dkim=neutral (body hash did not verify) [email protected] header.s=s1 header.b="Ivpq/sFe"
    < X-Google-Smtp-Source: AGHT+IGEy4ty3doaMTjFqiOkSsSCpk9NLEy/NCs28XMnDJUGPy5CZ54yo2foi5usb9P4cI1hNo4Fqzyh56Lj5OK1xw==
    < Received: from 777146845227
    <       named unknown
    <       by gmailapi.google.com
    <       with HTTPREST;
    <       Fri, 1 Nov 2024 20:03:11 +0000
    ---
    > Content-Transfer-Encoding: quoted-printable
    46a37
    > X-Google-Smtp-Source: AAOMgpen7PSnhPReh9WOrpUPOxq9IhkBBjd6pokoxWeGNf9xtEIQtwHrvjIF7wax5u3067qhdJYI
    $

After looking into the source code of fmbox.py, I noticed that the constructor of class fmbox() advances self._file when initialising _last_from_line but does not rewind it, which effectively produces a message that is stripped of its first line.
Presumably this is not a problem when a file starts with a From line, but it is when it starts with anything else.
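
To illustrate the mechanism described above, here is a minimal sketch. `NaiveReader` and `RewindingReader` are hypothetical stand-ins written for this example, not the real code from fmbox.py; only the attribute names `_file` and `_last_from_line` are taken from the description. Rewinding with `seek(0)` is one possible remedy and is shown here as an assumption.

```python
import io


class NaiveReader:
    """Hypothetical stand-in for the reader; NOT the real fmbox class."""

    def __init__(self, fileobj):
        self._file = fileobj
        # Reading the first line here moves the file position past it.
        self._last_from_line = self._file.readline()

    def read_message(self):
        # Returns everything from the current position onwards.
        return self._file.read()


class RewindingReader(NaiveReader):
    """Same as above, but rewinds after peeking at the first line."""

    def __init__(self, fileobj):
        super().__init__(fileobj)
        self._file.seek(0)  # assumption: the whole file is one message


EML = (
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"Subject: hello\n"
    b"\n"
    b"a=3Db\n"
)

naive = NaiveReader(io.BytesIO(EML)).read_message()
fixed = RewindingReader(io.BytesIO(EML)).read_message()

print(naive.splitlines()[0])  # first header is gone in the naive version
print(fixed.splitlines()[0])
```

In the naive variant the message starts at the second line, so the Content-Transfer-Encoding header never reaches the importer.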

At this point I am not sure whether this is provider-specific, but for context, my .eml files were created by the Proton Mail export tool. The same emails had been imported from Google Takeout into Proton a few years earlier, if that matters.

I have tested the fix by importing about 500 messages and they all display correctly.

This is also likely related to the problem @infovations has seen in #148 as well as #157

@jay0lee
Member

jay0lee commented Nov 14, 2024

Hmmm...

So this is a difference between mbox files where the first line is the From delimiter (see https://en.wikipedia.org/wiki/Mbox) and .eml files where the first line is an email header.

The proper fix here would be to examine that first line and remove it only if it actually starts with "From " (note: no colon).
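
A rough sketch of that check, assuming the file holds a single message and is opened in binary mode. The function name `read_message_bytes` is made up for this example and does not exist in the project.

```python
import io


def read_message_bytes(fileobj):
    """Return the message bytes, dropping a leading mbox "From " delimiter.

    Hypothetical helper for illustration only.
    """
    first_line = fileobj.readline()
    rest = fileobj.read()
    # The mbox delimiter is "From " followed by a space-separated envelope;
    # a "From:" header has a colon in that position and must be kept.
    if first_line.startswith(b"From "):
        return rest
    return first_line + rest


mbox_style = b"From sender@example.com Fri Nov  1 20:03:11 2024\nSubject: a\n\nbody\n"
eml_style = b"Content-Transfer-Encoding: quoted-printable\nSubject: a\n\nbody\n"
from_header_first = b"From: sender@example.com\nSubject: a\n\nbody\n"

print(read_message_bytes(io.BytesIO(mbox_style)))
print(read_message_bytes(io.BytesIO(eml_style)))
print(read_message_bytes(io.BytesIO(from_header_first)))
```

Only the mbox-style input loses its first line; both .eml-style inputs come back unchanged, including the one whose first header is `From:`.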
