[Bug] Non strict loading not working #47

Open
AKorets opened this issue Apr 18, 2020 · 6 comments

@AKorets

AKorets commented Apr 18, 2020

Describe the bug
I have an example file whose encoding alone crashes the loading process.
gedcom_parser.parse_file(file_path, False) # Disable strict parsing
This call crashes on the attached example file:
one_person_myheritage.rename to ged.log

To Reproduce

  1. Download the attached one_person_myheritage.rename to ged.log
  2. Rename the file to one_person_myheritage.ged

Run these Python lines:

from gedcom.parser import Parser

gedcom_parser = Parser()
gedcom_parser.parse_file("one_person_myheritage.ged", False) # Disable strict parsing

The exception is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 206: invalid continuation byte

Expected behavior
When strict is set to False, this exception should not be raised.

Additional context
The fix is expected to be in this line:
last_element = self.__parse_line(line_number, line.decode('utf-8-sig'), last_element, strict)
in the function def parse(self, gedcom_stream, strict=True):

@AKorets AKorets added the bug Something isn't working label Apr 18, 2020
@nickreynke nickreynke self-assigned this May 7, 2020
@nickreynke nickreynke added this to the 2.0.0 milestone May 7, 2020
@mikaelho

mikaelho commented Sep 13, 2020

Whether this is a bug in the strict option or something else is debatable.

I have the same issue, and it is caused by MyHeritage splitting CONC lines in the middle of a two-byte UTF-8 character. The resulting line can no longer be decoded as UTF-8.

I am hacking around this by catching the Unicode exceptions, and in case of a CONC line, concatenating the line with the next line (discarding the extra line break and the extra CONC on the next line), and trying again.

        lines = iter(gedcom_file)
        conc_tag = b' CONC '

        for line in lines:
            take_next = True
            while take_next:
                try:
                    line = line.decode('utf-8-sig')
                    take_next = False
                except UnicodeDecodeError:
                    if conc_tag in line:
                        # A character was split across two CONC lines: drop this
                        # line's line break and append the next line's payload.
                        next_line = next(lines)
                        next_payload = next_line[next_line.find(conc_tag) + len(conc_tag):]
                        # rstrip handles both \r\n and \n line endings
                        line = line.rstrip(b'\r\n') + next_payload
                    else:
                        raise
            last_element = self.__parse_line(line_number, line, last_element, strict)
            line_number += 1
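
The same joining idea can be tried outside the parser. The following is only a minimal sketch: the function name `join_split_conc_lines` is hypothetical and not part of python-gedcom, and it assumes the lines are `bytes` objects and that the broken character is completed by the very next CONC line.

```python
CONC_TAG = b" CONC "


def join_split_conc_lines(raw_lines):
    """Yield decoded lines, merging a line with the following CONC line
    whenever the line on its own is not valid UTF-8."""
    lines = iter(raw_lines)
    for line in lines:
        while True:
            try:
                text = line.decode("utf-8-sig")
                break
            except UnicodeDecodeError:
                next_line = next(lines, None)
                # Give up if there is no CONC continuation to merge with.
                if next_line is None or CONC_TAG not in next_line:
                    raise
                payload = next_line.split(CONC_TAG, 1)[1]
                # Drop the line break, then append the continuation payload.
                line = line.rstrip(b"\r\n") + payload
        yield text
```

Because merged lines are consumed from the iterator, line numbers reported afterwards refer to the merged output, not to the original file.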

@slavkoja

I can confirm that this can happen when MyHeritage splits a CONC line inside a UTF-8 character, but I ran into it with an export from webtrees, where it happens in the middle of a line. Of course, the webtrees export problem may have been caused by an earlier import of a broken UTF-8 file from MyHeritage, but either way your hack doesn't help in this case.

I would suggest catching the UnicodeDecodeError and re-raising it with the line number added, so that one can manually investigate and fix the file. The current error gives no information about where the problem occurs.
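
A rough sketch of that suggestion follows. The helper name `decode_with_line_numbers` is hypothetical and not part of python-gedcom, and it assumes lines are numbered from 1. It re-raises the same exception type with the line number appended to the reason.

```python
def decode_with_line_numbers(raw_lines, encoding="utf-8-sig"):
    """Decode byte lines one by one; on failure, report the line number."""
    for line_number, raw in enumerate(raw_lines, start=1):
        try:
            yield raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # Rebuild the exception with the line number added to the reason.
            raise UnicodeDecodeError(
                exc.encoding, exc.object, exc.start, exc.end,
                f"{exc.reason} (line {line_number})",
            ) from exc
```

The byte position in the message is still relative to the start of the offending line, not to the start of the file.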

@AKorets

AKorets commented May 17, 2021

Can you attach the shortest possible example of the issue?
Maybe there is an easy workaround that I can suggest.

@slavkoja

I am sorry, too late ;-)

I fixed the problems and deleted the broken files.

@rjsdotorg

I ran into this with ellipsis and "dot" characters in "notes" fields.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 24: invalid start byte

I solved it in parser.py line 144 via:

        for line in gedcom_file:
            try:
                last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
            except UnicodeDecodeError:
                if not strict:
                    print('UnicodeDecodeError found:', line_number, line)
                    try:
                        last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='replace'), last_element, strict)
                    except Exception:
                        print('  replace error:', line_number, line)
                        raise
                else:
                    raise
            line_number += 1

so that strict=False now replaces undecodable bytes with U+FFFD, the Unicode replacement character (what errors='replace' does when decoding).
It also prints the line number and the raw line, so that you can fix it in the original database.
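
The fallback can be checked in isolation. This is a minimal sketch with a hypothetical helper `decode_line` that is not part of python-gedcom. It only covers the decoding step, not the call to `__parse_line`.

```python
def decode_line(raw, strict=True):
    """Decode one GEDCOM line; in non-strict mode, substitute U+FFFD for
    bytes that are not valid UTF-8 instead of raising."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        if strict:
            raise
        return raw.decode("utf-8-sig", errors="replace")


# Byte 0x85 is not a valid UTF-8 start byte.
print(repr(decode_line(b"1 NOTE wait\x85\r\n", strict=False)))
```

Byte 0x85 is the ellipsis in Windows-1252, so a file that fails this way may have been written in that encoding rather than UTF-8. In that case, replacing the bytes loses the original characters.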

@nkapyrin

nkapyrin commented Jun 14, 2022

If the export from MyHeritage chops up your Cyrillic UTF-8 characters, it is easy to reconstruct them.
I suggest using the code below inside the excellent solution suggested by rjsdotorg, which you could leave in for debugging purposes.

    err_flag = 0  # add this (custom code for Cyrillic export from MyHeritage)
    for line in gedcom_file:
        # add this (custom code for Cyrillic export from MyHeritage)
        # if the previous line ended in D0 or D1, put that lead byte back in
        # front of the first payload byte (index 7, right after "n CONC ")
        if err_flag != 0:
            line = line[:7] + bytes([err_flag]) + line[7:]
        # if the new line ends in D0 or D1 (+\r\n), remove that byte and set the flag
        if len(line) >= 3 and line[-3] in (0xD0, 0xD1):
            err_flag = line[-3]
            line = line[:-3] + line[-2:]
        else:
            err_flag = 0
        # END of custom code for Cyrillic export from MyHeritage

        # now back to https://github.com/nickreynke/python-gedcom/issues/47#issuecomment-980783824
        try:
            last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
            # etc ...
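
To try the idea without touching the parser, here is a self-contained sketch. The function name `reattach_split_lead_bytes` is hypothetical, and the sketch rests on three assumptions: a single-digit level number, so the CONC payload starts at byte 7; a split that happens only after a 0xD0 or 0xD1 lead byte; and a continuation byte that opens the next line's payload. It moves the orphaned lead byte to the start of the next payload, so both lines become valid UTF-8 again.

```python
LEAD_BYTES = (0xD0, 0xD1)   # lead bytes of two-byte Cyrillic UTF-8 sequences
PAYLOAD_START = 7           # len(b"2 CONC "), assumes a single-digit level


def reattach_split_lead_bytes(raw_lines):
    """Yield byte lines with split Cyrillic characters made whole again."""
    pending = None
    for line in raw_lines:
        if pending is not None:
            # Put the lead byte from the previous line in front of the payload.
            line = line[:PAYLOAD_START] + bytes([pending]) + line[PAYLOAD_START:]
            pending = None
        body = line.rstrip(b"\r\n")
        ending = line[len(body):]
        if body and body[-1] in LEAD_BYTES:
            # The line ends with an orphaned lead byte: carry it over.
            pending = body[-1]
            line = body[:-1] + ending
        yield line
```

Unlike joining the two lines, this keeps the line count unchanged, so line numbers in later error messages still match the original file.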
