[Bug] Non strict loading not working #47

AKorets · 2020-04-18T03:32:05Z

Describe the bug
I have an example file whose encoding crashes the loading process, even with strict parsing disabled.
gedcom_parser.parse_file(file_path, False) # Disable strict parsing
This line raises the exception.
The example file is attached: one_person_myheritage.rename to ged.log

To Reproduce

Download the attached one_person_myheritage.rename to ged.log
Rename the file to one_person_myheritage.ged

Run these Python lines:

from gedcom.parser import Parser

gedcom_parser = Parser()
gedcom_parser.parse_file("one_person_myheritage.ged", False) # Disable strict parsing

The exception is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 206: invalid continuation byte

Expected behavior
With the strict parameter set to False, there is no reason for this exception to be raised.

Additional context
The fix probably belongs on this line:
last_element = self.__parse_line(line_number, line.decode('utf-8-sig'), last_element, strict)
in the function def parse(self, gedcom_stream, strict=True):

mikaelho · 2020-09-13T11:27:11Z

Whether this is a bug in the strict option or something else is debatable.

I have the same issue, and it is caused by MyHeritage splitting CONC lines in the middle of a two-byte UTF-8 character. The resulting line can no longer be decoded as UTF-8.
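
A minimal sketch of why such a split breaks decoding. The sample text and the split position are made up for illustration; they are not taken from the attached file.

```python
# Hypothetical sample: a Cyrillic note, encoded as UTF-8.
raw = "1 NOTE Привет".encode("utf-8")

# Cut one byte short of a character boundary, as the export appears to do.
first_line = raw[:-1] + b"\r\n"                # ends with a dangling lead byte 0xD1
second_line = b"2 CONC " + raw[-1:] + b"\r\n"  # starts with the orphaned continuation byte


def decodes(line: bytes) -> bool:
    """Return True if the line is valid UTF-8 (with optional BOM)."""
    try:
        line.decode("utf-8-sig")
        return True
    except UnicodeDecodeError:
        return False


print(decodes(first_line))   # False: "invalid continuation byte"
print(decodes(second_line))  # False: "invalid start byte"
print(decodes(raw))          # True: the unsplit text is fine
```

Both halves fail on their own, which is why any fix has to look at two adjacent lines together.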

I am working around this by catching the Unicode exception and, if the line is a CONC line, concatenating it with the next line (discarding the extra line break and the extra CONC tag on the next line) and trying again.

        conc_tag = b' CONC '
        lines = iter(gedcom_file)

        for line in lines:
            take_next = True
            while take_next:
                try:
                    line = line.decode('utf-8-sig')
                    take_next = False
                except UnicodeDecodeError:
                    if conc_tag in line:
                        # Splice the next line's payload onto this one, dropping
                        # the line break and the next line's "n CONC " prefix.
                        next_line = next(lines)
                        next_payload = next_line[next_line.find(conc_tag) + len(conc_tag):]
                        # rstrip handles both \r\n and \n line endings
                        line = line.rstrip(b'\r\n') + next_payload
                    else:
                        raise
            last_element = self.__parse_line(line_number, line, last_element, strict)
            line_number += 1
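
The same idea as a standalone sketch that can be tried outside the parser. The function name `decode_joining_conc` is hypothetical and not part of python-gedcom. Unlike the snippet above, this version checks the following line for the CONC tag rather than the failing one, which is an assumption about where the split shows up.

```python
from typing import Iterable, Iterator

CONC_TAG = b" CONC "


def decode_joining_conc(raw_lines: Iterable[bytes]) -> Iterator[str]:
    """Yield decoded lines, merging in the next CONC line when decoding fails."""
    lines = iter(raw_lines)
    for line in lines:
        while True:
            try:
                text = line.decode("utf-8-sig")
                break
            except UnicodeDecodeError:
                following = next(lines, None)
                if following is None or CONC_TAG not in following:
                    raise  # nothing to merge with: surface the original error
                payload = following[following.find(CONC_TAG) + len(CONC_TAG):]
                line = line.rstrip(b"\r\n") + payload
        yield text


# Hypothetical input: one character split across a NOTE line and its CONC line.
raw = "1 NOTE Привет".encode("utf-8")
sample = [raw[:-1] + b"\r\n", b"2 CONC " + raw[-1:] + b"\r\n", b"1 SEX M\r\n"]
print(list(decode_joining_conc(sample)))
```

Merging two physical lines into one means any line counter kept alongside no longer matches the file's line numbers exactly.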

slavkoja · 2021-03-20T21:42:37Z

I can confirm that this can happen when MyHeritage splits CONC lines inside UTF-8 characters, but I hit it with an export from webtrees, where it happens in the middle of a line. Of course, the webtrees export problem may be caused by an earlier import of broken UTF-8 from MyHeritage, but either way your hack doesn't help in this case.

I would suggest catching the UnicodeDecodeError and re-raising it with the line number added, so that one can manually investigate and fix the file, because the current error gives no information about where the problem occurs.
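
One way this suggestion could look, as a sketch only. The helper name `decode_line` is hypothetical; in the parser, the same try/except would wrap the existing `line.decode('utf-8-sig')` call.

```python
def decode_line(raw: bytes, line_number: int) -> str:
    """Decode one GEDCOM line, adding the line number to any decode error."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        # Rebuild the same exception type with the line number appended to the reason.
        raise UnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end,
            f"{exc.reason} (GEDCOM line {line_number})",
        ) from exc


try:
    decode_line(b"2 CONC \xd0\r\n", 3)
except UnicodeDecodeError as err:
    print(err)
```

Keeping the exception type unchanged means existing callers that catch UnicodeDecodeError continue to work.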

AKorets · 2021-05-17T19:32:53Z

Can you attach the shortest possible example of the issue?
Maybe there is an easy workaround that I can suggest.

slavkoja · 2021-05-17T21:01:38Z

I am sorry, too late ;-)

I fixed the problems and deleted the broken files.

rjsdotorg · 2021-11-27T18:46:57Z

I ran into this with ellipsis and "dot" characters in "notes" fields.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 24: invalid start byte

I solved it in parser.py line 144 via:

        for line in gedcom_file:
            try:
                last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
            except UnicodeDecodeError:
                if not strict:
                    print('UnicodeDecodeError found:', line_number, line)
                    try:
                        last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='replace'), last_element, strict)
                    except Exception:
                        print('  replace error:', line_number, line)
                        raise
                else:
                    raise
            line_number += 1

so that strict=False now replaces undecodable bytes with U+FFFD, the Unicode replacement character (which is what errors='replace' does when decoding).
It also tells you where the problem was, so that you can fix it in the original database.
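
A small sketch of what `errors='replace'` does with a byte like 0x85. The sample line is made up. The second decode assumes the file was really written as Windows-1252, where 0x85 is the ellipsis character; that is a guess about the source of the byte, not something the thread confirms.

```python
raw = b"1 NOTE wait\x85 done\r\n"  # hypothetical line containing a stray 0x85

# errors='replace' swaps each undecodable byte for U+FFFD.
replaced = raw.decode("utf-8-sig", errors="replace")
print(repr(replaced))   # '1 NOTE wait\ufffd done\r\n'

# Assumption: the byte came from Windows-1252, where 0x85 is "…" (U+2026).
as_cp1252 = raw.decode("cp1252")
print(repr(as_cp1252))  # '1 NOTE wait… done\r\n'
```

With `errors='replace'` the original character is lost, so the printed line number is what lets you go back and repair the source data.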

nkapyrin · 2022-06-14T23:48:50Z

If the export from MyHeritage chops up your Cyrillic Unicode characters, it is very easy to reconstruct them.
I suggest using the code below inside the excellent solution suggested by rjsdotorg, which you can leave in for debugging purposes.

    err_flag = 0  # add this (custom code for cyrillic export from myheritage)
    for line in gedcom_file:
        # add this (custom code for cyrillic export from myheritage)
        # if the previous line ended in D1 or D0, fix the first letter of the new line
        new_letter = 0
        if err_flag != 0:
            if err_flag == 0xD1 and line[7] >= 0x80:
                new_letter = (err_flag << 8) + line[7] - 0xcd40
            else:
                new_letter = (err_flag << 8) + line[7] - 0xcc80
            line = line[:7] + new_letter.to_bytes(2, 'big') + line[8:]
        # if the new line ends in D0 or D1 (+\r\n), remove that byte and set the flag
        if line[-3] == 0xD0 or line[-3] == 0xD1:
            err_flag = line[-3]
            line = line[:-3] + line[-2:]
        else:
            err_flag = 0
        # END of custom code for cyrillic export from myheritage

        # now back to https://github.com/nickreynke/python-gedcom/issues/47#issuecomment-980783824
        try:
            last_element = self.__parse_line(line_number, line.decode('utf-8-sig', errors='strict'), last_element, strict)
            # etc ...
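
The two constants above appear to turn a lead byte (0xD0 or 0xD1) plus a continuation byte into the character's Unicode code point. The check below confirms that arithmetic for valid continuation bytes (0x80 to 0xBF); the helper `two_byte_code_point` is introduced here for illustration only. Note that `int.to_bytes(2, 'big')` writes the code point's big-endian bytes rather than its UTF-8 encoding, so the sketch also shows `chr(cp).encode('utf-8')` as one way to get UTF-8 bytes back. Whether the snippet above needs that change depends on the export's actual bytes, which the thread does not show.

```python
def two_byte_code_point(lead: int, cont: int) -> int:
    """Standard UTF-8 decoding of a two-byte sequence (110xxxxx 10xxxxxx)."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# The subtraction used in the snippet matches standard decoding for both lead bytes.
for lead, offset in ((0xD0, 0xCC80), (0xD1, 0xCD40)):
    for cont in range(0x80, 0xC0):
        assert (lead << 8) + cont - offset == two_byte_code_point(lead, cont)

cp = (0xD0 << 8) + 0x90 - 0xCC80
print(hex(cp), chr(cp))         # 0x410 А
print(cp.to_bytes(2, "big"))    # b'\x04\x10'
print(chr(cp).encode("utf-8"))  # b'\xd0\x90'
```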

AKorets added the bug label Apr 18, 2020

nickreynke self-assigned this May 7, 2020

nickreynke added this to the 2.0.0 milestone May 7, 2020

gorogm mentioned this issue Jan 27, 2022 in: invalid continuation byte #63 (Open)
