[Bug] Non strict loading not working #47
Comments
I have the same issue, and it is caused by MyHeritage splitting CONC lines in the middle of a two-byte Unicode character. The resulting line can no longer be decoded as UTF-8. I am working around this by catching the Unicode exceptions and, in the case of a CONC line, concatenating the line with the next line (discarding the extra line break and the extra CONC on the next line) and trying again.
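The workaround described above could look roughly like the following. This is only a sketch, not the code actually used: the helper name `decode_gedcom_lines` and the regular expression are assumptions, and it works on a list of raw byte lines rather than inside the library's parser.

```python
import re

# Matches the "<level> CONC " prefix of a continuation line (as bytes).
CONC_PREFIX = re.compile(rb"^\s*\d+ CONC ")


def decode_gedcom_lines(raw_lines):
    """Decode raw lines, merging a line with the following CONC line
    when a multi-byte UTF-8 character was split across the two."""
    pending = list(raw_lines)
    decoded = []
    i = 0
    while i < len(pending):
        line = pending[i]
        while True:
            try:
                decoded.append(line.decode("utf-8-sig"))
                break
            except UnicodeDecodeError:
                nxt = pending[i + 1] if i + 1 < len(pending) else None
                # Only a following CONC line can be merged; otherwise give up.
                if nxt is None or not CONC_PREFIX.match(nxt):
                    raise
                # Drop the line break and the next line's "<level> CONC " prefix.
                line = line.rstrip(b"\r\n") + CONC_PREFIX.sub(b"", nxt, count=1)
                i += 1
        i += 1
    return decoded
```

Merging at the byte level, before decoding, is what makes this work: the two halves of the split character only become valid UTF-8 once they are adjacent again.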
I can confirm that this can happen when MyHeritage splits CONC lines inside UTF-8 characters, but I hit it with an export from webtrees, where it happens in the middle of a line. Of course, the webtrees export problem may have been caused by an earlier import of broken UTF-8 from MyHeritage, but either way your hack does not help in that case. I suggest catching the UnicodeDecodeError and re-raising it with the line number added, so that one can manually investigate and fix the file; the current error gives no information about where the problem occurs.
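A minimal sketch of that suggestion, assuming a hypothetical helper `decode_with_line_number` that the parser would call instead of decoding inline (the name and its placement are not part of the library):

```python
def decode_with_line_number(raw_line, line_number):
    """Decode one raw line; on failure, re-raise with the line number added."""
    try:
        return raw_line.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        # UnicodeDecodeError takes (encoding, object, start, end, reason);
        # rebuild it with the line number appended to the reason.
        raise UnicodeDecodeError(
            exc.encoding,
            exc.object,
            exc.start,
            exc.end,
            f"{exc.reason} (line {line_number})",
        ) from exc
```

The exception type stays the same, so existing `except UnicodeDecodeError` handlers keep working, while the message now says which line of the file to inspect.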
Can you attach the shortest possible example of the issue?
I am sorry, too late ;-) I fixed the problems and deleted the broken files.
I ran into this with ellipsis and "dot" characters in "notes" fields. I solved it in parser.py line 144 via:
so that strict=False now replaces odd bytes with "?" (the replace default).
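The exact change is not shown here, but one way to get this behaviour is to pick the decoder's error handler from the `strict` flag. The helper name `decode_lenient` below is an assumption for illustration. Note that when decoding, Python's `"replace"` handler inserts U+FFFD (the Unicode replacement character), which many terminals and fonts display as a question-mark glyph.

```python
def decode_lenient(raw_line, strict=True):
    """Decode a raw line; in non-strict mode, replace undecodable bytes."""
    # "strict" raises UnicodeDecodeError; "replace" substitutes U+FFFD.
    errors = "strict" if strict else "replace"
    return raw_line.decode("utf-8-sig", errors=errors)
```

With this, `decode_lenient(b"1 NOTE \xd0\n", strict=False)` returns the line with the stray byte replaced, while `strict=True` still raises as before.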
If the export from MyHeritage chops up your Cyrillic Unicode characters, it is very easy to reconstruct them.
Describe the bug
I have an example file whose encoding crashes the loading process.
gedcom_parser.parse_file(file_path, False) # Disable strict parsing
This line triggers the crash.
one_person_myheritage.rename to ged.log
The example file is attached.
To Reproduce
Run these Python lines:
The exception is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 206: invalid continuation byte
Expected behavior
When strict parsing is disabled by passing False, there is no reason for this exception to be raised.
Additional context
The fix is expected to be in the line
last_element = self.__parse_line(line_number, line.decode('utf-8-sig'), last_element, strict)
in the function
def parse(self, gedcom_stream, strict=True):