Anti-issue: YAML::PP parses JSON that all the other perl JSON modules can't! #1
Comments
Heh, that's funny. I'll leave this open until I have added validation and a corresponding configuration option.
Thank you! And yes, it's been a frustrating experience; the folks who generate the data feed don't seem to think it's their problem to solve.
Just a comment: |
@choroba what version of jq? Last I checked, jq was still throwing errors on this sort of badly formed JSON.
jq-1.5. It seems 1.6 should be around, too, so maybe it's different.
Okay, so yes: jq does parse the above example snippet of butchered UTF-8 JSON, but I've got worse examples from the same data feed that jq barfs on. Either way, YAML::PP is still the only way I can reliably parse this kind of JSON in Perl, and I'm super happy that it still works.
@warewolf That JSON is invalid because it contains an unpaired half of a surrogate pair. How would you like to handle and decode invalid JSON? Such a string has no representation in UTF-8, so you cannot load and decode it. I see two options: 1) skip every non-parsable byte in the input, or 2) replace non-parsable tokens in the JSON string with the Unicode replacement character. But both options change the input, so when processing it in Perl you would have something different from what was sent. I understand the analytical reasons for trying to process as much data as possible, but when the input is invalid, you need to specify how it should be handled, knowing that the handling is not reversible.
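To make option 2 concrete, here is a minimal sketch in Python rather than Perl. It assumes a parser that tolerates lone surrogate escapes (Python's standard `json` module does) and then swaps any surviving surrogate code point for U+FFFD. The names `scrub` and `load_json_lossy` are made up for illustration and are not part of any library discussed in this thread.

```python
import json
import re

# After json.loads, valid surrogate pairs have already been combined into a
# single code point, so any code point left in U+D800..U+DFFF is unpaired.
LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def scrub(value):
    """Recursively replace lone surrogates in strings with U+FFFD."""
    if isinstance(value, str):
        return LONE_SURROGATE.sub("\ufffd", value)
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {scrub(key): scrub(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


def load_json_lossy(text):
    """Hypothetical helper: parse JSON, then make every string UTF-8 safe."""
    return scrub(json.loads(text))


if __name__ == "__main__":
    doc = r'{"Subject": "OU=\u027d\ufffd\ude64\ufffd\u0467/O=sdlg"}'
    data = load_json_lossy(doc)
    # The structure survives; only the unpaired \ude64 has been replaced.
    print(data["Subject"].encode("utf-8"))
```

As noted above, this changes the input irreversibly: once the surrogate is replaced, the original value cannot be recovered.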
@pali well, because of how the JSON is already mangled (non-UTF-8 data interpreted as UTF-8, which ends up completely fubar, as in "this can't be represented in UTF-8"), I honestly don't expect this to be reversible to something consistent. For my use, the actual string values that are corrupted are irrelevant. The rest of the JSON structure I'm parsing does have value, so for me the important part is not bailing on parsing the entire JSON object. Sadly I can't fix the origin data because it's from a commercial data feed, and apparently Python will gladly serialize to invalid JSON?
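On the Python point, a small sketch of the behaviour in question, using only the standard `json` module. With its default settings, `json.dumps` writes a lone surrogate as a `\uXXXX` escape without complaint, and `json.loads` reads it back. The sample value is made up for illustration, and this says nothing about how the feed's producer actually generates its data.

```python
import json

# A Python str may hold an unpaired surrogate; here U+DE64, a lone low surrogate.
broken = "\u027d\ufffd\ude64\u0467"

# With the default ensure_ascii=True, the serializer emits it as a \uXXXX escape.
encoded = json.dumps({"Subject": broken})

# The same module accepts the escape again and returns the same str.
decoded = json.loads(encoded)


def is_utf8_encodable(text):
    """Return True if text can be encoded as UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(encoded)
    print(decoded["Subject"] == broken)
    print(is_utf8_encodable(broken))
```

So the document round-trips inside Python, but the decoded string still cannot be written out as UTF-8, which is the point at which stricter consumers give up.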
I can imagine that the maintainer of Cpanel::JSON::XS could accept an optional feature to also process invalid JSON strings and replace invalid characters with the Unicode replacement character. So if you have real use cases (which it seems you do), open an issue/feature request for Cpanel::JSON::XS.
So yeah, this is an anti-issue - I discovered recently that JSON is "a subset of YAML 1.2"; and then discovered YAML::PP. In short: Thank you. YAML::PP doesn't bomb on JSON that is produced with ham-fisted UTF-8 encoding.
It appears that one company in particular that distributes a data feed has somehow "switched on" interpreting all data ingested as UTF-8, even when it wasn't UTF-8 encoded. Imagine interpreting the header of a ZIP file as Unicode. The result is corrupted garbage, and it isn't standards compliant.
Example:
{"Subject": "CN=\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u0531/OU=\ufffd\ufffd\u01b4\ufffd/OU=\u027d\ufffd\ufffd\ufffd\ude64\ufffd\ufffd\u0467/O=sdlg" }
Nothing else in Perl land seems to be able to parse the above JSON document. YAML::PP does, as of v0.005.
My request: Please let this continue to be the case. If you do end up adding validation of Unicode character sequences, give folks an option to turn it off.