-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Description and implementation of YAML::PP::Schema::Binary does not make sense #28
Comments
So I tried fixing the code by using this check:
However, this produces unwanted output:
Maybe I didn't understand your suggestions correctly? |
Maybe it's simply not possible to detect if data is binary. I will document that only loading binary data this way is recommended. |
Yes, this is not possible. You must declare in API if input should be treated as binary (and therefore only
Lets stop at above line as it does not have to be obvious and simple what is doing here. Unless you specify my $input = decode_utf8("\xC3\xA4");
So in
And what should Binary schema/encoding do? I would say that it should expects on its input binary buffer and produce ouput in YAML marked as binary. And because on input was one byte So, main problem is there how is (or rather how should be) defined API of Binary schema. If you define API in way that input is expected and must be in binary 8bit, then code should look like: if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
die "Input is not 8-bit";
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
# Only 7-bit ascii, base64 encoding is not needed
return;
}
# now apply base64 encoding
... Plus there is a still small problem, 64bit perl supports also characters above 0x10FFFF up to the 2^64-1. They are marked as Extended Perl Unicode and are not portable. But in Perl you can create them and use them. So to be fully precise, you should check for non-8-bit as: unless ($binary =~ m/^[\x00-\xFF]*$/) {
die "Input is not 8-bit";
}
This should work fine. Alternative solution how to handle types when dumping perl structure to typed-formats (YAML/JSON/...) is to provide types explicitly via additional argument / interface. This is for example implemented in Cpanel::JSON::XS, you can look at examples in documentation: https://metacpan.org/pod/Cpanel::JSON::XS::Type |
Based on above description, here is my proposed change with API that binary schema expects 8-bit data: diff --git a/lib/YAML/PP/Schema/Binary.pm b/lib/YAML/PP/Schema/Binary.pm
index 30b4491..a63fa64 100644
--- a/lib/YAML/PP/Schema/Binary.pm
+++ b/lib/YAML/PP/Schema/Binary.pm
@@ -27,15 +27,15 @@ sub register {
code => sub {
my ($rep, $node) = @_;
my $binary = $node->{value};
- unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) {
- # ASCII
- return;
+ if ($binary =~ m/[^\x{00}-\x{FF}]/) {
+ # non 8-bit
+ die "Input is not 8-bit binary\n";
}
- if (utf8::is_utf8($binary)) {
- # utf8
+ if ($binary =~ m/^[\x{00}-\x{7F}]*$/) {
+ # 7-bit ASCII
return;
}
- # everything else must be base64 encoded
+ # 8-bit must be base64 encoded
my $base64 = encode_base64($binary);
$node->{style} = YAML_ANY_SCALAR_STYLE;
$node->{data} = $base64;
@@ -84,10 +84,10 @@ See <https://yaml.org/type/binary.html>
By prepending a base64 encoded binary string with the C<!!binary> tag, it can
be automatically decoded when loading.
-Note that the logic for dumping is probably broken, see
-L<https://github.com/perlpunk/YAML-PP-p5/issues/28>.
+If you are using this schema, any string containing C<[\x{80}-\x{FF}]>
+(non-7-bit) will be dumped as binary.
-Suggestions welcome.
+This schema cannot be used for non-8-bit (non-binary) data.
=head1 METHODS
diff --git a/t/45.binary.t b/t/45.binary.t
index caed569..24e4b58 100644
--- a/t/45.binary.t
+++ b/t/45.binary.t
@@ -1,6 +1,7 @@
#!/usr/bin/env perl
use strict;
use warnings;
+use utf8;
use Test::More tests => 3;
use YAML::PP;
@@ -41,16 +42,15 @@ EOM
};
-my $latin1_a_umlaut = encode(latin1 => (decode_utf8 "ä"));
+my $latin1_a_umlaut = encode(latin1 => "ä");
my @tests = (
- [utf8 => "a"],
+ [ascii => "a"],
+ [ascii => "test"],
+ [ascii => "euro"],
[binary => $latin1_a_umlaut],
- [binary => "\304\244",],
- [utf8 => decode_utf8("\304\244"),],
- [binary => "a umlaut ä",],
- [utf8 => decode_utf8("a umlaut ä"),],
- [binary => "euro €",],
- [utf8 => decode_utf8("euro €"),],
+ [binary => "\304\244",],
+ [binary => encode_utf8("a umlaut ä"),],
+ [binary => encode_utf8("euro €"),],
[binary => "\303\274 \374",],
[binary => "\xC0\x80"],
[binary => "\xC0\xAF"],
@@ -59,7 +59,7 @@ my @tests = (
[binary => "\xE0\x83\xBF"],
[binary => "\xF0\x80\x83\xBF"],
[binary => "\xF0\x80\xA3\x80"],
- [binary => [$gif, decode_utf8("ä")],],
+ [binary => [$gif, encode_utf8("ä")],],
[binary => [$gif, 'foo'],],
);
|
First description:
YAML-PP-p5/lib/YAML/PP/Schema/Binary.pm
Lines 77 to 81 in 78a0506
encode_base64()
is a function: {0x00..0xFF}ᴺ →{0x2B, 0x2F, 0x30..0x39, 0x41..0x5A, 0x61..7A}ᴹSo YAML::PP::Schema::Binary obviously cannot take string which contains {0x000100..0x10FFFF}.
Binary data are defined as stream of octets, therefore from set {0x00..0xFF} like
encode_base64()
function takes it.And next implementation:
YAML-PP-p5/lib/YAML/PP/Schema/Binary.pm
Lines 29 to 39 in 78a0506
unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/)
is equivalent toif ($binary =~ m/[\x{00}-\x{7E}]/)
checks for all 7-bit ASCII characters except 7-bit ASCII\x{7F}
. Comment says that this code is forASCII
which is not truth as it isASCII
∖ {0x7F}.Next there is check
if (utf8::is_utf8($binary))
which in our case is basically:if ($binary !~ m/[\x00-\xFF]/ and (int(rand(2))%2 == 1 or $binary =~ m/[\x80-\xFF]/))
So this code always skips strings with values from set {0x000100..0x10FFFF} and then (pseudo)-randomly (depending on internal representation of scalar) also skips strings from set {0x80..0xFF}. Comment says that this code is for
utf8
, but it is not truth, see documentation and my above simplified implementation.And finally it calls
encode_base64
function. When this function is called? Strings with only {0x00..0x7E} are ignored by firstASCII
check. Then by second checks are always ignored strings which have at least one value from {0x000100..0x10FFFF}. Soencode_base64
is called when string contains at least one value0x7F
or when there is at least one value from set {0x80..0xFF} andis_utf8
(pseudo)-randomly returned false.Suggested fix
Change YAML::PP::Schema::Binary module to work only with binary data as name suggests. Therefore with strings which contains only values from set {0x00..0xFF}.
Use encode_base64() always when input string is non-7-bit ASCII, therefore contains at least one value from set {0x80..0xFF}.
Decide if Base64 encoding is really needed for strings with character 0x7F. It is 7-bit ASCII therefore in 7-bit applications it is not needed special treatment for it. But I'm not sure if YAML needs special treatment of 0x7F or not.
CC @2shortplanks Please look at it and ideally do some correction if I wrote some mistake. Some parts I simplified to make it more easier to understand.
The text was updated successfully, but these errors were encountered: