In one-liners, you can use the -C
command-line option to tell perl
that Unicode is to be expected (or not) from some part of the program
environment. All forms of the option are documented in the perlrun
man
page,
but the most interesting for our purposes is -CS
, that says that
standard input, output and error are all three expected to get
Unicode data encoded in UTF-8.
(The opposite is -C0
, that will ensure that nothing is expected to get
Unicode data by default.)
Here are some examples. Let's take character U+00FF, also known as LATIN SMALL LETTER Y WITH DIAERESIS (ÿ), and see what perl is actually outputting by inspecting its standard output with xxd hex dumper:
$ perl -E 'print "\xff"' | xxd
0000000: ff .
$ perl -CS -E 'print "\xff"' | xxd
0000000: c3bf ..
Without -CS
, perl outputs the single byte 0xff, which is not
valid UTF-8. With it, perl outputs the correct UTF-8 encoding
of the U+00FF character, which consists of the two bytes 0xc3
and 0xbf, in that order. Hey, that works!
UTF-8 encoding is of variable-length kind. Each character requires from 1 byte to 4 bytes to hold the encoded bitstream.
Byte(s) | Bits | Encoded |
---|---|---|
1 | 6..0 | 0bbbbbbb |
2 | 10..6,5..0 | 110bbbbb 10bbbbbb |
3 | 15..12,11..6,5..0 | 1110bbbb 10bbbbbb 10bbbbbb |
4 | 20..18,17..12,11..6,5..0 | 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb |
The U+00FF character needs 8 bits to the store 0xff
value.
The encoded bitstream thus requires 2 bytes:
$ perl -e 'printf "%b\n", hex "c3bf"'
1100001110111111
# 110_00011 + 10_111111
# 00011 111111
# 11 111111
$ perl -e 'printf "%b\n", hex "ff"'
11111111
Now the reverse. What does perl see when it reads the two bytes
0xc3
and 0xbf
?
$ perl -E 'print "\xc3\xbf"' | perl -E '$x = <>; printf "%d %x\n", length $x, ord $x'
2 c3
$ perl -E 'print "\xc3\xbf"' | perl -CS -E '$x = <>; printf "%d %x\n", length $x, ord $x'
1 ff
In the first case, perl was told it was getting bytes, and that's what it got: two of them, in a two-character long string.
In the second case, perl decoded those bytes as a UTF-8-encoded string, and read a string containing one character, namely ÿ.
By default perl behaves as if -C0
was given on the command-line, but
you can override that default by setting the PERL_UNICODE
environment
variable, before starting your program. (It's not going to work if you
set it once your program has started.) Note that because this variable
mimics whatever comes after the -C
on the command-line, having it
unset, set to an empty string, or set to 0, enable actually three
completely different behaviours. (This is all documented in perlrun along
the -C
option.)