Skip to content

Latest commit

 

History

History
79 lines (62 loc) · 3.25 KB

03-dash-C-CLI.mkdn

File metadata and controls

79 lines (62 loc) · 3.25 KB

The -C command-line option

In one-liners, you can use the -C command-line option to tell perl that Unicode is to be expected (or not) from some part of the program environment. All forms of the option are documented in the perlrun man page, but the most interesting for our purposes is -CS, that says that standard input, output and error are all three expected to get Unicode data encoded in UTF-8. (The opposite is -C0, that will ensure that nothing is expected to get Unicode data by default.)

Output

Here are some examples. Let's take character U+00FF, also known as LATIN SMALL LETTER Y WITH DIAERESIS (ÿ), and see what perl is actually outputting by inspecting its standard output with xxd hex dumper:

$ perl -E 'print "\xff"' | xxd
0000000: ff                                       .
$ perl -CS -E 'print "\xff"' | xxd
0000000: c3bf                                     ..

Without -CS, perl outputs the single byte 0xff, which is not valid UTF-8. With it, perl outputs the correct UTF-8 encoding of the U+00FF character, which consists of the two bytes 0xc3 and 0xbf, in that order. Hey, that works!

UTF-8

UTF-8 encoding is of variable-length kind. Each character requires from 1 byte to 4 bytes to hold the encoded bitstream.

Byte(s) Bits Encoded
1 6..0 0bbbbbbb
2 10..6,5..0 110bbbbb 10bbbbbb
3 15..12,11..6,5..0 1110bbbb 10bbbbbb 10bbbbbb
4 20..18,17..12,11..6,5..0 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb

The U+00FF character needs 8 bits to the store 0xff value. The encoded bitstream thus requires 2 bytes:

$ perl -e 'printf "%b\n", hex "c3bf"'
1100001110111111
# 110_00011 + 10_111111
#     00011      111111
#        11      111111
$ perl -e 'printf "%b\n", hex "ff"'
11111111

Input

Now the reverse. What does perl see when it reads the two bytes 0xc3 and 0xbf?

$ perl -E 'print "\xc3\xbf"' | perl -E '$x = <>; printf "%d %x\n", length $x, ord $x'
2 c3
$ perl -E 'print "\xc3\xbf"' | perl -CS -E '$x = <>; printf "%d %x\n", length $x, ord $x'
1 ff

In the first case, perl was told it was getting bytes, and that's what it got: two of them, in a two-character long string.

In the second case, perl decoded those bytes as a UTF-8-encoded string, and read a string containing one character, namely ÿ.

The PERL_UNICODE environment variable

By default perl behaves as if -C0 was given on the command-line, but you can override that default by setting the PERL_UNICODE environment variable, before starting your program. (It's not going to work if you set it once your program has started.) Note that because this variable mimics whatever comes after the -C on the command-line, having it unset, set to an empty string, or set to 0, enable actually three completely different behaviours. (This is all documented in perlrun along the -C option.)