April 2012 Archives

Perl Unicode Cookbook: Decode Standard Filehandles as Locale Encoding

By Tom Christiansen on April 30, 2012 6:00 AM

℞ 16: Declare `STD{IN,OUT,ERR}` to be in locale encoding

Always convert to and from your desired encoding at the edges of your programs. This includes the standard filehandles STDIN, STDOUT, and STDERR. While it may be most common for modern operating systems to support UTF-8 in filehandle settings, you may need to use other encodings.

Perl can respect your current locale settings for its default filehandles. Start by installing the Encode::Locale module from the CPAN.

    # cpan -i Encode::Locale
    use Encode;
    use Encode::Locale;

    # or as a stream for binmode or open
    binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;

The Encode::Locale module allows you to use "whatever encoding the attached terminal expects" for input and output filehandles attached to terminals. It also allows you to specify "whatever encoding the file system uses for file names"; see the documentation for more.
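
To check what those aliases resolve to on a particular machine, something like the following sketch may help. It assumes Encode::Locale is installed and prints a hint if it is not; the names %resolved and $have_module are illustrative only and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical names for this sketch: $have_module and %resolved.
my $have_module = eval { require Encode::Locale; 1 } ? 1 : 0;
my %resolved;

if ($have_module) {
    # Encode::Locale registers these aliases with Encode when it loads.
    for my $alias (qw(locale locale_fs console_in console_out)) {
        my $encoding = Encode::find_encoding($alias);
        $resolved{$alias} = $encoding ? $encoding->name : "(unknown)";
    }
    printf "%-11s => %s\n", $_, $resolved{$_} for sort keys %resolved;
}
else {
    print "Encode::Locale is not installed; try: cpan -i Encode::Locale\n";
}
```

The output depends entirely on the local environment, so none is shown here.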

Previous: ℞ 15: Decode Standard Filehandles as UTF-8

Series Index: The Standard Preamble

Next: ℞ 17: Make File I/O Default to UTF-8

Perl Unicode Cookbook: Decode Standard Filehandles as UTF-8

By Tom Christiansen on April 27, 2012 6:00 AM

℞ 15: Declare `STD{IN,OUT,ERR}` to be UTF-8

Always convert to and from your desired encoding at the edges of your programs. This includes the standard filehandles STDIN, STDOUT, and STDERR.

As documented in perldoc perlrun, both the PERL_UNICODE environment variable and the -C command-line flag let you tell Perl to treat these filehandles as UTF-8, decoding input and encoding output. Use the S option:

     $ perl -CS ...
     # or
     $ export PERL_UNICODE=S

Within your program, the open pragma allows you to set the default encoding of these filehandles all at once:

     use open qw(:std :utf8);

Because Perl uses IO layers to implement encoding and decoding, you may also use the binmode operator on filehandles directly:

     binmode(STDIN,  ":utf8");
     binmode(STDOUT, ":utf8");
     binmode(STDERR, ":utf8");
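
To see what an encoding layer actually does, here is a minimal sketch that uses an in-memory file instead of STDOUT so that the resulting octets can be inspected. The variable names are illustrative, and the stricter :encoding(UTF-8) layer stands in for :utf8.

```perl
use strict;
use warnings;

# Write one character through an encoding layer into an in-memory file.
my $octets = '';
open(my $out, '>', \$octets) or die "cannot open in-memory file: $!";
binmode($out, ':encoding(UTF-8)');
print {$out} "\x{3A3}";             # GREEK CAPITAL LETTER SIGMA
close($out) or die "close failed: $!";

my $hex = sprintf '%vX', $octets;   # octets in hex, joined with dots
print "$hex\n";                     # CE.A3
```

One character went in and two octets came out, which is the encoding step that the layer performs at the edge of the program.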

Previous: ℞ 14: Decode @ARGV as Local Encoding

Series Index: The Standard Preamble

Next: ℞ 16: Decode Standard Filehandles as Locale Encoding

Perl Unicode Cookbook: Decode @ARGV as Local Encoding

By Tom Christiansen on April 26, 2012 6:00 AM

℞ 14: Decode program arguments as locale encoding

While it may be most common in modern operating systems for your command-line arguments to be encoded as UTF-8, @ARGV may use other encodings. If you have configured your system with a proper locale, you may need to decode @ARGV appropriately. Unlike automatic UTF-8 @ARGV decoding, you must do this manually.

Install the Encode::Locale module from the CPAN:

    # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_, 1) } @ARGV;

Previous: ℞ 13: Decode @ARGV as UTF-8

Series Index: The Standard Preamble

Next: ℞ 15: Decode Standard Filehandles as UTF-8

Perl Unicode Cookbook: Decode @ARGV as UTF-8

By Tom Christiansen on April 24, 2012 6:00 AM

℞ 13: Decode program arguments as utf8

While the standard Perl Unicode preamble makes Perl's filehandles use UTF-8 encoding by default, filehandles aren't the only sources and sinks of data. The command-line arguments to your programs, available through @ARGV, may also need decoding.

You can have Perl handle this operation for you automatically in two ways, or you may do it yourself manually. As documented in perldoc perlrun, the -C flag controls Unicode features. Use the A modifier for Perl to treat your arguments as UTF-8 strings:

     $ perl -CA ...

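You may, of course, use -C on the shebang line of your programs. This works when you run the script directly; if you pass the script to perl as an argument, Perl 5.10.1 and later require the same -C flag on the command line as well.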

The second approach is to use the PERL_UNICODE environment variable. It takes the same values as the -C flag; to get the same effect as -CA, write:

     $ export PERL_UNICODE=A

You may temporarily disable this automatic Unicode treatment with PERL_UNICODE=0.

Finally, you may decode the contents of @ARGV yourself manually with the Encode module:

    use Encode qw(decode_utf8);
    @ARGV = map { decode_utf8($_, 1) } @ARGV;
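
The manual approach can be tried without touching the real @ARGV. In this sketch, @raw_args is a hypothetical stand-in for the arguments a shell would pass, and the check flags are one reasonable choice rather than the only one.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for @ARGV: the UTF-8 octets a shell would pass for "café".
my @raw_args = ("caf\xC3\xA9");

# ${^UNICODE} holds the -C / PERL_UNICODE flags; the value 32 is the A option.
my $argv_already_decoded = ${^UNICODE} & 32;
print "automatic \@ARGV decoding is ", ($argv_already_decoded ? "on" : "off"), "\n";

# Croak on malformed input, but leave the source strings untouched.
my $check   = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my @decoded = map { decode_utf8($_, $check) } @raw_args;

printf "%d octets became %d characters\n",
    length($raw_args[0]), length($decoded[0]);   # 5 octets became 4 characters
```

Checking ${^UNICODE} first is one way to avoid decoding arguments twice when -CA or PERL_UNICODE=A is already in effect.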

Previous: ℞ 12: Explicit encode/decode

Series Index: The Standard Preamble

Next: ℞ 14: Decode @ARGV as Local Encoding

Perl Unicode Cookbook: Explicit encode/decode

By Tom Christiansen on April 23, 2012 6:00 AM

℞ 12: Explicit encode/decode

While the standard Perl Unicode preamble makes Perl's filehandles use UTF-8 encoding by default, filehandles aren't the only sources and sinks of data. On rare occasions, such as a database read, you may be given encoded text you need to decode.

The core Encode module offers two functions to handle these conversions. (Remember that decode() means to convert octets from a known encoding into Perl's internal Unicode form and encode() means to convert from Perl's internal form into a known encoding.)

  use Encode qw(encode decode);

  # given $bytes, containing octets in a known encoding
  my $chars = decode("shiftjis", $bytes, 1);

  # given $chars, a string encoded in Perl's internal format
  my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

For streams all in the same encoding, don't use encode/decode; instead set the file encoding when you open the file, or immediately afterward with binmode, as described in a later recipe. Remember the canonical rule of Unicode: always encode/decode at the edges of your application.
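
A small round trip makes the difference between octets and characters visible. This is only a sketch: it uses UTF-8 on both edges for simplicity, and the variable names and check flags are illustrative choices.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Croak on malformed data, and do not modify the source string.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# UTF-8 octets for the two characters U+6771 U+4EAC, as read from outside.
my $bytes_in = "\xE6\x9D\xB1\xE4\xBA\xAC";

my $chars     = decode("UTF-8", $bytes_in, $check);   # edge: octets to characters
my $bytes_out = encode("UTF-8", $chars,    $check);   # edge: characters to octets

printf "%d octets, %d characters, round trip %s\n",
    length($bytes_in), length($chars),
    ($bytes_out eq $bytes_in ? "ok" : "FAILED");
# prints: 6 octets, 2 characters, round trip ok
```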

Previous: ℞ 11: Names of CJK Codepoints

Series Index: The Standard Preamble

Next: ℞ 13: Decode @ARGV as UTF-8

Perl Unicode Cookbook: Names of CJK Codepoints

By Tom Christiansen on April 20, 2012 6:00 AM

℞ 11: Names of CJK codepoints

CJK refers to Chinese, Japanese, and Korean. In the context of Unicode, it usually refers to the Han ideographs used in the modern Chinese and Japanese writing systems. As you can expect, pictorial languages such as Chinese make Unicode handling more complex.

Sinograms like "東京" come back with character names of CJK UNIFIED IDEOGRAPH-6771 and CJK UNIFIED IDEOGRAPH-4EAC, because their "names" vary between languages. The CPAN Unicode::Unihan module has a large database for decoding these (and a whole lot more), provided you know how to understand its output.

 # cpan -i Unicode::Unihan
 use Unicode::Unihan;
 my $str   = "東京";
 my $unhan = Unicode::Unihan->new;
 for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
     printf "CJK $str in %-12s is ", $lang;
     say $unhan->$lang($str);
 }

prints:

 CJK 東京 in Mandarin     is DONG1JING1
 CJK 東京 in Cantonese    is dung1ging1
 CJK 東京 in Korean       is TONGKYENG
 CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
 CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

If you have a specific romanization scheme in mind, use the specific module:

 # cpan -i Lingua::JA::Romanize::Japanese
 use Lingua::JA::Romanize::Japanese;
 my $k2r = Lingua::JA::Romanize::Japanese->new;
 my $str = "東京";
 say "Japanese for $str is ", $k2r->chars($str);

prints:

 Japanese for 東京 is toukyou

Previous: ℞ 10: Custom Named Characters

Series Index: The Standard Preamble

Next: ℞ 12: Explicit encode/decode

Perl Unicode Cookbook: Custom Named Characters

By Tom Christiansen on April 19, 2012 6:00 AM

℞ 10: Custom named characters

As several other recipes demonstrate, the charnames pragma offers considerable power to use and manipulate Unicode characters by their names. Its :alias option allows you to give your own lexically scoped nicknames to existing characters, or even to give unnamed private-use characters useful names:

 use charnames ":full", ":alias" => {
     ecute => "LATIN SMALL LETTER E WITH ACUTE",
     "APPLE LOGO" => 0xF8FF, # private use character
 };

 "\N{ecute}"
 "\N{APPLE LOGO}"

You may even override existing names (lexically, of course) with different characters.

This feature has some limitations. For best effect, aliases should hew to the rules of ASCII identifiers and must not resemble regex quantifiers. You can only alias one character at a time; other options exist to give a character sequence an alias.

As always, the documentation of the charnames pragma offers more details.
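
As a runnable illustration, the following sketch defines one alias and confirms that it produces the same character as the full name. The word being built is arbitrary.

```perl
use strict;
use warnings;
use charnames ":full", ":alias" => {
    ecute => "LATIN SMALL LETTER E WITH ACUTE",
};

my $word = "caf\N{ecute}";
my $same = $word eq "caf\N{LATIN SMALL LETTER E WITH ACUTE}";

printf "U+%04X %s\n", ord(substr($word, -1)), ($same ? "same" : "different");
# prints: U+00E9 same
```

Because the alias is lexically scoped, it is only visible in the file or block that declares it.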

Previous: ℞ 9: Unicode Named Character Sequences

Series Index: The Standard Preamble

Next: ℞ 11: Names of CJK Codepoints

Perl Unicode Cookbook: Unicode Named Character Sequences

By Tom Christiansen on April 17, 2012 6:00 AM

℞ 9: Unicode named sequences

Unicode includes the feature of named character sequences, which combine multiple Unicode characters behind a single name. The charnames pragma allows the use of these named sequences in literals, just as it allows the use of Unicode named characters in literals.

In Perl, these named character sequences look just like character names but return multiple codepoints. Notice the %v vector-print flag of printf (here in %v04X), which formats each codepoint of the string separately and joins them with dots:

 use charnames qw(:full);
 my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
 printf "U+%v04X\n", $seq;

prints:

 U+0100.0300

While each version of Unicode may update the official list of named sequences, the latest version of the Unicode Named Sequences data file is always available. Perl 5.14 supports Unicode 6.0, and Perl 5.16 will support Unicode 6.1.

Previous: ℞ 8: Unicode Named Characters

Series Index: The Standard Preamble

Next: ℞ 10: Custom Named Characters

Perl Unicode Cookbook: Unicode Named Characters

By Tom Christiansen on April 16, 2012 6:00 AM

℞ 8: Unicode named characters

Use the \N{charname} notation to get the character by that name for use in interpolated literals (double-quoted strings and regexes). In v5.16, there is an implicit

 use charnames qw(:full :short);

But prior to v5.16, you must be explicit about which set of charnames you want. The :full names are the official Unicode character names, aliases, and sequences, which all share a namespace.

 use charnames qw(:full :short latin greek);

 "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
 "\N{GREEK CAPITAL LETTER SIGMA}"       # :full

Anything else is a Perl-specific convenience abbreviation. Specify one or more scripts by names if you want short names that are script-specific.

 "\N{Greek:Sigma}"                      # :short
 "\N{ae}"                               #  latin
 "\N{epsilon}"                          #  greek

The v5.16 release also supports a :loose import for loose matching of character names, which works just like loose matching of property names: that is, it disregards case, whitespace, and underscores:

 "\N{euro sign}"                        # :loose (from v5.16)

(You do not have to use the charnames pragma to interpolate Unicode characters by number into literals with the \N{...} sequence.)
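
To illustrate that last point, this short sketch compares a character written by number with the same character written by name. The explicit charnames import is only there so the name form also works before v5.16.

```perl
use strict;
use warnings;
use charnames qw(:full);   # needed for names before v5.16, harmless after

my $by_number = "\N{U+3A3}";                        # works without charnames
my $by_name   = "\N{GREEK CAPITAL LETTER SIGMA}";   # a :full name

my $verdict = ($by_number eq $by_name) ? "same character" : "different";
print "$verdict\n";   # same character
```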

Previous: ℞ 7: Get Character Number by Name

Series Index: The Standard Preamble

Next: ℞ 9: Unicode Named Character Sequences

Perl Unicode Cookbook: Get Character Number by Name

By Tom Christiansen on April 13, 2012 6:00 AM

℞ 7: Get character number by name

Unicode allows you to refer to characters by number or by name. Computers don't care, but humans do. When you have a character name, you can translate it to its number with the charnames pragma:

 use charnames ();
 my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

This is, of course, the opposite of Get Character Names by Number.

See Characters and Their Numbers to translate from this number to the appropriate character.
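
Putting the two steps together, a minimal sketch goes from name to number to character. The variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ();

my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
my $char   = chr($number);   # from the number back to the character

printf "U+%04X has length %d\n", $number, length($char);
# prints: U+03A3 has length 1
```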

Previous: ℞ 6: Get Character Names by Number

Series Index: The Standard Preamble

Next: ℞ 8: Unicode Named Characters

Perl Unicode Cookbook: Get Character Names by Number

By Tom Christiansen on April 12, 2012 6:00 AM

℞ 6: Get character name by number

Unicode allows you to refer to characters by number or by name. Computers don't care, but humans do. When you have a character number, you can translate it to its name with the charnames pragma:

 use charnames ();
 my $name = charnames::viacode(0x03A3);

charnames::viacode() returns the full Unicode name of the given codepoint—in this case, GREEK CAPITAL LETTER SIGMA. You may embed this as a literal string in your source code as "\N{GREEK CAPITAL LETTER SIGMA}".

Use charnames::string_vianame() to convert a Unicode name to the appropriate Unicode character during runtime.
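
The two functions combine into a simple round trip. This sketch assumes Perl 5.14 or later, where charnames::string_vianame() is available.

```perl
use strict;
use warnings;
use charnames ();

my $name = charnames::viacode(0x03A3);          # name from number
my $char = charnames::string_vianame($name);    # character from name, at run time

printf "%s is U+%04X\n", $name, ord($char);
# prints: GREEK CAPITAL LETTER SIGMA is U+03A3
```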

Previous: ℞ 5: Unicode Literals by Number

Series Index: The Standard Preamble

Next: ℞ 7: Get Character Number by Name

Perl Unicode Cookbook: Unicode Literals by Number

By Tom Christiansen on April 10, 2012 6:00 AM

℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a regex, you may specify a character by its number using the \x{HHHHHH} escape.

 String: "\x{3a3}"
 Regex:  /\x{3a3}/

 String: "\x{1d45b}"
 Regex:  /\x{1d45b}/

 # even non-BMP ranges in regex work fine
 /[\x{1D434}-\x{1D467}]/

The BMP (or Basic Multilingual Plane, or Plane 0) contains the most common Unicode characters; it covers 0x0000 through 0xFFFF. Characters in other planes are much more specialized. They often include characters of historical interest.

Use Unicode charts to find character numbers, or see the recipe for translating characters to numbers and vice versa.
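
The escapes from this recipe can be checked with a few lines. In this sketch the variable names are illustrative; the codepoints and the range are the ones used in the recipe.

```perl
use strict;
use warnings;

my $sigma    = "\x{3a3}";     # in the BMP
my $italic_n = "\x{1d45b}";   # beyond the BMP

# One codepoint is one character, whatever the plane.
my @lengths = map { length($_) } $sigma, $italic_n;

# The non-BMP range matches the non-BMP character only.
my $range    = qr/[\x{1D434}-\x{1D467}]/;
my @in_range = map { $_ =~ $range ? 1 : 0 } $sigma, $italic_n;

print "lengths: @lengths, in range: @in_range\n";
# prints: lengths: 1 1, in range: 0 1
```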

Previous: ℞ 4: Characters and Their Numbers

Series Index: The Standard Preamble

Next: ℞ 6: Get Character Names by Number

Perl Unicode Cookbook: Characters and Their Numbers

By Tom Christiansen on April 9, 2012 6:00 AM

℞ 4: Characters and their numbers

Do you need to translate a codepoint to a character or a character to its codepoint? The ord and chr functions work transparently on all codepoints: not just on ASCII, and in fact not even just on Unicode.

 # ASCII characters
 ord("A")
 chr(65)

 # characters from the Basic Multilingual Plane
 ord("Σ")
 chr(0x3A3)

 # beyond the BMP
 ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
 chr(0x1D45B)

 # beyond Unicode! (up to MAXINT)
 ord("\x{20_0000}")
 chr(0x20_0000)

(Remember to enable the standard Perl Unicode preamble to use UTF-8 in literal strings in your source code and to encode output properly.)

Previous: ℞ 3: Enable UTF-8 Literals

Series Index: The Standard Preamble

Next: ℞ 5: Unicode Literals by Number

Perl Unicode Cookbook: Enable UTF-8 Literals

By Tom Christiansen on April 6, 2012 6:00 AM

℞ 3: Declare source in UTF-8 for identifiers and literals

Without the all-critical use utf8 declaration, putting UTF-8 in your literals and identifiers won't work right. The standard Perl Unicode preamble already includes it. With it in effect, you can do things like this:

 use utf8;

 my $measure   = "Ångström";
 my @μsoft     = qw( cp852 cp1251 cp1252 );
 my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
 my @鯉        = qw( koi8-f koi8-u koi8-r );
 my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

If you forget use utf8, high bytes will be misunderstood as separate characters, and nothing will work right. Remember that this pragma only affects the interpretation of literal UTF-8 in your source code.
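
The effect on a literal can be simulated without relying on how this example itself is saved. The sketch below spells out the UTF-8 octets of "Ångström" with escapes and uses the built-in utf8::decode() to stand in for what use utf8 does to a literal; that substitution is an assumption made for illustration.

```perl
use strict;
use warnings;

# The UTF-8 octets of "Ångström" as they would sit in a source file.
my $as_bytes = "\xC3\x85ngstr\xC3\xB6m";

# Without "use utf8", Perl sees one character per octet.
my $without = length($as_bytes);

# With "use utf8", Perl decodes the literal; utf8::decode mimics that here.
my $as_chars = $as_bytes;
utf8::decode($as_chars) or die "not valid UTF-8";
my $with = length($as_chars);

print "without use utf8: $without, with use utf8: $with\n";
# prints: without use utf8: 10, with use utf8: 8
```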

Previous: ℞ 2: Fine-Tuning Unicode Warnings

Series Index: The Standard Preamble

Next: ℞ 4: Characters and Their Numbers

Perl Unicode Cookbook: Fine-Tuning Unicode Warnings

By Tom Christiansen on April 5, 2012 6:00 AM

℞ 2: Fine-tuning Unicode warnings

It's easy to get Unicode wrong, especially when handling user input and dealing with multiple encodings. Perl is happy to help you detect unexpected conditions of your data. Perl is also happy to let you decide if these unexpected conditions are worth warning about.

As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings. While the utf8 lexical warning category existed prior to 5.14, you may now handle these warnings individually:

 use v5.14;                  # subwarnings unavailable any earlier
 no warnings "nonchar";      # the 66 forbidden non-characters
 no warnings "surrogate";    # UTF-16/CESU-8 nonsense
 no warnings "non_unicode";  # for codepoints over 0x10_FFFF
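
To watch one of these subcategories at work, the following sketch writes two non-characters to an in-memory file that has the :utf8 layer and counts the warnings. The handler and variable names are illustrative, and the exact wording of the warning varies between Perl versions.

```perl
use v5.14;
use warnings;

# Collect warnings instead of printing them.
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

# An in-memory file with the :utf8 layer, so print checks what it writes.
my $sink = '';
open(my $out, '>:utf8', \$sink) or die "cannot open in-memory file: $!";

print {$out} chr(0xFFFF);        # non-character: warns
{
    no warnings "nonchar";
    print {$out} chr(0xFFFE);    # non-character: silenced in this block
}
close($out) or die "close failed: $!";

say scalar(@caught), " warning(s) caught";
```

Only the first print should be reported, because the second sits inside the block that disables the nonchar category.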

Previous: ℞ 1: Always Decompose and Recompose

Series Index: The Standard Preamble

Next: ℞ 3: Enable UTF-8 Literals

Perl Unicode Cookbook: Always Decompose and Recompose

By Tom Christiansen on April 3, 2012 6:00 AM

℞ 1: Generic Unicode-savvy filter

Unicode allows multiple representations of the same characters. Comparing such strings for equivalence (sorting, searching, exact matching) requires care—including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.

To handle Unicode effectively, always decompose on the way in, then recompose on the way out.

 use Unicode::Normalize;

 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
     ...
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }

See the Unicode Normalization FAQ for more details.
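
A tiny comparison shows why normalization matters. In this sketch, the same visible letter is built in two ways; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

my $composed   = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";   # "e" followed by COMBINING ACUTE ACCENT

my $raw_equal = ($composed eq $decomposed)           ? "yes" : "no";
my $nfd_equal = (NFD($composed) eq NFD($decomposed)) ? "yes" : "no";
my $nfc_len   = length(NFC($decomposed));

print "raw equal: $raw_equal, NFD equal: $nfd_equal, NFC length: $nfc_len\n";
# prints: raw equal: no, NFD equal: yes, NFC length: 1
```

The two strings only compare equal once both have been brought to the same normalization form.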

Series Index: The Standard Preamble

Next: ℞ 2: Fine-Tuning Unicode Warnings

Perl Unicode Cookbook: The Standard Preamble

By Tom Christiansen on April 2, 2012 6:00 AM

Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. This is the first recipe in the series.

℞ 0: Standard preamble

Unless otherwise noted, all examples in this cookbook require this standard preamble to work correctly, with the #! adjusted to work on your system:

 #!/usr/bin/env perl

 use utf8;      # so literals and identifiers can be in UTF-8
 use v5.12;     # or later to get "unicode_strings" feature
 use strict;    # quote strings, declare variables
 use warnings;  # enable all warnings
 use warnings  qw(FATAL utf8);    # fatalize encoding glitches
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

This preamble does force even Unix programmers to binmode their binary streams, or to open them with :raw, but that's the only way to get at them portably anyway.

WARNING: use autodie and use open do not get along with each other.

This combination of features sets Perl to a known state of Unicode compatibility and strictness, so that subsequent operations behave as you expect.

The other recipes in this cookbook are:

℞ 0: The Standard Preamble
℞ 1: Always Decompose and Recompose
℞ 2: Fine-Tuning Unicode Warnings
℞ 3: Enable UTF-8 Literals
℞ 4: Characters and Their Numbers
℞ 5: Unicode Literals by Number
℞ 6: Get Character Names by Number
℞ 7: Get Character Number by Name
℞ 8: Unicode Named Characters
℞ 9: Unicode Named Character Sequences
℞ 10: Custom Named Characters
℞ 11: Names of CJK Codepoints
℞ 12: Explicit encode/decode
℞ 13: Decode @ARGV as UTF-8
℞ 14: Decode @ARGV as Local Encoding
℞ 15: Decode Standard Filehandles as UTF-8
℞ 16: Decode Standard Filehandles as Locale Encoding
℞ 17: Make File I/O Default to UTF-8
℞ 18: Make All I/O Default to UTF-8
℞ 19: Specify a File's Encoding
℞ 20: Unicode Casing
℞ 21: Case-insensitive Comparisons
℞ 22: Match Unicode Linebreak Sequence
℞ 23: Get Character Categories
℞ 24: Disable Unicode-awareness in Builtin Character Classes
℞ 25: Match Unicode Properties in Regex
℞ 26: Custom Character Properties
℞ 27: Unicode Normalization
℞ 28: Convert non-ASCII Unicode Numerics
℞ 29: Match Unicode Grapheme Cluster in Regex
℞ 30: Extract by Grapheme Instead of Codepoint (regex)
℞ 31: Extract by Grapheme Instead of Codepoint (substr)
℞ 32: Reverse String by Grapheme
℞ 33: String Length in Graphemes
℞ 34: Unicode Column Width for Printing
℞ 35: Unicode Collation
℞ 36: Case- and Accent-insensitive Sorting
℞ 37: Unicode Locale Collation
℞ 38: Make cmp Work on Text instead of Codepoints
℞ 39: Case- and Accent-insensitive Comparison
℞ 40: Case- and Accent-insensitive Locale Comparisons
℞ 41: Unicode Linebreaking
℞ 42: Unicode Text in Stubborn Libraries
℞ 43: Unicode Text in DBM Files (the easy way)
℞ 44: Demo of Unicode Collation and Printing
℞ 45: Further Resources
