|
perlunicode - Unicode support in Perl
Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
People who want to learn to use Unicode in Perl should probably read
the Perl Unicode tutorial before reading this reference
document.
- Input and Output Layers
-
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See the open manpage.
-
To indicate that the Perl source itself is in UTF-8, add the
use utf8; pragma to it.
- Regular Expressions
-
The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with data that is internally encoded in
UTF-8 -- or instead uses a traditional byte scheme when presented with
byte data.
- use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts
-
As a compatibility measure, the use utf8 pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. These are the only times when an explicit use utf8
is needed. See the utf8 manpage.
- BOM-marked scripts and UTF-16 scripts autodetected
-
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)
- use encoding needed to upgrade non-Latin-1 byte strings
-
By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.
-
See Byte and Character Semantics for more details.
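The layer and upgrade behavior described in the list above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes an ASCII-based platform and a reasonably modern Perl 5, and it uses in-memory filehandles (a reference to a scalar in place of a filename) so that it is self-contained. The variable names are made up for the example.

```perl
use strict;
use warnings;

# Bytes as they would sit in a UTF-8 encoded file: "caf" followed by the
# two-byte UTF-8 sequence for U+00E9.
my $utf8_bytes = "caf\xC3\xA9";

# Input: the :encoding(UTF-8) layer converts the bytes to characters.
open(my $in, '<:encoding(UTF-8)', \$utf8_bytes) or die "open failed: $!";
my $decoded = <$in>;
close($in);

# Output: the same layer converts characters back to bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} "\x{263A}";
close($out);

# Implicit upgrade: joining the raw bytes with a wide character treats
# every byte as one Latin-1 character, so the two-byte sequence stays
# two characters instead of becoming one.
my $upgraded = $utf8_bytes . "\x{263A}";

printf "decoded:  %d characters\n", length($decoded);
printf "written:  %d bytes\n",      length($buffer);
printf "upgraded: %d characters\n", length($upgraded);
```

The last value is the asymmetry in action: data that was never decoded through a layer is counted byte by byte once it meets a Unicode string.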
Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.
In future, Perl-level operations will be expected to work with
characters rather than bytes.
However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
The bytes pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See the bytes manpage.
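As a rough sketch of the lexical scoping described above, the following compares length() inside and outside a block that uses the bytes pragma. The byte count shown in the comments assumes an ASCII-based platform, where the internal encoding is UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";           # one character, U+263A

my $char_len = length($smiley);    # character semantics: 1

my $byte_len;
{
    use bytes;                     # byte semantics, this block only
    $byte_len = length($smiley);   # bytes of the internal encoding
}

my $after_len = length($smiley);   # pragma no longer in effect: 1

printf "chars=%d bytes=%d after=%d\n", $char_len, $byte_len, $after_len;
```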
The utf8 pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See the utf8 manpage.
Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The bytes pragma should
be used to force byte semantics on Unicode data.
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as ISO 8859-1 (Latin-1), even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding.
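A small sketch of this concatenation rule, assuming an ASCII-based platform: a one-byte string is joined to a string holding a wide character, and the byte is carried over as the Latin-1 character with the same number.

```perl
use strict;
use warnings;

my $bytes   = "\xE9";       # a single byte, 0xE9
my $unicode = "\x{263A}";   # a single wide character

my $joined = $bytes . $unicode;

# The byte 0xE9 is decoded as Latin-1, that is, as U+00E9.
my $first  = ord(substr($joined, 0, 1));
my $second = ord(substr($joined, 1, 1));

printf "length=%d first=U+%04X second=U+%04X\n",
    length($joined), $first, $second;
```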
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See the perluniintro manpage for more.
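To make the distinction between a character and its byte sequence concrete, the sketch below encodes a few code points explicitly as UTF-8 with the core Encode module. The choice of code points is arbitrary; in each case length() on the original string still reports one character.

```perl
use strict;
use warnings;
use Encode qw(encode);

my %utf8_len;
for my $cp (0x41, 0xE9, 0x263A, 0x1F600) {
    my $char = chr($cp);
    # length() counts characters; the encoded form may need several bytes.
    $utf8_len{$cp} = length(encode('UTF-8', $char));
    printf "U+%04X: %d character, %d byte(s) as UTF-8\n",
        $cp, length($char), $utf8_len{$cp};
}
```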
Character semantics have the following effects:
-
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or use utf8, the latter requires a BOM.)
Unicode characters can also be added to a string by using the \x{...}
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
\x{263A}. This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.
Additionally, if you
use charnames ':full';
you can use the \N{...} notation and put the official Unicode
character name within the braces, such as \N{WHITE SMILING FACE}.
-
If an appropriate encoding is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.
-
Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.
-
Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. \w can be used to match a Japanese
ideograph, for instance.
-
Named Unicode properties, scripts, and block ranges may be used like
character classes via the \p{} "matches property" construct and
the \P{} negation, "doesn't match property".
See Unicode Character Properties for more details.
You can define your own character properties and use them
in the regular expression with the \p{} or \P{} construct.
See User-Defined Character Properties for more details.
-
The special pattern \X matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. \X is equivalent to
(?:\PM\pM*).
-
The tr/// operator translates characters instead of bytes. Note
that the tr///CU functionality has been removed. For similar
functionality see pack('U0', ...) and pack('C0', ...).
-
Case translation operators use the Unicode case translation tables
when character input is provided. Note that uc(), or \U in
interpolated strings, translates to uppercase, while ucfirst,
or \u in interpolated strings, translates to titlecase in languages
that make the distinction.
-
Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
chop(), chomp(), substr(), pos(), index(), rindex(),
sprintf(), write(), and length(). An operator that
specifically does not switch is vec(). Operators that really don't
care include operators that treat strings as a bucket of bits such as
sort(), and operators dealing with filenames.
-
The pack()/unpack() letter C does not change, since it is often
used for byte-oriented formats. Again, think char in the C language.
There is a new U specifier that converts between Unicode characters
and code points. There is also a W specifier that is the equivalent of
chr/ord and properly handles character values even if they are above 255.
-
The chr() and ord() functions work on characters, similar to
pack("W") and unpack("W"), not pack("C") and
unpack("C"). pack("C") and unpack("C") are methods for
emulating byte-oriented chr() and ord() on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.
-
The bit string operators, & | ^ ~, can operate on character data.
However, for backward compatibility, bit string operations on strings
whose characters are all less than 256 in ordinal value keep their
byte-wise behavior, so one should not use ~ (the bit complement) on
strings that mix characters below 256 with characters above 255.
Most importantly,
DeMorgan's laws (~($x|$y) eq ~$x&~$y and ~($x&$y) eq ~$x|~$y)
will not hold. The reason for this mathematical faux pas is that
the complement cannot return both the 8-bit (byte-wide) bit
complement and the full character-wide bit complement.
-
lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
-
the case mapping is from a single Unicode character to another
single Unicode character, or
-
the case mapping is from a single Unicode character to more
than one Unicode character.
Things to do with locales (Lithuanian, Turkish, Azeri) do not work
since Perl does not understand the concept of Unicode locales.
See the Unicode Technical Report #21, Case Mappings, for more details.
But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
See User-Defined Case Mappings for more details.
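Several of the effects listed above can be observed in a few lines. The following is only a sketch: the sample strings are arbitrary, all non-ASCII characters are written with \x{...} escapes so that the source stays plain ASCII, and the code points chosen for case mapping (U+01C6 and U+FB00) are examples rather than anything special to Perl.

```perl
use strict;
use warnings;
use charnames ':full';

# \x{...} and \N{...} name the same character.
my $same = "\x{263A}" eq "\N{WHITE SMILING FACE}";

# "." and the position-based operators count characters, not bytes.
my $str   = "a\x{263A}b";
my @chars = $str =~ /(.)/g;
my $pos_b = index($str, "b");

# \X keeps a base character together with its combining marks.
my $combined  = "e\x{301}x";    # e, COMBINING ACUTE ACCENT, x
my @graphemes = $combined =~ /(\X)/g;

# uc() maps to uppercase, ucfirst() to titlecase.
my $dz_upper = uc("\x{01C6}");        # U+01C4
my $dz_title = ucfirst("\x{01C6}");   # U+01C5

# A single character may map to more than one character.
my $ff_upper = uc("\x{FB00}");        # ligature ff becomes "FF"

printf "chars=%d index=%d graphemes=%d upper=U+%04X title=U+%04X ff=%s\n",
    scalar(@chars), $pos_b, scalar(@graphemes),
    ord($dz_upper), ord($dz_title), $ff_upper;
```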
Named Unicode properties, scripts, and block ranges may be used like
character classes via the \p{} "matches property" construct and
the \P{} negation, "doesn't match property".
For instance, \p{Lu} matches any character with the Unicode "Lu"
(Letter, uppercase) property, while \p{M} matches any character
with an "M" (mark--accents and such) property. Brackets are not
required for single letter properties, so \p{M} is equivalent to
\pM. Many predefined properties are available, such as
\p{Mirrored} and \p{Tibetan}.
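As an illustration of these constructs, the sketch below picks uppercase letters and marks out of a short decomposed string and strips the marks. The sample text and variable names are invented for the example; U+0F40 is used as a sample Tibetan letter.

```perl
use strict;
use warnings;

my $text = "E\x{301}cole";           # "E", combining acute accent, "cole"

my @upper = $text =~ /(\p{Lu})/g;    # uppercase letters
my @marks = $text =~ /(\p{M})/g;     # marks, with braces
my @short = $text =~ /(\pM)/g;       # same property, braces omitted

# Removing every mark leaves only the base characters.
(my $stripped = $text) =~ s/\pM//g;

my $is_tibetan = "\x{0F40}" =~ /\p{Tibetan}/ ? 1 : 0;

printf "upper=%d marks=%d short=%d stripped=%s tibetan=%d\n",
    scalar(@upper), scalar(@marks), scalar(@short), $stripped, $is_tibetan;
```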
The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". Latin-1 Supplement thus
becomes Latin1Supplement.
You can also use negation in both \p{} and \P{} by introducing a caret
(^) between the first brace and the property name: \p{^Tamil} is
equal to \P{Tamil}.
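The caret form can be checked against the plain forms directly. In this sketch U+0B85 (a Tamil letter) and "a" are arbitrary test characters, and the hash names are made up for the example.

```perl
use strict;
use warnings;

my %char = (
    tamil => "\x{0B85}",   # TAMIL LETTER A
    latin => "a",
);

my %match;
for my $name (sort keys %char) {
    my $ch = $char{$name};
    $match{$name} = {
        'p'  => ($ch =~ /\p{Tamil}/  ? 1 : 0),
        'P'  => ($ch =~ /\P{Tamil}/  ? 1 : 0),
        'p^' => ($ch =~ /\p{^Tamil}/ ? 1 : 0),   # should agree with \P{Tamil}
        'P^' => ($ch =~ /\P{^Tamil}/ ? 1 : 0),   # should agree with \p{Tamil}
    };
    printf "%-5s p=%d P=%d p^=%d P^=%d\n", $name,
        @{ $match{$name} }{ 'p', 'P', 'p^', 'P^' };
}
```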
NOTE: the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.
- General Category
-
Here are the basic Unicode General Category properties, followed by their
long form. You can use either; \p{Lu} and \p{UppercaseLetter},
for instance, are identical.
-
Short  Long
-
L      Letter
LC     CasedLetter
Lu     UppercaseLetter
Ll     LowercaseLetter
Lt     TitlecaseLetter
Lm     ModifierLetter
Lo     OtherLetter
-
M      Mark
Mn     NonspacingMark
Mc     SpacingMark
Me     EnclosingMark
-
N      Number
Nd     DecimalNumber
Nl     LetterNumber
No     OtherNumber
-
P      Punctuation
Pc     ConnectorPunctuation
Pd     DashPunctuation
Ps     OpenPunctuation
Pe     ClosePunctuation
Pi     InitialPunctuation
       (may behave like Ps or Pe depending on usage)
Pf     FinalPunctuation
       (may behave like Ps or Pe depending on usage)
Po     OtherPunctuation
-
S      Symbol
Sm     MathSymbol
Sc     CurrencySymbol
Sk     ModifierSymbol
So     OtherSymbol
-
Z      Separator
Zs     SpaceSeparator
Zl     LineSeparator
Zp     ParagraphSeparator
-
C      Other
Cc     Control
Cf     Format
Cs     Surrogate (not usable)
Co     PrivateUse
Cn     Unassigned
-
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
LC and L& are special cases, which are aliases for the set of
Ll, Lu, and Lt.
-
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. Cs is therefore not
supported.
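The relationship between short names, long names, single-letter properties, and LC can be checked on a handful of characters. This is a sketch only; the sample characters were picked to cover the Lu, Ll, Lt, Lm, and Nd categories, and the hash names are hypothetical.

```perl
use strict;
use warnings;

my %sample = (
    'A'      => "A",          # Lu
    'a'      => "a",          # Ll
    'U+01C5' => "\x{01C5}",   # Lt
    'U+02B0' => "\x{02B0}",   # Lm
    '5'      => "5",          # Nd
);

my %result;
for my $name (sort keys %sample) {
    my $ch = $sample{$name};
    $result{$name} = {
        Lu   => ($ch =~ /\p{Lu}/              ? 1 : 0),
        long => ($ch =~ /\p{UppercaseLetter}/ ? 1 : 0),
        L    => ($ch =~ /\p{L}/               ? 1 : 0),   # any letter
        LC   => ($ch =~ /\p{LC}/              ? 1 : 0),   # Ll, Lu, or Lt
    };
    printf "%-6s Lu=%d long=%d L=%d LC=%d\n", $name,
        @{ $result{$name} }{qw(Lu long L LC)};
}
```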
- Bidirectional Character Types
-
Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:
-
Property  Meaning
-
L         Left-to-Right
LRE       Left-to-Right Embedding
LRO       Left-to-Right Override
R         Right-to-Left
AL        Right-to-Left Arabic
RLE       Right-to-Left Embedding
RLO       Right-to-Left Override
PDF       Pop Directional Format
EN        European Number
ES        European Number Separator
ET        European Number Terminator
AN        Arabic Number
CS        Common Number Separator
NSM       Non-Spacing Mark
BN        Boundary Neutral
B         Paragraph Separator
S         Segment Separator
WS        Whitespace
ON        Other Neutrals
-
For example, \p{BidiClass:R} matches characters that are normally
written right to left.
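A hedged sketch of how such a lookup might be used: each sample character is tested against a few BidiClass values written in the \p{BidiClass:...} form shown above. The sample characters (a Hebrew letter, an Arabic letter, a Latin letter, and a digit) are assumptions chosen for the example, and the spelling of the property name relies on Perl's loose matching of property names.

```perl
use strict;
use warnings;

# One precompiled pattern per class of interest.
my %class_re = (
    L  => qr/\p{BidiClass:L}/,
    R  => qr/\p{BidiClass:R}/,
    AL => qr/\p{BidiClass:AL}/,
    EN => qr/\p{BidiClass:EN}/,
);

my %sample = (
    hebrew => "\x{05D0}",   # HEBREW LETTER ALEF
    arabic => "\x{0627}",   # ARABIC LETTER ALEF
    latin  => "a",
    digit  => "1",
);

my %bidi_of;
for my $name (sort keys %sample) {
    # Each character has exactly one Bidi class, so at most one pattern matches.
    my ($class) = grep { $sample{$name} =~ $class_re{$_} } sort keys %class_re;
    $bidi_of{$name} = defined $class ? $class : 'other';
    printf "%-6s %s\n", $name, $bidi_of{$name};
}
```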
- Scripts
-
The script names which can be used by \p{...} and \P{...},
such as in \p{Latin} or \p{Cyrillic}, are as follows:
-
Arabic
Armenian
Balinese
Bengali
Bopomofo
Braille
Buginese
Buhid
CanadianAboriginal
Cherokee
Coptic
Cuneiform
Cypriot
Cyrillic
Deseret
Devanagari
Ethiopic
Georgian
Glagolitic
Gothic
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hebrew
Hiragana
Inherited
Kannada
Katakana
Kharoshthi
Khmer
Lao
Latin
Limbu
LinearB
Malayalam
Mongolian
Myanmar
NewTaiLue
Nko
Ogham
OldItalic
OldPersian
Oriya
Osmanya
PhagsPa
Phoenician
Runic
Shavian
Sinhala
SylotiNagri
Syriac
Tagalog
Tagbanwa
TaiLe
Tamil
Telugu
Thaana
Thai
Tibetan
Tifinagh
Ugaritic
Yi
- Extended property classes
-
Extended property classes can supplement the basic
properties, defined by the PropList Unicode database:
-
ASCIIHexDigit
BidiControl
Dash
Deprecated
Diacritic
Extender
HexDigit
Hyphen
Ideographic
IDSBinaryOperator
IDSTrinaryOperator
JoinControl
LogicalOrderException
NoncharacterCodePoint
OtherAlphabetic
OtherDefaultIgnorableCodePoint
OtherGraphemeExtend
OtherIDStart
OtherIDContinue
OtherLowercase
OtherMath
OtherUppercase
PatternSyntax
PatternWhiteSpace
QuotationMark
Radical
SoftDotted
STerm
TerminalPunctuation
UnifiedIdeograph
VariationSelector
WhiteSpace
-
and there are further derived properties:
-
Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
Lowercase = Ll + OtherLowercase
Uppercase = Lu + OtherUppercase
Math = Sm + OtherMath
-
IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
-
DefaultIgnorableCodePoint
= OtherDefaultIgnorableCodePoint
+ Cf + Cc + Cs + Noncharacters + VariationSelector
- WhiteSpace - FFF9..FFFB (Annotation Characters)
-
Any = Any code points (i.e. U+0000 to U+10FFFF)
Assigned = Any non-Cn code points (i.e. synonym for \P{Cn})
Unassigned = Synonym for \p{Cn}
ASCII = ASCII (i.e. U+0000 to U+007F)
-
Common = Any character (or unassigned code point)
not explicitly assigned to a script
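A few of the extended and derived properties can be exercised as follows. This is a sketch under assumptions: the test characters are arbitrary, and U+0378 is used as an example of a code point that is unassigned in the Unicode versions known at the time of writing.

```perl
use strict;
use warnings;

# Small helper (hypothetical name) returning 1 or 0 for a regex match.
sub matches { my ($ch, $re) = @_; return $ch =~ $re ? 1 : 0 }

my %check = (
    # ASCIIHexDigit covers only the ASCII forms; HexDigit also covers
    # the fullwidth forms such as U+FF21.
    ascii_hex_f      => matches("f",        qr/\p{ASCIIHexDigit}/),
    ascii_hex_g      => matches("g",        qr/\p{ASCIIHexDigit}/),
    fullwidth_hex    => matches("\x{FF21}", qr/\p{HexDigit}/),
    fullwidth_ascii  => matches("\x{FF21}", qr/\p{ASCIIHexDigit}/),
    # Alphabetic includes Nl, for example U+2160 ROMAN NUMERAL ONE.
    roman_alpha      => matches("\x{2160}", qr/\p{Alphabetic}/),
    digit_alpha      => matches("5",        qr/\p{Alphabetic}/),
    # Assigned is the complement of Cn.
    a_assigned       => matches("a",        qr/\p{Assigned}/),
    u0378_unassigned => matches("\x{0378}", qr/\p{Unassigned}/),
    u0378_cn         => matches("\x{0378}", qr/\p{Cn}/),
    em_dash          => matches("\x{2014}", qr/\p{Dash}/),
    em_space         => matches("\x{2003}", qr/\p{WhiteSpace}/),
);

printf "%-17s %d\n", $_, $check{$_} for sort keys %check;
```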
- Use of "Is" Prefix
-
For backward compatibility (with Perl 5.6), all properties mentioned
so far may have Is prepended to their name, so \P{IsLu}, for
example, is equal to \P{Lu}.
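A brief check of the prefixed and unprefixed spellings, as a sketch only; the two test characters are arbitrary.

```perl
use strict;
use warnings;

my %agree;
for my $ch ("A", "a") {
    my $plain    = $ch =~ /\p{Lu}/   ? 1 : 0;
    my $prefixed = $ch =~ /\p{IsLu}/ ? 1 : 0;
    my $neg      = $ch =~ /\P{Lu}/   ? 1 : 0;
    my $neg_is   = $ch =~ /\P{IsLu}/ ? 1 : 0;

    # Both spellings should give the same answer for every character.
    $agree{$ch} = ($plain == $prefixed && $neg == $neg_is) ? 1 : 0;
    printf "%s: Lu=%d IsLu=%d notLu=%d notIsLu=%d\n",
        $ch, $plain, $prefixed, $neg, $neg_is;
}
```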
- Blocks
-
In addition to scripts, Unicode also defines blocks of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the Latin script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called Common.
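The script side of this distinction can be sketched with script properties alone. The sample characters are assumptions for the example: "a" and U+0101 are Latin letters that live in different blocks, while a digit and a comma are shared across scripts.

```perl
use strict;
use warnings;

my %sample = (
    'a'      => "a",          # Latin letter, Basic Latin block
    'U+0101' => "\x{0101}",   # Latin letter, Latin Extended-A block
    '5'      => "5",          # digit, shared across scripts
    ','      => ",",          # punctuation, shared across scripts
);

my %script;
for my $name (sort keys %sample) {
    my $ch = $sample{$name};
    $script{$name} = $ch =~ /\p{Latin}/  ? 'Latin'
                   : $ch =~ /\p{Common}/ ? 'Common'
                   :                       'other';
    printf "%-6s %s\n", $name, $script{$name};
}
```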
-
For more about scripts, see the UAX#24 "Script Names":
-
http://www.unicode.org/reports/tr24/
-
For more about blocks, see:
-
http://www.unicode.org/Public/UNIDATA/Blocks.txt
-
Block names are given with the In prefix. For ex |