|
perlunicode - Unicode support in Perl
Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
People who want to learn to use Unicode in Perl should probably read
the Perl Unicode tutorial before reading this reference
document.
- Input and Output Layers
-
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See the open manpage.
-
To indicate that the Perl source itself is in UTF-8, add the use utf8; pragma.
- Regular Expressions
-
The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with data that is internally encoded in
UTF-8 -- or instead uses a traditional byte scheme when presented with
byte data.
- use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts
-
As a compatibility measure, the use utf8 pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. These are the only times when an explicit use utf8
is needed. See the utf8 manpage.
- BOM-marked scripts and UTF-16 scripts autodetected
-
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)
- use encoding needed to upgrade non-Latin-1 byte strings
-
By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.
-
See Byte and Character Semantics for more details.
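The Input and Output Layers item above can be illustrated with a small round trip. This is only a sketch: the scratch file comes from File::Temp and the sample string is made up for illustration, and the byte count shown assumes an ASCII-based platform where the layer writes UTF-8.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file; any writable path would do.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Six characters: c, a, f, e-acute, space, WHITE SMILING FACE.
my $text = "caf\x{E9} \x{263A}";

# Writing through an encoding layer turns characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Reading without a decoding layer shows the encoded bytes.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Reading through the layer decodes the bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "raw length: %d, decoded length: %d\n",
    length($bytes), length($chars);    # raw length: 9, decoded length: 6
```

The raw read is longer because the two non-ASCII characters occupy two and three bytes respectively once encoded.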
Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.
In future, Perl-level operations will be expected to work with
characters rather than bytes.
However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
The bytes pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See the bytes manpage.
The utf8 pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See the utf8 manpage.
Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The bytes pragma should
be used to force byte semantics on Unicode data.
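As a rough illustration of the difference between the two semantics, the sketch below measures the same one-character string with and without the bytes pragma. The variable names are illustrative only, and the byte count assumes an ASCII-based platform where the internal encoding is UTF-8.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";            # one character, ordinal above 255

# Character semantics: length counts characters.
my $char_len = length $smiley;

my $byte_len;
{
    use bytes;                      # lexically scoped to this block
    # Byte semantics: length counts bytes of the internal encoding.
    $byte_len = length $smiley;
}

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;
# characters: 1, bytes: 3
```

Because the pragma is lexical, code outside the inner block keeps character semantics.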
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as ISO 8859-1 (Latin-1), even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding.
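The concatenation rule can be observed directly. The following is a minimal sketch with made-up variable names; it relies on utf8::is_utf8 only to show which strings are stored as Unicode internally, which is not something ordinary code normally needs to check.

```perl
use strict;
use warnings;

my $byte_str = "\xE9";          # a single byte; nothing above 255
my $uni_str  = "\x{263A}";      # a character above 255 forces Unicode storage

my $joined = $byte_str . $uni_str;

# The byte 0xE9 is read as Latin-1, so it becomes the character U+00E9.
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $joined;
print "@code_points\n";         # U+00E9 U+263A

printf "byte string marked as Unicode: %s\n",
    utf8::is_utf8($byte_str) ? 'yes' : 'no';    # no
printf "joined string marked as Unicode: %s\n",
    utf8::is_utf8($joined) ? 'yes' : 'no';      # yes
```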
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See the perluniintro manpage for more.
Character semantics have the following effects:
-
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or use utf8, the latter requires a BOM.)
Unicode characters can also be added to a string by using the \x{...}
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
\x{263A}. This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.
Additionally, if you
use charnames ':full';
you can use the \N{...} notation and put the official Unicode
character name within the braces, such as \N{WHITE SMILING FACE}.
-
If an appropriate encoding is specified (see the encoding manpage), identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.
-
Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.
-
Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. \w can be used to match a Japanese
ideograph, for instance.
-
Named Unicode properties, scripts, and block ranges may be used like
character classes via the \p{} "matches property" construct and
the \P{} negation, "doesn't match property".
See Unicode Character Properties for more details.
You can define your own character properties and use them
in the regular expression with the \p{} or \P{} construct.
See User-Defined Character Properties for more details.
-
The special pattern \X matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. \X is equivalent to
(?:\PM\pM*).
-
The tr/// operator translates characters instead of bytes. Note
that the tr///CU functionality has been removed. For similar
functionality see pack('U0', ...) and pack('C0', ...).
-
Case translation operators use the Unicode case translation tables
when character input is provided. Note that uc(), or \U in
interpolated strings, translates to uppercase, while ucfirst,
or \u in interpolated strings, translates to titlecase in languages
that make the distinction.
-
Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
chop(), chomp(), substr(), pos(), index(), rindex(),
sprintf(), write(), and length(). Operators that
specifically do not switch include vec(), pack(), and
unpack(). Operators that really don't care include
operators that treat strings as a bucket of bits such as sort(),
and operators dealing with filenames.
-
The pack()/unpack() letter C does not change, since it is often
used for byte-oriented formats. Again, think char in the C language.
There is a new U specifier that converts between Unicode characters
and code points. There is also a W specifier that is the equivalent of
chr/ord and properly handles character values even if they are above 255.
-
The chr() and ord() functions work on characters, similar to
pack("W") and unpack("W"), not pack("C") and
unpack("C"). pack("C") and unpack("C") are methods for
emulating byte-oriented chr() and ord() on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.
-
The bit string operators, & | ^ ~, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use ~ (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (~($x|$y) eq ~$x&~$y and ~($x&$y) eq ~$x|~$y)
will not hold. The reason for this mathematical faux pas is that
the complement cannot return both the 8-bit (byte-wide) bit
complement and the full character-wide bit complement.
-
lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
-
the case mapping is from a single Unicode character to another
single Unicode character, or
-
the case mapping is from a single Unicode character to more
than one Unicode character.
Things to do with locales (Lithuanian, Turkish, Azeri) do not work
since Perl does not understand the concept of Unicode locales.
See the Unicode Technical Report #21, Case Mappings, for more details.
But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
See User-Defined Case Mappings for more details.
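Several of the effects listed above can be checked in a few lines. This is a sketch rather than an exhaustive test; the sample characters (the smiley, the DZ digraph, and the "ff" ligature) are chosen here for illustration and do not come from the list itself.

```perl
use strict;
use warnings;
use charnames ':full';

my $smiley = "\x{263A}";
my $named  = "\N{WHITE SMILING FACE}";
my $same   = $smiley eq $named ? 1 : 0;      # both notations name one character

# Positions and lengths are counted in characters.
my $str    = "ab" . $smiley . "cd";
my $len    = length $str;                    # 5
my $pos    = index $str, $smiley;            # 2
my $middle = substr $str, 2, 1;              # the smiley itself

# "." matches one character; \X matches a base character plus its marks.
my $dot_ok = $smiley =~ /\A.\z/ ? 1 : 0;
my $x_ok   = "e\x{301}" =~ /\A\X\z/ ? 1 : 0;

# tr/// translates characters, not bytes.
(my $translated = $str) =~ tr/\x{263A}/*/;   # "ab*cd"

# chr/ord and pack/unpack "U" deal in code points.
my $ord        = ord $smiley;
my $packed     = pack 'U', 0x263A;
my ($unpacked) = unpack 'U', $smiley;

# uc gives uppercase, ucfirst gives titlecase where the two differ.
my $dz       = "\x{1C6}";                    # LATIN SMALL LETTER DZ WITH CARON
my $upper    = uc $dz;                       # U+01C4
my $title    = ucfirst $dz;                  # U+01C5
my $ligature = uc "\x{FB00}";                # one character maps to "FF"

printf "length=%d index=%d ord=U+%04X translated=%s\n",
    $len, $pos, $ord, $translated;
printf "uc=U+%04X ucfirst=U+%04X uc(ligature)=%s\n",
    ord($upper), ord($title), $ligature;
```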
Named Unicode properties, scripts, and block ranges may be used like
character classes via the \p{} "matches property" construct and
the \P{} negation, "doesn't match property".
For instance, \p{Lu} matches any character with the Unicode "Lu"
(Letter, uppercase) property, while \p{M} matches any character
with an "M" (mark--accents and such) property. Brackets are not
required for single letter properties, so \p{M} is equivalent to
\pM. Many predefined properties are available, such as
\p{Mirrored} and \p{Tibetan}.
The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". Latin-1 Supplement thus
becomes Latin1Supplement.
You can also use negation in both \p{} and \P{} by introducing a caret
(^) between the first brace and the property name: \p{^Tamil} is
equal to \P{Tamil}.
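The property syntax described above can be tried out as follows. This is a sketch only: count_matches is a hypothetical helper written for this example, and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: count the characters of $str that match $re.
sub count_matches {
    my ($str, $re) = @_;
    my $count = () = $str =~ /$re/g;
    return $count;
}

my $accented = "E\x{301}";     # "E" followed by COMBINING ACUTE ACCENT
my $mixed    = "ab\x{BA4}";    # two Latin letters and TAMIL LETTER TA

my $upper      = count_matches($accented, qr/\p{Lu}/);
my $upper_long = count_matches($accented, qr/\p{UppercaseLetter}/);
my $marks      = count_matches($accented, qr/\pM/);       # same as \p{M}
my $tamil      = count_matches($mixed,    qr/\p{Tamil}/);
my $not_tamil  = count_matches($mixed,    qr/\P{Tamil}/);
my $caret      = count_matches($mixed,    qr/\p{^Tamil}/);

print "Lu=$upper UppercaseLetter=$upper_long M=$marks\n";
print "Tamil=$tamil notTamil=$not_tamil caretTamil=$caret\n";
```

The last two counts agree because \p{^Tamil} and \P{Tamil} are two spellings of the same negated test.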
NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.
- General Category
-
Here are the basic Unicode General Category properties, followed by their
long form. You can use either; \p{Lu} and \p{UppercaseLetter},
for instance, are identical.
-
Short   Long
-
L       Letter
LC      CasedLetter
Lu      UppercaseLetter
Ll      LowercaseLetter
Lt      TitlecaseLetter
Lm      ModifierLetter
Lo      OtherLetter
-
M       Mark
Mn      NonspacingMark
Mc      SpacingMark
Me      EnclosingMark
-
N       Number
Nd      DecimalNumber
Nl      LetterNumber
No      OtherNumber
-
P       Punctuation
Pc      ConnectorPunctuation
Pd      DashPunctuation
Ps      OpenPunctuation
Pe      ClosePunctuation
Pi      InitialPunctuation
        (may behave like Ps or Pe depending on usage)
Pf      FinalPunctuation
        (may behave like Ps or Pe depending on usage)
Po      OtherPunctuation
-
S       Symbol
Sm      MathSymbol
Sc      CurrencySymbol
Sk      ModifierSymbol
So      OtherSymbol
-
Z       Separator
Zs      SpaceSeparator
Zl      LineSeparator
Zp      ParagraphSeparator
-
C       Other
Cc      Control
Cf      Format
Cs      Surrogate (not usable)
Co      PrivateUse
Cn      Unassigned
-
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
LC and L& are special cases, which are aliases for the set of
Ll, Lu, and Lt.
-
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. Cs is therefore not
supported.
- Bidirectional Character Types
-
Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:
-
Property   Meaning
-
L          Left-to-Right
LRE        Left-to-Right Embedding
LRO        Left-to-Right Override
R          Right-to-Left
AL         Right-to-Left Arabic
RLE        Right-to-Left Embedding
RLO        Right-to-Left Override
PDF        Pop Directional Format
EN         European Number
ES         European Number Separator
ET         European Number Terminator
AN         Arabic Number
CS         Common Number Separator
NSM        Non-Spacing Mark
BN         Boundary Neutral
B          Paragraph Separator
S          Segment Separator
WS         Whitespace
ON         Other Neutrals
-
For example, \p{BidiClass:R} matches characters that are normally
written right to left.
- Scripts
-
The script names which can be used by \p{...} and \P{...},
such as in \p{Latin} or \p{Cyrillic}, are as follows:
-
Arabic
Armenian
Bengali
Bopomofo
Buhid
CanadianAboriginal
Cherokee
Cyrillic
Deseret
Devanagari
Ethiopic
Georgian
Gothic
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hebrew
Hiragana
Inherited
Kannada
Katakana
Khmer
Lao
Latin
Malayalam
Mongolian
Myanmar
Ogham
OldItalic
Oriya
Runic
Sinhala
Syriac
Tagalog
Tagbanwa
Tamil
Telugu
Thaana
Thai
Tibetan
Yi
- Extended property classes
-
Extended property classes can supplement the basic
properties, defined by the PropList Unicode database:
-
ASCIIHexDigit
BidiControl
Dash
Deprecated
Diacritic
Extender
GraphemeLink
HexDigit
Hyphen
Ideographic
IDSBinaryOperator
IDSTrinaryOperator
JoinControl
LogicalOrderException
NoncharacterCodePoint
OtherAlphabetic
OtherDefaultIgnorableCodePoint
OtherGraphemeExtend
OtherLowercase
OtherMath
OtherUppercase
QuotationMark
Radical
SoftDotted
TerminalPunctuation
UnifiedIdeograph
WhiteSpace
-
and there are further derived properties:
-
Alphabetic    Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
Lowercase     Ll + OtherLowercase
Uppercase     Lu + OtherUppercase
Math          Sm + OtherMath
-
ID_Start      Lu + Ll + Lt + Lm + Lo + Nl
ID_Continue   ID_Start + Mn + Mc + Nd + Pc
-
Any           Any character
Assigned      Any non-Cn character (i.e. synonym for \P{Cn})
Unassigned    Synonym for \p{Cn}
Common        Any character (or unassigned code point)
              not explicitly assigned to a script
- Use of "Is" Prefix
-
For backward compatibility (with Perl 5.6), all properties mentioned
so far may have Is prepended to their name, so \P{IsLu}, for
example, is equal to \P{Lu}.
- Blocks
-
In addition to scripts, Unicode also defines blocks of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the Latin script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called Common.
-
For more about scripts, see UTR #24:
-
http://www.unicode.org/unicode/reports/tr24/
-
For more about blocks, see:
-
http://www.unicode.org/Public/UNIDATA/Blocks.txt
-
Block names are given with the In prefix. For example, the
Katakana block is referenced via \p{InKatakana}. The In
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that In always be used
for block tests to avoid confusion.
-
These block names are supported:
-
InAlphabeticPresentationForms
InArabic
InArabicPresentationFormsA
InArabicPresentationFormsB
InArmenian
InArrows
InBasicLatin
InBengali
InBlockElements
InBopomofo
InBopomofoExtended
InBoxDrawing
InBraillePatterns
InBuhid
InByzantineMusicalSymbols
InCJKCompatibility
InCJKCompatibilityForms
InCJKCompatibilityIdeographs
InCJKCompatibilityIdeographsSupplement
InCJKRadicalsSupplement
InCJKSymbolsAndPunctuation
InCJKUnifiedIdeographs
InCJKUnifiedIdeographsExtensionA
InCJKUnifiedIdeographsExtensionB
InCherokee
InCombiningDiacriticalMarks
InCombiningDiacriticalMarksforSymbols
InCombiningHalfMarks
InControlPictures
InCurrencySymbols
InCyrillic
InCyrillicSupplementary
InDeseret
InDevanagari
InDingbats
InEnclosedAlphanumerics
InEnclosedCJKLettersAndMonths
InEthiopic
InGeneralPunctuation
InGeometricShapes
InGeorgian
InGothic
InGreekExtended
InGreekAndCoptic
InGujarati
InGurmukhi
InHalfwidthAndFullwidthForms
InHangulCompatibilityJamo
InHangulJamo
InHangulSyllables
InHanunoo
InHebrew
InHighPrivateUseSurrogates
InHighSurrogates
InHiragana
InIPAExtensions
InIdeographicDescriptionCharacters
InKanbun
InKangxiRadicals
InKannada
InKatakana
InKatakanaPhoneticExtensions
InKhmer
InLao
InLatin1Supplement
InLatinExtendedA
InLatinExtendedAdditional
InLatinExtendedB
InLetterlikeSymbols
InLowSurrogates
InMalayalam
InMathematicalAlphanumericSymbols
InMathematicalOperators
InMiscellaneousMathematicalSymbolsA
InMiscellaneousMathematicalSymbolsB
InMiscellaneousSymbols
InMiscellaneousTechnical
InMongolian
InMusicalSymbols
InMyanmar
InNumberForms
InOgham
InOldItalic
InOpticalCharacterRecognition
InOriya
InPrivateUseArea
InRunic
InSinhala
InSmallFormVariants
InSpacingModifierLetters
InSpecials
InSuperscriptsAndSubscripts
InSupplementalArrowsA
InSupplementalArrowsB
InSupplementalMathematicalOperators
InSupplementaryPrivateUseAreaA
InSupplementaryPrivateUseAreaB
InSyriac
InTagalog
InTagbanwa
InTags
InTamil
InTelugu
InThaana
InThai
InTibetan
InUnifiedCanadianAboriginalSyllabics
InVariationSelectors
InYiRadicals
InYiSyllables
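The difference between scripts, blocks, the Is prefix, and bidirectional types can be seen with a few single-character tests. This is a sketch with sample characters picked for illustration; property names are spelled as in the lists above, and the result names are hypothetical.

```perl
use strict;
use warnings;

my $ka    = "\x{30AB}";    # KATAKANA LETTER KA
my $alef  = "\x{5D0}";     # HEBREW LETTER ALEF
my $digit = "7";
my $cap   = "A";

my %result = (
    # A script test and a block test can both succeed for one character.
    ka_in_katakana_script => ($ka =~ /\p{Katakana}/       ? 1 : 0),
    ka_in_katakana_block  => ($ka =~ /\p{InKatakana}/     ? 1 : 0),

    # Digits sit in the Basic Latin block but belong to Common, not Latin.
    digit_in_basic_latin  => ($digit =~ /\p{InBasicLatin}/ ? 1 : 0),
    digit_is_latin        => ($digit =~ /\p{Latin}/        ? 1 : 0),
    digit_is_common       => ($digit =~ /\p{Common}/       ? 1 : 0),
    cap_is_latin          => ($cap =~ /\p{Latin}/          ? 1 : 0),

    # The Is prefix is optional.
    cap_is_lu             => ($cap =~ /\p{Lu}/             ? 1 : 0),
    cap_is_islu           => ($cap =~ /\p{IsLu}/           ? 1 : 0),

    # Bidirectional character type.
    alef_is_rtl           => ($alef =~ /\p{BidiClass:R}/   ? 1 : 0),
    cap_is_rtl            => ($cap =~ /\p{BidiClass:R}/    ? 1 : 0),
);

printf "%-22s %d\n", $_, $result{$_} for sort keys %result;
```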
You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines can be defined in
any package. The user-defined properties can be used in the regular
expression \p and \P constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the \p or \P construct.
package main;
if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
package Lang;
if ($txt =~ /\p{IsForeign}+/) { ... }
Note that the effect is compile-time and immutable once defined.
The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:
-
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
-
Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.
-
Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.
-
Something to negate, prefixed "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.
-
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define
sub InKana {
return <<END;
3040\t309F
30A0\t30FF
END
}
Imagine that the here-doc end marker is at the beginning of the line.
Now you can use \p{InKana} and \P{InKana}.
You could also have used the existing block property names:
sub InKana {
return <<'END';
+utf8::InHiragana
+utf8::InKatakana
END
}
Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:
sub InKana {
return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}
The negation is useful for defining (surprise!) negated classes.
sub InNotKana {
return <<'END';
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
END
}
Intersection is useful for getting the common characters matched by
two (or more) classes.
sub InFooAndBar {
return <<'END';
+main::Foo
&main::Bar
END
}
It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).
A final note on the user-defined property tests: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.
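Putting the pieces of this section together, the sketch below defines two of the properties shown above and tests them. The name InAssignedKana and the two sample characters are assumptions made for this example; U+3040 is used because it is an unassigned code point at the start of the Hiragana block.

```perl
use strict;
use warnings;

# Raw block ranges as tab-separated hexadecimal code points.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# The same blocks via existing properties, minus unassigned code points.
# (InAssignedKana is a hypothetical name chosen for this sketch.)
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # unassigned, but inside the Hiragana block
my $latin_a    = "a";

my %result = (
    a_in_kana           => ($hiragana_a =~ /\p{InKana}/         ? 1 : 0),
    a_in_assigned       => ($hiragana_a =~ /\p{InAssignedKana}/ ? 1 : 0),
    reserved_in_kana    => ($reserved   =~ /\p{InKana}/         ? 1 : 0),
    reserved_in_assgnd  => ($reserved   =~ /\p{InAssignedKana}/ ? 1 : 0),
    latin_not_in_kana   => ($latin_a    =~ /\P{InKana}/         ? 1 : 0),
);

printf "%-20s %d\n", $_, $result{$_} for sort keys %result;
```

The reserved code point matches the raw-range property but not the one that subtracts IsCn, which is the distinction the section draws between block ranges and allocated characters.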
You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
properties: to define subroutines in the main package
with names like ToLower (for lc() and lcfirst()), ToTitle (for
the first character in ucfirst()), and ToUpper (for uc(), and the
rest of the characters in ucfirst()).
The string returned by the subroutines must now contain three
hexadecimal numbers separated by tabs: the start of the source
range, the end of the source range, and the start of the destination range.
For example:
sub ToUpper {
return <<END;
0061\t0063\t0041
END
}
defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", and "C"; all other characters will remain
unchanged.
If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but keep both tab characters.
For example:
sub ToLower {
return <<END;
0041\t\t0061
END
}
defines an lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.
(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
$Config{privlib}/unicore/To/. The mapping data is returned as
a here-document, and the utf8::ToSpecFoo mappings are special exception
mappings derived from $Config{privlib}/unicore/SpecialCasing.txt.
The Digit and Fold mappings that one can see in the directory
are not directly user-accessible; one can use either the
Unicode::UCD module or just match case-insensitively (that's when
the Fold mapping is used).
A final note on the user-defined case mappings: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.
See the Encode manpage.
The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
Perl 5.8.0).
-
Level 1 - Basic Unicode Support
2.1 Hex Notation            - done      [1]
    Named Notation          - done      [2]
2.2 Categories              - done      [3][4]
2.3 Subtraction             - MISSING   [5][6]
2.4 Simple Word Boundaries  - done      [7]
2.5 Simple Loose Matches    - done      [8]
2.6 End of Line             - MISSING   [9][10]
|