Lingua::JA::Sort::JIS 0.04
Perl module, part of the CPAN distribution Lingua-JA-Sort-JIS 0.04.
Lingua::JA::Sort::JIS -
a Perl module that compares and sorts strings
in the UTF-8 encoding
using the collation of Japanese character strings
defined by JIS (Japanese Industrial Standards).
use Lingua::JA::Sort::JIS qw(jsort);
@result = jsort(
qw/ ãã³ã ããã ã²ãã ãã¬ã ããã¿ ãã ã㬠ãã³ ã©ã¤ãªã³ /
);
# result: qw/ ã㬠ããã ãã ãã¬ã ãã³ ããã¿ ãã³ã ã²ãã ã©ã¤ãªã³ /
This module provides some functions to compare and sort strings
in the UTF-8 encoding (EUC-JP or Shift_JIS are NOT permitted!)
using the collation of Japanese character strings.
This module is an implementation of JIS X 4061-1996 and
the collation rules are based on that standard.
The following criteria are considered in order
until the collation order is determined.
By default, Levels 1 to 4 are applied and Level 5 is ignored
(as JIS does).
- Level 1: alphabetic ordering.
-
A character class that appears earlier in the following list is smaller:
Space characters, Symbols and Punctuations, Digits, Greek Letters,
Cyrillic Letters, Latin letters, Kana letters, ( Kanji ideographs ),
and Geta mark.
Within a class, alphabetic letters are collated alphabetically;
kana letters are collated in AIUEO order (the Gozyuon order).
For Kanji, see Kanji Classes.
Other characters are collated as defined.
Characters not defined as a collation element are
ignored and skipped on collation.
NB: In particular, almost all letters with a diacritical mark
are NOT defined in this implementation,
except Latin vowels with macron or circumflex,
because they are not used in Japanese contexts.
- Level 2: diacritic ordering.
-
Among the Latin vowels, the order is as follows:
a vowel without a diacritical mark, with macron, then with circumflex.
Among kana, the order is as follows:
a voiceless kana, the voiced, then the semi-voiced (if it exists)
(e.g. Ka before Ga; Ha before Ba before Pa).
- Level 3: case ordering.
-
A small Latin letter is less than the corresponding capital.
Among kana, the order is as follows:
a replaced PROLONGED SOUND MARK (U+30FC);
a small kana;
a replaced ITERATION MARK (U+309D, U+309E, U+30FD or U+30FE);
then a normal kana.
For example, the following are in ascending order:
Katakana A + PROLONGED SOUND MARK,
Katakana A + Small Katakana A,
Katakana A + ITERATION MARK,
Katakana A + Katakana A.
(see NOTE about the replacement)
- Level 4: variant ordering.
-
Hiragana is less than katakana.
- Level 5: width ordering.
-
A character that belongs to the block Halfwidth and Fullwidth Forms
is greater than the corresponding normal character.
NB: According to the JIS standard, level 5 should be ignored.
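The level-by-level comparison described above can be sketched in plain Perl without the module. This is only an illustration: the weight table is a tiny made-up one (it is not the module's collation element table), it covers just three levels, and `level_cmp` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical weights: [primary, secondary, tertiary] per character.
# Primary is the letter, secondary is unused here (always 0),
# tertiary is the case (small = 0, capital = 1).
my %weight = (
    a => [1, 0, 0], A => [1, 0, 1],
    b => [2, 0, 0], B => [2, 0, 1],
);

# Compare two strings level by level, up to $max_level (1 to 3).
# Characters without a weight are skipped, like undefined elements.
sub level_cmp {
    my ($x, $y, $max_level) = @_;
    my @kx = grep { defined } map { $weight{$_} } split //, $x;
    my @ky = grep { defined } map { $weight{$_} } split //, $y;
    my $n  = @kx > @ky ? @kx : @ky;
    for my $level (0 .. $max_level - 1) {
        for my $i (0 .. $n - 1) {
            # A missing element sorts before any present one
            my $wx = $i < @kx ? $kx[$i][$level] : -1;
            my $wy = $i < @ky ? $ky[$i][$level] : -1;
            return $wx <=> $wy if $wx != $wy;
        }
    }
    return 0;
}

print level_cmp("ab", "AB", 2), "\n";   # prints 0  (equal up to level 2)
print level_cmp("ab", "AB", 3), "\n";   # prints -1 (case decides at level 3)
print level_cmp("aB", "b",  3), "\n";   # prints -1 (level 1 decides first)
```

A lower level always wins over a higher one: case is consulted only when the strings are equal at every earlier level.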
There are three kanji classes:
- Class 1: the 'saisho' (minimum) kanji class
-
It comprises five kanji-like chars,
i.e. U+3003, U+3005, U+4EDD, U+3006, U+3007.
Any kanji other than U+4EDD is ignored on collation.
- Class 2: the 'kihon' (basic) kanji class
-
It comprises JIS levels 1 and 2 kanji in addition to
the minimum kanji class. Sorted in the JIS order.
Any kanji except those defined by JIS X 0208 is ignored on collation.
- Class 3: the 'kakucho' (extended) kanji class
-
All the CJK Unified Ideographs in addition to
the minimum kanji class. Sorted in the Unicode order.
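As a rough illustration of which characters each kanji class accepts, the following sketch tests membership without using the module. Everything in it is an assumption made for illustration: `in_kanji_class` and `is_jis0208_kanji` are hypothetical helpers, the JIS X 0208 test relies on the core Encode module's euc-jp table (level 1 and 2 kanji encode to two bytes with a lead byte of 0xB0 to 0xF4), and the extended class is approximated by the block range U+4E00 to U+9FFF. The module's own tables may differ.

```perl
use strict;
use warnings;
use Encode ();   # core; Encode::JP supplies the euc-jp table

# The five kanji-like characters of the minimum class
my %minimum = map { $_ => 1 } 0x3003, 0x3005, 0x4EDD, 0x3006, 0x3007;

# True if the character is a JIS X 0208 level 1 or 2 kanji (assumption:
# two EUC-JP bytes with a lead byte in 0xB0..0xF4, i.e. rows 16 to 84).
sub is_jis0208_kanji {
    my ($ch) = @_;
    my $copy  = $ch;   # encode() with a CHECK value may modify its input
    my $bytes = Encode::encode('euc-jp', $copy, Encode::FB_QUIET());
    return 0 unless length($bytes) == 2;
    my $lead = ord $bytes;
    return ($lead >= 0xB0 && $lead <= 0xF4) ? 1 : 0;
}

# Hypothetical helper: is $ch a kanji collation element in class $class?
sub in_kanji_class {
    my ($class, $ch) = @_;
    my $cp = ord $ch;
    return 1 if $minimum{$cp};            # in every class
    if ($class == 2) {
        return is_jis0208_kanji($ch);
    }
    if ($class == 3) {
        return ($cp >= 0x4E00 && $cp <= 0x9FFF) ? 1 : 0;
    }
    return 0;                             # class 1: nothing else
}

print in_kanji_class(1, "\x{4E9C}"), "\n";   # prints 0
print in_kanji_class(2, "\x{4E9C}"), "\n";   # prints 1
```

The sketch takes decoded character strings, whereas the module itself works on UTF-8 encoded strings.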
$jis = Lingua::JA::Sort::JIS->new()
-
$jis = Lingua::JA::Sort::JIS->new(LEVEL)
-
$jis = Lingua::JA::Sort::JIS->new(LEVEL, KANJI CLASS)
-
$jis = Lingua::JA::Sort::JIS->new(CODE REF, LEVEL, KANJI CLASS)
-
Constructs an instance.
The collation level is specified as a number
between 1 and 5. If omitted, level 4 is applied.
The kanji class is specified as a number between 1 and 3.
If omitted, class 2 is applied.
If a coderef is specified as the first argument,
strings given to a collating method are converted by the coderef
before making collating keys.
For example, if you want to ignore PROLONGED SOUND MARK on collation,
use Lingua::JA::Sort::JIS;
$jis = Lingua::JA::Sort::JIS->new(
sub { my $str = shift; $str =~ s/\xE3\x83\xBC//g; $str; } # U+30FC in UTF-8
);
@sorted = $jis->jsort(@strings); # utf-8 encoded
If you want to collate EUC-JP-encoded strings,
give the constructor a coderef converting EUC-JP to UTF-8.
use Lingua::JA::Sort::JIS;
use Jcode;
$euc = Lingua::JA::Sort::JIS->new(
sub {Jcode->new($_[0], 'euc')->utf8},
);
@sorted_euc_jp_strings = $euc->jsort(@euc_jp_strings);
$jis->jsort(LIST)
-
Sorts a list of strings in the UTF-8 encoding
$jis->jcmp($a, $b)
-
Japanese Collation version of the
cmp operator.
It returns 1 ($a is greater than $b),
0 ($a is equal to $b),
or -1 ($a is less than $b).
jsort(LIST)
-
jsort(CODE REF, LIST)
-
Sorts a list of strings in the UTF-8 encoding
(the default collation level and kanji class are used,
so jsort() called without an object is identical to bsort()).
If a coderef is specified as the first argument,
strings given to a collating method are converted by the coderef
before making collating keys.
For example, if you want to collate Shift_JIS-encoded strings,
do as follows.
use Jcode;
use Lingua::JA::Sort::JIS qw(jsort);
$sjis_to_utf8 = sub {Jcode->new($_[0], 'sjis')->utf8};
@sorted = jsort $sjis_to_utf8, @not_sorted;
msort(LIST)
-
msort(CODE REF, LIST)
-
Sorts a list of strings in the UTF-8 encoding
(the collation level is 4 and the kanji class is 1,
m: minimum).
bsort(LIST)
-
bsort(CODE REF, LIST)
-
Sorts a list of strings in the UTF-8 encoding
(the collation level is 4 and the kanji class is 2,
b: basic).
xsort(LIST)
-
xsort(CODE REF, LIST)
-
Sorts a list of strings in the UTF-8 encoding
(the collation level is 4 and the kanji class is 3,
x: extended).
fsort(LIST)
-
fsort(CODE REF, LIST)
-
Sorts a list of strings in the UTF-8 encoding
(the collation level is 5 and the kanji class is 2,
f: fullwidth).
jcmp( [ CODEREF ], $a, $b, [ LEVEL, KANJI CLASS ])
-
Japanese Collation version of the cmp operator.
It returns 1 ($a is greater than $b),
0 ($a is equal to $b),
or -1 ($a is less than $b).
The LEVEL (collation level) is specified as a number
between 1 and 5. If omitted, level 4 is applied.
The KANJI CLASS (kanji class) is specified as a number between 1 and 3.
If omitted, class 2 is applied.
If CODE REF is specified as the first argument,
strings given to a collating method are converted by the coderef
before making collating keys.
The CODE REF, LEVEL and KANJI CLASS can be omitted
if not necessary.
e.g. jcmp("perl", "Perl") returns -1
and jcmp("perl", "Perl", 2) returns 0,
since "perl" is less than "Perl" at the tertiary and quaternary levels
but equal to it at the secondary level.
karr([ CODE REF ], STRING, [ KANJI CLASS ] )
-
kcmp(KEY ARRAY, KEY ARRAY, [ LEVEL ])
-
These functions allow you to do the Schwartzian transform.
karr() makes KEY ARRAY from STRING.
kcmp() returns
1 (the first KEY ARRAY is greater than the second KEY ARRAY),
0 (the first KEY ARRAY is equal to the second KEY ARRAY),
or -1 (the first KEY ARRAY is less than the second KEY ARRAY).
The CODE REF, LEVEL and the KANJI CLASS
can be omitted if not necessary.
The following example sorts by "yomi-hyoki" collation, in which
"yomi" (the pronunciation) is used as the first sorting key, and
"hyoki" (the spelling) is used as the second sorting key.
- by OOP
-
use Lingua::JA::Sort::JIS;
$jis = Lingua::JA::Sort::JIS->new();
foreach(ysort(@data)){
print "@$_\n";
}
sub ysort {
map { $_->[0] }
sort{
$jis->kcmp($a->[1], $b->[1]) ||
$jis->kcmp($a->[2], $b->[2])
}
map { [$_, $jis->karr($_->[1]),
$jis->karr($_->[0]) ] } @_;
}
- by non-OOP
-
use Lingua::JA::Sort::JIS qw(kcmp karr);
foreach(ysort(@data)){
print "@$_\n";
}
sub ysort {
map { $_->[0] }
sort{ kcmp($a->[1], $b->[1]) ||
kcmp($a->[2], $b->[2]) }
map { [$_, karr($_->[1]), karr($_->[0]) ] } @_;
}
- Definition of @data in this example
-
@data = (
[qw/ å°å±± ããã¾ (æ ª)ã»ãã»ã /],
[qw/ é·ç° ãªãã å¸å¹åäº /],
[qw/ ç°ä¸ ããªã âÃç©ç£ /],
[qw/ é´æ¨ ããã ï¼ï¼ç²¾æ© /],
[qw/ å°å¶ ããã¾ ï¼ï¼æ°´ç£ /],
[qw/ åå³¶ ããã¾ ï¼ï¼å /],
[qw/ é·ç° ããã ï¼ï¼éè¡ /],
[qw/ å°å±± ããã¾ ï¼ ï¼ é»é /],
[qw/ å°å³¶ ãã㾠¥¥ç¾è²¨åº /],
[qw/ å±±ç° ãã¾ã ï¼ï¼é£å /],
[qw/ æ°¸ç° ãªãã ï¼ï¼è£½è¬ /],
);
- Result
-
é·ç° ããã ï¼ï¼éè¡
å°å±± ããã¾ ï¼ ï¼ é»é
åå³¶ ããã¾ ï¼ï¼å
å°å³¶ ãã㾠¥¥ç¾è²¨åº
å°å¶ ããã¾ ï¼ï¼æ°´ç£
å°å±± ããã¾ (æ ª)ã»ãã»ã
é´æ¨ ããã ï¼ï¼ç²¾æ©
ç°ä¸ ããªã âÃç©ç£
æ°¸ç° ãªãã ï¼ï¼è£½è¬
é·ç° ãªãã å¸å¹åäº
å±±ç° ãã¾ã ï¼ï¼é£å
getorder()
-
In the list context, it returns the collation element hash;
otherwise, it returns the reference of that hash.
In the collation element hash, each key is
the collation element string and each value is
the anonymous array with 5 elements.
You can manipulate the collation element hash as follows.
my $order = getorder();
# delete 'X' from the collation element hash
delete $order->{'X'};
# swap the collation order between 'b' and 'B';
@$order{'B', 'b'} = @$order{'b', 'B'};
# add a new collation element HIRAGANA LETTER VU;
my $hira_vu = "\xE3\x82\x94";
my $kata_vu = "\xE3\x83\xB4";
$order->{$hira_vu} = [ @{ $order->{$kata_vu} } ];
-- $order->{$hira_vu}[3];
# make HIRAGANA VU quaternary less than KATAKANA VU.
RFC1345 UCS
[*5] U+309D HIRAGANA ITERATION MARK
[+5] U+309E HIRAGANA VOICED ITERATION MARK
[-6] U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
[*6] U+30FD KATAKANA ITERATION MARK
[+6] U+30FE KATAKANA VOICED ITERATION MARK
To represent Japanese characters,
RFC 1345 Mnemonic characters enclosed by brackets
are used below.
When replaced, these characters are equal to the replacing kana
at the secondary level, but not at the tertiary level.
- KATAKANA-HIRAGANA PROLONGED SOUND MARK
-
The PROLONGED SOUND MARK is replaced by the normal vowel or nasal
katakana corresponding to the preceding kana, if one exists.
eg. [Ka][-6] to [Ka][A6]
[bi][-6] to [bi][I6]
[Pi][YU][-6] to [Pi][YU][U6]
[N6][-6] to [N6][N6]
- HIRAGANA- and KATAKANA ITERATION MARKs
-
The ITERATION MARKs (VOICELESS) are replaced
by the normal kana corresponding to the preceding kana, if one exists.
eg. [Ka][*6] to [Ka][Ka]
[Do][*5] to [Do][to]
[n5][*5] to [n5][n5]
[Pu][*6] to [Pu][Hu]
[Pi][YU][*6] to [Pi][YU][Yu]
- HIRAGANA- and KATAKANA VOICED ITERATION MARKs
-
The VOICED ITERATION MARKs are replaced by the voiced kana
corresponding to the preceding kana, if one exists.
eg. [ha][+5] to [ha][ba]
[Pu][+5] to [Pu][bu]
[Ko][+6] to [Ko][Go]
[U6][+6] to [U6][Vu]
- Cases of no replacement
-
Otherwise, no replacement occurs, in particular when
one of these marks follows any character other than kana.
Marks that are not replaced are primary
greater than any kana (see "Collate.txt").
eg. CJK followed by PROLONGED SOUND MARK
DIGIT followed by ITERATION
[A6][+6] ([A6] has no voiced variant)
- Example
-
For example, the Japanese string [Pa][-6][Ru] (the spelling of Perl in Japanese)
has three collation elements: KATAKANA PA,
PROLONGED SOUND MARK replaced by KATAKANA A, and KATAKANA RU.
[Pa][-6][Ru] is converted to [Pa][A6][Ru] by replacement. It is:
primary equal to [ha][a5][ru];
secondary equal to [pa][a5][ru], and greater than [ha][a5][ru];
tertiary equal to [pa][-6][ru], and less than [Pa][A6][Ru];
quaternary greater than [pa][-6][ru].
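The replacement rules in this note can be sketched in plain Perl. This is an illustrative approximation and not the module's code: the helper names `prolonged_kana`, `iteration_kana` and `replace_marks` are hypothetical, the sketch works on decoded character strings rather than UTF-8 bytes, and it leans on two core modules (the Unicode character names from charnames to find a kana's final sound, and Unicode::Normalize to strip or add the voiced sound mark).

```perl
use strict;
use warnings;
use charnames ();                      # core: Unicode character names
use Unicode::Normalize qw(NFC NFD);    # core: (de)composition of voiced kana

my $PROLONGED = "\x{30FC}";
# Katakana vowel or nasal, keyed by the last letter of a kana's name
my %kata_for = (A => "\x{30A2}", I => "\x{30A4}", U => "\x{30A6}",
                E => "\x{30A8}", O => "\x{30AA}", N => "\x{30F3}");
# Iteration mark => [ is_katakana, is_voiced ]
my %iter = ("\x{309D}" => [0, 0], "\x{309E}" => [0, 1],
            "\x{30FD}" => [1, 0], "\x{30FE}" => [1, 1]);
# Small hiragana whose normal-sized form is the next code point
my %small = map { $_ => 1 } 0x3041, 0x3043, 0x3045, 0x3047, 0x3049,
                            0x3063, 0x3083, 0x3085, 0x3087, 0x308E;

# Vowel (or nasal) katakana for a PROLONGED SOUND MARK after $prev
sub prolonged_kana {
    my ($prev) = @_;
    my $name = charnames::viacode(ord($prev)) || '';
    # e.g. "KATAKANA LETTER SMALL YU" ends in U
    return $name =~ /^(?:HIRAGANA|KATAKANA) LETTER (?:SMALL )?[A-Z]*([AIUEON])$/
        ? $kata_for{$1} : undef;
}

# Kana for an iteration mark after $prev, or undef if none applies
sub iteration_kana {
    my ($prev, $mark) = @_;
    my ($kata, $voiced) = @{ $iter{$mark} };
    # Drop any (semi-)voiced sound mark from the preceding kana
    (my $base = NFD($prev)) =~ s/[\x{3099}\x{309A}]//g;
    return undef unless length($base) == 1;
    my $cp = ord($base);
    $cp -= 0x60 if $cp >= 0x30A1 && $cp <= 0x30F4;       # katakana -> hiragana
    return undef unless $cp >= 0x3041 && $cp <= 0x3093;  # not a plain kana
    $cp += 1    if $small{$cp};                          # small -> normal size
    $cp += 0x60 if $kata;                                # script follows the mark
    return chr($cp) unless $voiced;
    my $v = NFC(chr($cp) . "\x{3099}");                  # try to add a voiced mark
    return length($v) == 1 ? $v : undef;                 # e.g. A has no voiced form
}

# Apply both rules left to right over a decoded character string
sub replace_marks {
    my ($str) = @_;
    my $out = '';
    for my $ch (split //, $str) {
        if (length $out) {
            my $prev = substr($out, -1);
            my $new  = $ch eq $PROLONGED ? prolonged_kana($prev)
                     : $iter{$ch}        ? iteration_kana($prev, $ch)
                     :                     undef;
            $ch = $new if defined $new;
        }
        $out .= $ch;
    }
    return $out;
}

# [Pa][-6][Ru] becomes [Pa][A6][Ru]
my $perl = replace_marks("\x{30D1}\x{30FC}\x{30EB}");
print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $perl), "\n";
# prints: U+30D1 U+30A2 U+30EB
```

A mark that cannot be replaced (after a digit, a kanji, or a kana with no voiced form) is left in place, matching the "no replacement" cases listed above. How the module treats a mark that follows another replaced mark is not stated here; the sketch simply uses the already replaced character as the preceding kana.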
[according to the article 6.2, JIS X 4061]
(1) charset: UTF-8.
(2) No limit of the number of characters in the string considered
to collate.
(3) No character class is added.
(4) The following characters are added as collation elements.
IDEOGRAPHIC SPACE in the space class.
ACUTE ACCENT, GRAVE ACCENT, DIAERESIS, CIRCUMFLEX ACCENT,
MACRON, HORIZONTAL BAR, EN DASH, TILDE, PARALLEL TO
in the class of descriptive symbols.
APOSTROPHE, QUOTATION MARK in the class of parentheses.
HYPHEN-MINUS in the class of mathematical symbols.
(5) Collation of Latin alphabets with macron and with circumflex
is supported.
(6) Selected kanji class:
the minimum kanji class (Five kanji-like chars).
the basic kanji class (Levels 1 and 2 kanji, JIS).
the extended kanji class (CJK Unified Ideographs).
Tomoyuki SADAHIRO
bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
- JIS X 4061 [Collation of Japanese character strings]
- JIS X 0208 [7-bit and 8-bit double byte coded Kanji sets
for information interchange]
- JIS X 0221 [Information technology - Universal Multiple-Octet Coded
Character Set (UCS) - part 1 : Architecture and Basic Multilingual Plane].
It is translated from ISO/IEC 10646-1 and introduced into JIS.
- Japanese Standards Association (access to JIS)
http://www.jsa.or.jp/
- RFC 1345 [Character Mnemonics & Character Sets]