Continuing Perl 5.8.0 Problems
by Steve Hay
Oct 2 2002 9:46AM
Hi,
Continuing my investigations into the problems that I've previously
reported concerning XML::Parser/XML::DOM under Perl 5.8.0, I've now
produced a very short Perl script that uses neither module but
reproduces the error reported by XML::DOM's XmlUtf8Decode() function.
The following program builds a string of twelve bytes that can be
interpreted as four three-byte UTF-8 characters, and then attempts to do
what XML::DOM::encodeText() does, using the XmlUtf8Decode() that I've
copied from XML::DOM:
# --- START SCRIPT ---
use strict;
use warnings;
use bytes;
my $str = chr(227) . chr(131) . chr(155) .
          chr(227) . chr(131) . chr(188) .
          chr(227) . chr(131) . chr(158) .
          chr(227) . chr(131) . chr(188);
$str =~ s/([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)/XmlUtf8Decode($1)/egs;
printf STDERR "str = %s\n", $str;
sub XmlUtf8Decode
{
    my ($str, $hex) = @_;
    my $len = length ($str);
    my $n;
    printf STDERR "Decoding: %vd\n", $str;
    if ($len == 2)
    {
        my @n = unpack "C2", $str;
        $n = (($n[0] & 0x3f) << 6) + ($n[1] & 0x3f);
    }
    elsif ($len == 3)
    {
        my @n = unpack "C3", $str;
        $n = (($n[0] & 0x1f) << 12) + (($n[1] & 0x3f) << 6) +
             ($n[2] & 0x3f);
    }
    elsif ($len == 4)
    {
        my @n = unpack "C4", $str;
        $n = (($n[0] & 0x0f) << 18) + (($n[1] & 0x3f) << 12) +
             (($n[2] & 0x3f) << 6) + ($n[3] & 0x3f);
    }
    elsif ($len == 1) # just to be complete...
    {
        $n = ord ($str);
    }
    else
    {
        # croak "bad value [$str] for XmlUtf8Decode";
        die "bad value [" . sprintf('%vd', $str) . "] for XmlUtf8Decode\n";
    }
    $hex ? sprintf ("&#x%x;", $n) : "&#$n;";
}
# --- END SCRIPT ---
As it stands, this program runs fine, outputting:
Decoding: 227.131.155
Decoding: 227.131.188
Decoding: 227.131.158
Decoding: 227.131.188
str = &#12507;&#12540;&#12510;&#12540;
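As a sanity check on the arithmetic, here is a minimal sketch that decodes the four three-byte groups by hand. The helper name decode3() is made up for this illustration and is not part of XML::DOM; it uses the same masks and shifts as the three-byte branch of XmlUtf8Decode() above. The four code points come out as 12507, 12540, 12510 and 12540 (U+30DB, U+30FC, U+30DE, U+30FC, which render as ホーマー).

```perl
use strict;
use warnings;

# decode3() is a hypothetical helper for this sketch only.
# It takes the three byte values of one UTF-8 sequence and returns the
# code point, using the same masks as XmlUtf8Decode()'s 3-byte branch.
sub decode3 {
    my @n = @_;
    return (($n[0] & 0x1f) << 12) + (($n[1] & 0x3f) << 6) + ($n[2] & 0x3f);
}

# The twelve bytes from the test script, grouped three at a time.
my @groups = (
    [227, 131, 155],
    [227, 131, 188],
    [227, 131, 158],
    [227, 131, 188],
);

# Build the same numeric character references the script produces.
my $out = join '', map { sprintf '&#%d;', decode3(@$_) } @groups;
print "str = $out\n";   # str = &#12507;&#12540;&#12510;&#12540;
```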
However, inserting a call to Encode::is_utf8() into
XML::DOM::encodeText(), we find that the $str that it is working on when
it fails has the UTF-8 flag set. (This is because the XML::Parser
module's Expat.xs file uses the SvUTF8_on macro if it is defined.)
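To see what the flag changes, here is a minimal sketch (not taken from XML::DOM or XML::Parser) that builds a shorter six-byte string and inspects it before and after Encode::_utf8_on(). It deliberately leaves out the file-level "use bytes" so that length() reports characters; the byte count is taken inside a small block where "use bytes" is in effect. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Six bytes that form two three-byte UTF-8 sequences.
my $str = join '', map { chr } 227, 131, 155, 227, 131, 188;

# Built from chr() values below 256, the string starts out unflagged,
# so each byte counts as one character.
my $flag_before  = Encode::is_utf8($str) ? 1 : 0;
my $chars_before = length $str;

# Turn the flag on without touching the underlying bytes.
Encode::_utf8_on($str);

my $flag_after  = Encode::is_utf8($str) ? 1 : 0;
my $chars_after = length $str;                      # character view
my $bytes_after = do { use bytes; length $str };    # byte view

print "before: flag=$flag_before length=$chars_before\n";   # before: flag=0 length=6
print "after:  flag=$flag_after length=$chars_after bytes=$bytes_after\n";
# after:  flag=1 length=2 bytes=6
```

The buffer is identical in both cases; only the interpretation differs, which is why byte-oriented code can be surprised by a flagged string.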
So to reproduce the XML::DOM module's problem we must set the UTF-8 flag
on our $str as well. Inserting the following two lines:
require Encode;
Encode::_utf8_on($str);
after the definition of $str in the program above and then re-running it
now produces the output:
Decoding: 227.131.155
Decoding: 131.158.227.131.188
bad value [131.158.227.131.188] for XmlUtf8Decode
exactly like the errors reported by XML::DOM's test suite.
It would appear that the s///egs has matched the wrong parts of the $str
-- XmlUtf8Decode() has been called with the value 131.158.227.131.188,
which should never happen.
The following even shorter program demonstrates the same problem:
# --- START SCRIPT ---
use strict;
use warnings;
use bytes;
my $str = chr(227) . chr(131) . chr(155) .
          chr(227) . chr(131) . chr(188) .
          chr(227) . chr(131) . chr(158) .
          chr(227) . chr(131) . chr(188);
require Encode;
Encode::_utf8_on($str);
$str =~ s/([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)/dot()/egs;
printf STDERR "str = %vd\n", $str;
sub dot { '.' }
# --- END SCRIPT ---
This outputs:
str = 46.227.131.188.227.46
instead of the expected:
str = 46.46.46.46
Again, removing the Encode::_utf8_on() call "fixes" it.
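In XML::DOM the flag is set elsewhere, so simply not setting it is not an option there. The following is only a sketch of the equivalent experiment, on the assumption that the substitution is meant to work on raw octets: it sets the flag as above, then clears it again with Encode::_utf8_off() before the s///egs. With the flag cleared the string is plain bytes once more and the substitution yields four dots. Whether this is an appropriate fix for XML::DOM itself is an open question.

```perl
use strict;
use warnings;
use bytes;
use Encode ();

my $str = chr(227) . chr(131) . chr(155) .
          chr(227) . chr(131) . chr(188) .
          chr(227) . chr(131) . chr(158) .
          chr(227) . chr(131) . chr(188);

# Reproduce the state XML::DOM sees: same bytes, UTF-8 flag on.
Encode::_utf8_on($str);

# Assumption for this sketch: clear the flag again so that the
# byte-oriented pattern below operates on an unflagged string.
Encode::_utf8_off($str);

$str =~ s/([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)/dot()/egs;

my $result = sprintf '%vd', $str;
print STDERR "str = $result\n";   # str = 46.46.46.46

sub dot { '.' }
```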
Even more bizarrely, removing the /e modifier from the s///egs "fixes"
it too (the replacement is then the literal text dot() rather than a
call): doing so, and also changing the '%vd' to '%s' in the printf(),
outputs:
str = dot()dot()dot()dot()
even with the Encode::_utf8_on() call left in!!!
Does anybody understand what is going on here and what I need to do to
XML::DOM to fix it properly?
- Steve