perl-xml
Continuing Perl 5.8.0 Problems
by Steve Hay
Oct 2 2002 9:46AM messages near this date
Hi,

Continuing my investigations into the problems that I've previously 
reported concerning XML::Parser/XML::DOM under Perl 5.8.0, I've now 
produced a very short Perl script that uses neither module but 
reproduces the error reported by XML::DOM's XmlUtf8Decode() function.

The following program builds a string containing twelve bytes that could 
be interpreted as four UTF-8 characters, and then attempts to do what 
XML::DOM::encodeText() does, using the XmlUtf8Decode() that I've 
cut-and-pasted from XML::DOM:-

# --- START SCRIPT ---
use strict;
use warnings;
use bytes;

my $str = chr(227) . chr(131) . chr(155) .
          chr(227) . chr(131) . chr(188) .
          chr(227) . chr(131) . chr(158) .
          chr(227) . chr(131) . chr(188);

$str =~ s/([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)/XmlUtf8Decode($1)/egs;

printf STDERR "str = %s\n", $str;

sub XmlUtf8Decode
{
    my ($str, $hex) = @_;
    my $len = length ($str);
    my $n;
    printf STDERR "Decoding: %vd\n", $str;

    if ($len == 2)
    {
        my @n = unpack "C2", $str;
        $n = (($n[0] & 0x3f) << 6) + ($n[1] & 0x3f);
    }
    elsif ($len == 3)
    {
        my @n = unpack "C3", $str;
        $n = (($n[0] & 0x1f) << 12) + (($n[1] & 0x3f) << 6) +
            ($n[2] & 0x3f);
    }
    elsif ($len == 4)
    {
        my @n = unpack "C4", $str;
        $n = (($n[0] & 0x0f) << 18) + (($n[1] & 0x3f) << 12) +
            (($n[2] & 0x3f) << 6) + ($n[3] & 0x3f);
    }
    elsif ($len == 1)    # just to be complete...
    {
        $n = ord ($str);
    }
    else
    {
#        croak "bad value [$str] for XmlUtf8Decode";
        die "bad value [" . sprintf('%vd', $str) . "] for XmlUtf8Decode\n";
    }
    $hex ? sprintf ("&#x%x;", $n) : "&#$n;";
}
# --- END SCRIPT ---

As it stands, this program runs fine, outputting:

    Decoding: 227.131.155
    Decoding: 227.131.188
    Decoding: 227.131.158
    Decoding: 227.131.188
    str = &#12507;&#12540;&#12510;&#12540;
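
As a cross-check on the numbers above, the three-byte case can be worked by hand. The sketch below is illustrative only: decode_three_bytes() is a hypothetical helper, not part of XML::DOM, and it uses the mask 0x0f for the lead byte where XmlUtf8Decode() uses 0x1f (the two agree for any lead byte in the range 0xE0-0xEF, because bit 4 is clear there).

```perl
use strict;
use warnings;

# Hypothetical helper (not from XML::DOM): combine the payload bits of
# one three-byte UTF-8 sequence, given as three byte values.
sub decode_three_bytes {
    my ($b0, $b1, $b2) = @_;
    return (($b0 & 0x0f) << 12)   # low 4 bits of the lead byte
         + (($b1 & 0x3f) << 6)    # low 6 bits of the first continuation byte
         +  ($b2 & 0x3f);         # low 6 bits of the second continuation byte
}

printf "%d\n", decode_three_bytes(227, 131, 155);   # 12507
printf "%d\n", decode_three_bytes(227, 131, 188);   # 12540
printf "%d\n", decode_three_bytes(227, 131, 158);   # 12510
```

For 227.131.155 this is (3 << 12) + (3 << 6) + 27 = 12288 + 192 + 27 = 12507, matching the first entity in the output.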

However, inserting a call to Encode::is_utf8() into 
XML::DOM::encodeText(), we find that the $str that it is working on when 
it fails has the UTF-8 flag set.  (This is because the XML::Parser 
module's Expat.xs file uses the SvUTF8_on macro if it is defined.)
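
The effect of the flag can be seen in isolation with a minimal sketch like the one below. It assumes a Perl with the Encode module available and deliberately omits "use bytes" so that length() and ord() report characters rather than bytes. Encode::_utf8_on() is an internal function used here only to imitate what SvUTF8_on does to the scalar.

```perl
use strict;
use warnings;
use Encode ();

# Three raw bytes that form one UTF-8 encoded character.
my $str = pack("C*", 227, 131, 155);

# Before: an unflagged byte string, three characters long.
printf "flag=%d length=%d\n",
    (Encode::is_utf8($str) ? 1 : 0), length($str);
# flag=0 length=3

# Turn the flag on without touching the underlying bytes.
Encode::_utf8_on($str);

# After: the same bytes are now seen as one character.
printf "flag=%d length=%d ord=%d\n",
    (Encode::is_utf8($str) ? 1 : 0), length($str), ord($str);
# flag=1 length=1 ord=12507
```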

So to reproduce the XML::DOM module's problem we must set the UTF-8 flag 
on our $str as well.  Inserting the following two lines:

    require Encode;
    Encode::_utf8_on($str);

after the definition of $str in the program above and then re-running it 
now produces the output:

    Decoding: 227.131.155
    Decoding: 131.158.227.131.188
    bad value [131.158.227.131.188] for XmlUtf8Decode

exactly like the errors reported by XML::DOM's test suite.

It would appear that the s///egs has matched the wrong parts of the $str 
-- XmlUtf8Decode() has been called with the value 131.158.227.131.188, 
which should never happen.

The following even shorter program demonstrates the same problem:

# --- START SCRIPT ---
use strict;
use warnings;
use bytes;

my $str = chr(227) . chr(131) . chr(155) .
          chr(227) . chr(131) . chr(188) .
          chr(227) . chr(131) . chr(158) .
          chr(227) . chr(131) . chr(188);

require Encode;
Encode::_utf8_on($str);

$str =~ s/([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)/dot()/egs;

printf STDERR "str = %vd\n", $str;

sub dot { '.' }
# --- END SCRIPT ---

This outputs:

    str = 46.227.131.188.227.46

instead of the expected:

    str = 46.46.46.46

Again, removing the Encode::_utf8_on() call "fixes" it.

Even more bizarrely, removing the /e modifier from the s///egs "fixes" 
it too: doing so, and also changing the '%vd' to '%s' in the printf(), 
outputs:

    str = dot()dot()dot()dot()

even with the Encode::_utf8_on() call left in!!!

Does anybody understand what is going on here and what I need to do to 
XML::DOM to fix it properly?
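
One direction that might be worth trying is sketched below. Since the string arrives with the UTF-8 flag set, the substitution could match whole characters instead of byte sequences and take the code point from ord() directly. This is only a sketch under assumptions: encode_non_ascii() is a hypothetical helper, not an XML::DOM function, it has not been checked against Perl 5.8.0 itself, and it requires that "use bytes" is not in effect.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from XML::DOM): format one character as a
# numeric character reference, hexadecimal if $hex is true.
sub char_ref {
    my ($char, $hex) = @_;
    return $hex ? sprintf("&#x%x;", ord $char) : "&#" . ord($char) . ";";
}

# Hypothetical helper: replace every non-ASCII character in a character
# string (UTF-8 flag on, no "use bytes") with a character reference.
sub encode_non_ascii {
    my ($text, $hex) = @_;
    $text =~ s/([^\x00-\x7F])/char_ref($1, $hex)/ge;
    return $text;
}

# The same twelve bytes as in the scripts above, decoded to four characters.
my $bytes = pack("C*", 227, 131, 155, 227, 131, 188,
                       227, 131, 158, 227, 131, 188);
my $chars = Encode::decode('UTF-8', $bytes);

print encode_non_ascii($chars), "\n";   # &#12507;&#12540;&#12510;&#12540;
```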

- Steve

