perl-xml
Re: Continuing Perl 5.8.0 Problems
by Steve Hay
Oct 2 2002 4:30PM

Petr Pajas wrote:

> Steve Hay <steve.hay@[...].com> writes:
>
> >Thus,
> >
> >    $str = decode('utf8', $str);
> >
> No, I hope I wrote that this is supposed to read the UTF-8 encoded octets,
> check that they form a valid UTF-8 encoded string, and turn the UTF8 flag ON.
>
> If you want to turn it OFF for a UTF-8 encoded string, you simply use
>
> $str = encode('utf8', $str);
>
> It takes the input string with the UTF-8 flag on and "encodes" it into
> UTF-8 octets, which effectively means that it only turns the UTF-8
> flag off.
> 
I've now tried this as well: initially it croaked with the error "can't 
convert!", presumably on one of the strings that was not flagged UTF-8 
to start with.

So then I tried:

    $str = encode('utf8', $str) if Encode::is_utf8($str);

This now produces the same result as:

    Encode::_utf8_off($str);

i.e. no errors from XmlUtf8Decode() any more, but various other tests
still fail, and I still don't know if this is the right thing to be
doing.  The reason I was trying to turn the UTF-8 flag off is that
the substitution done by encodeText() fails if the flag is on.

The following simple program demonstrates the same thing.  Can you
explain it?  It attempts to change each of a pair of UTF-8 encoded
characters to the ASCII `.' character (decimal 46):

# --- START OF SCRIPT ---
use strict;
use warnings;
use bytes;
use Encode;

sub dot { return chr(46) }

my $str = decode('utf8', (chr(194) . chr(129)) x 2);

my $str1 = $str;
printf "str1 = %vd, UTF-8 flag is %s\n",
    $str1, Encode::is_utf8($str1) ? 'ON' : 'OFF';

my $res1 = $str1 =~ s/(\xC2.)/dot()/egs;
printf "str1 = %vd after $res1 substitutions\n", $str1;

my $str2 = encode('utf8', $str);
printf "str2 = %vd, UTF-8 flag is %s\n",
    $str2, Encode::is_utf8($str2) ? 'ON' : 'OFF';

my $res2 = $str2 =~ s/(\xC2.)/dot()/egs;
printf "str2 = %vd after $res2 substitutions\n", $str2;
# --- END OF SCRIPT ---

This produces the following output:

# --- START OF OUTPUT ---
str1 = 194.129.194.129, UTF-8 flag is ON
str1 = 194.129.194.129 after  substitutions
str2 = 194.129.194.129, UTF-8 flag is OFF
str2 = 46.46 after 2 substitutions
# --- END OF OUTPUT ---
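
For comparison, here is a minimal sketch that looks at the same two strings without `use bytes', so that length() and ord() report what Perl treats as characters rather than the underlying bytes.  It assumes a Perl with the Encode module (5.8 or later); the helper codepoints() is hypothetical and not part of the script above:

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: join the ordinals of the characters Perl sees.
sub codepoints { return join '.', map { ord } split //, $_[0] }

# Same construction as in the script above, but without `use bytes'.
my $chars  = decode('utf8', (chr(194) . chr(129)) x 2);   # UTF-8 flag ON
my $octets = encode('utf8', $chars);                      # UTF-8 flag OFF

printf "chars:  length %d, ordinals %s\n",
    length($chars), codepoints($chars);
printf "octets: length %d, ordinals %s\n",
    length($octets), codepoints($octets);

# Prints:
# chars:  length 2, ordinals 129.129
# octets: length 4, ordinals 194.129.194.129
```

This only shows how the two strings differ when viewed as characters instead of bytes; it does not change anything in the test script itself.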

Why does the substitution fail if the UTF-8 flag is on?

- Steve
