perl-xml
Re: Can I prevent XML::DOM::Parser from resolving character entities?
by Grant McLean
Jul 12 2007 6:32PM

On Thu, 2007-07-12 at 19:55 -0500, Michael Boudreau wrote:
>  Thanks! That does what I need, except...
>  
>  My experience doesn't quite match what the FAQ says to expect. Using Perl
>  5.6.0:
>  
>     use utf8;
>     s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;
>  
>     Produces:
>  
>     In XML input:     Output after regex:
>     ™      =>  ™   [trademark symbol]
>     é      =>  é    [lowercase e with acute accent]
>  
>  
>     use utf8;  # [note the FAQ says this is not required with 5.6]

It's not required with 5.8.  It is required with 5.6.

>     s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

Sorry, I keep forgetting to update the FAQ. You probably really want:

  s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

Otherwise it escapes all your CR, LF and Tab characters too.
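
To make the difference between the two character classes concrete, here is a minimal sketch that runs both substitutions over the same string. The helper names and the sample string are made up for illustration and are not from the FAQ:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: the character class currently shown in the FAQ.
# Everything outside \x20-\x7F is escaped, including control characters.
sub escape_outside_printable {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Hypothetical helper: the corrected class. Only characters above
# \x7F are escaped, so Tab, CR and LF pass through untouched.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Sample: a Tab, a newline, and e-acute (character 233).
my $sample = "a\tb\n\x{E9}";

print escape_outside_printable($sample), "\n";
print escape_non_ascii($sample), "\n";
```

The first call should turn the Tab and newline into &#9; and &#10; as well as escaping the accented character; the second should leave the whitespace alone and escape only the accented character.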

>     Produces:
>  
>     In XML input:     Output after regex:
>     ™      =>  ™
>     é      =>  é
>  
>     But leaving out 'use utf8'; and still using the second regex:
>  
>     In XML input:     Output after regex:
>     ™      =>  â„¢
>     é      =>  é

Here's a short test script that demonstrates the regex working in 5.8
without the 'use utf8' line:

#!/usr/bin/perl

require 5.008;
use strict;
use warnings;

my $string = "TM: \x{2122}";

$string =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

print $string, "\n";

which outputs:

TM: &#8482;
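
The test above starts from a string that already holds the character U+2122. If your data arrives as raw UTF-8 bytes instead (an assumption on my part about your input, as I haven't seen how you read it), the regex sees each byte as a separate character and you get one reference per byte. A rough sketch for 5.8, using the core Encode module to decode first; the helper name is made up:

```perl
#!/usr/bin/perl

require 5.008;
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes into characters, then
# replace every non-ASCII character with a numeric character reference.
sub bytes_to_ascii_xml {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

# The three bytes of the UTF-8 encoding of U+2122 (trademark sign).
my $bytes = "TM: \xE2\x84\xA2";

# Same regex applied to the undecoded bytes: one reference per byte.
(my $undecoded = $bytes) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;

print $undecoded, "\n";
print bytes_to_ascii_xml($bytes), "\n";
```

The first line printed should contain three references (226, 132 and 162) and the second a single &#8482;.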

Cheers
Grant
