Re: Can I prevent XML::DOM::Parser from resolving character entities?
by Grant McLean
Jul 12 2007 6:32PM
On Thu, 2007-07-12 at 19:55 -0500, Michael Boudreau wrote:
> Thanks! That does what I need, except...
>
> My experience doesn't quite match what the FAQ says to expect. Using Perl
> 5.6.0:
>
> use utf8;
> s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;
>
> Produces:
>
> In XML input: Output after regex:
> ™ => ™ [trademark symbol]
> é => é [lowercase e with acute accent]
>
>
> use utf8; # [note the FAQ says this is not required with 5.6]

It's not required with 5.8. It is required with 5.6.

> s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

Sorry, I keep forgetting to update the FAQ. You probably really want:

s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

Otherwise it escapes all your CR, LF and Tab characters too.

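To make the difference between the two character classes concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the two regexes come from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a Tab, a LF and one non-ASCII character (U+00E9).
my $sample = "a\tb\n\x{e9}";

# FAQ version: escapes everything outside \x20-\x7F, so Tab and LF go too.
(my $faq = $sample) =~ s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

# Corrected version: escapes only characters above \x7F.
(my $fixed = $sample) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

print "$faq\n";      # a&#9;b&#10;&#233;
print "$fixed\n";    # Tab and LF are left alone, only the e-acute is escaped
```

With the second class the whitespace passes through untouched, which is normally what you want when the result is going back into an XML document.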
> Produces:
>
> In XML input: Output after regex:
> ™ => ™
> é => é
>
> But leaving out 'use utf8'; and still using the second regex:
>
> In XML input: Output after regex:
> ™ => â„¢
> é => é
Here's a short test script that demonstrates the regex working in 5.8
without the 'use utf8' line:
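#!/usr/bin/perl
require 5.008;
use strict;
use warnings;

# \x{2122} is the trademark sign, a character above \x7F
my $string = "TM: \x{2122}";
# Replace each non-ASCII character with its numeric character reference
$string =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;
print $string, "\n";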
which outputs:
TM: &#8482;
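
The script above starts from a string that already holds the character U+2122. If the data instead arrives as raw UTF-8 bytes, the regex sees three separate bytes rather than one character. The following is only a sketch of that situation under 5.8, assuming the input is UTF-8 and using the core Encode module to decode it first; the byte string and variable names are invented for the example.

```perl
#!/usr/bin/perl
require 5.008;
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: the UTF-8 bytes for U+2122, as read without any decoding.
my $bytes = "TM: \xE2\x84\xA2";

# Without decoding, each byte above \x7F is escaped on its own.
(my $undecoded = $bytes) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

# Decoding first turns the three bytes into a single character.
my $chars = decode('UTF-8', $bytes);
(my $decoded = $chars) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

print "$undecoded\n";    # TM: &#226;&#132;&#162;
print "$decoded\n";      # TM: &#8482;
```

Whether this matches what is happening in your 5.6 run I can't say for sure, but three references where you expected one is a good hint that the string is being treated as bytes.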
Cheers
Grant