perl-xml
how prevent XML::Parser from resolving entity references?
by Alois Heuboeck
Sep 15 2005 1:14PM
Hello,


I'm trying to do the following:

1- take an XML file
2- in one script, replace every character above Unicode #x7F (the end of 
ASCII) with an entity reference (which can either have a "special" name, 
like &amp;auml;, or be based on the Unicode number, like &amp;#x00AE;)
3- then in another script, do some more transformations using XML::DOM and
4- print out resulting XML
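
For reference, step 2 can be sketched roughly as follows. This is only a minimal illustration, not my actual script: the helper name escape_non_ascii() is made up for this example, and it only produces the numeric (hexadecimal) form, not the "special name" references.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch only): replace every
# character above U+007F with a hexadecimal character reference.
sub escape_non_ascii {
    my ($text) = @_;
    # ord() gives the code point; %04X pads it to at least four hex digits.
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord($1))/ge;
    return $text;
}

# \x{00E2} is LATIN SMALL LETTER A WITH CIRCUMFLEX.
print escape_non_ascii("\x{00E2}and LinkPopularity"), "\n";
```

Plain ASCII text passes through unchanged, so the markup itself is not touched as long as it contains only ASCII characters.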


My problem is that in the third step, when parsing its input, 
XML::Parser seems to resolve the references that contain the hexadecimal 
Unicode number; the "special name" references are not resolved.


My input looks somewhat like this:


     <?xml version="1.0" encoding="utf-8"?> 
     <!DOCTYPE TEI.2 SYSTEM "E:/TEI.dtd"> 
     <TEI.2> 
         <w:t> 
         &auml; NetMachanic&#x00AE;technical evaluation
         </w:t> 
         <w:t> 
         &#x00E2;and LinkPopularity are two tools for organisation.
         </w:t> 
         <w:t>  &#x00E2;&#x00E2;&#x00E2;&#x00E2; </w:t>
         <w:t>  &#x00AE;&#x00AE;&#x00AE;&#x00AE; </w:t>
     </TEI.2> 



I tried the option NoExpand and also implemented a default handler, 
which "will be called when an entity reference is seen in text" 
(http://www.socsci.umn.edu/ssrf/doc/xml/enno-xml-docs/users.erols.com/enno/xml/XML/Parser/Expat.html),
so I have:

--------------------

#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;

my $infile = "INFILE.xml";
my $dom_parser = XML::DOM::Parser->new(
    NoExpand => 1,
    Handlers => {
        Default => \&handle_default,
        Char    => \&handle_char,
    });

my $TREE = $dom_parser->parsefile($infile);

# here transform $TREE with XML::DOM

open OUT, ">OUTFILE.xml" or die "cannot write to OUT file: $!";
print OUT $TREE->toString();
close OUT;



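sub handle_char {
    my ($parser, $string) = @_;
    # String from the document that triggered this handler call
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);

    # Append to the log so that earlier entries are kept
    open LOG, ">>log.txt" or die "cannot append to log.txt: $!";
    print LOG "\n--\ncall of handle_char()\n";
    print LOG "[$string||$rec//$esc]\n";
    close LOG;
}

sub handle_default {
    my ($parser, $string) = @_;
    # String from the document that triggered this handler call
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);

    # Append to the log so that earlier entries are kept
    open LOG, ">>log.txt" or die "cannot append to log.txt: $!";
    print LOG "\n--\ncall of handle_default()\n";
    print LOG "[$string||$rec//$esc]\n";
    close LOG;
}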


--------------------

Now, my problems:

First, handle_default() is not called for the references &#x00AE; and 
&#x00E2;, but only for &auml;. 
&#x00AE; and &#x00E2; trigger handle_char() instead.

Second, the NoExpand option does not do what I thought it would, namely 
leave the entity references unexpanded.

Finally, in handle_char() the unresolved string can be seen in $rec and 
$esc; the resolved one is in $string. 
But how can I get the unresolved string into $TREE? All the textbook 
examples of handlers I have seen just print out some message.


Another strange thing occurs in the last two <w:t> elements: 
the first contains four references to LATIN SMALL LETTER A WITH 
CIRCUMFLEX, the second four references to the REGISTERED TRADEMARK SIGN. 
What I get (when I don't set the Default and Char handlers) is:
<t>  &#14467;â </t> for the first and
<t>  ®®®® </t> (four (R) signs) for the second.
In the first case, resolving the reference &#x00E2; seems to "eat" some 
of the following characters (this also occurs when it is followed by 
normal character text).


Could anyone please give advice? Thanks,

Alois


