How to prevent XML::Parser from resolving entity references?
by Alois Heuboeck
Sep 15 2005 1:14PM
Hello,
I'm trying to do the following:
1- take an XML file
2- in one script, replace everything above Unicode #x7F (end of ASCII)
with entity references (which can either have "special" names, like
&auml;, or be based on the Unicode number, like &#xAE;)
3- then in another script, do some more transformations using XML::DOM and
4- print out resulting XML
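Step 2 above could be sketched roughly as follows. This is only an illustration, not the script actually used: the helper name escape_non_ascii and the %NAMED table are hypothetical, and the sketch assumes the text has already been decoded into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical table of code points that should get a "special name";
# every name used here would have to be declared in the DTD.
my %NAMED = ( 0xE4 => 'auml' );

# Replace each character above U+007F with a named reference if one is
# listed in %NAMED, otherwise with a hexadecimal character reference.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        exists $NAMED{$cp} ? "&$NAMED{$cp};" : sprintf('&#x%X;', $cp);
    }ge;
    return $text;
}

print escape_non_ascii("\x{E4} NetMachanic\x{AE}technical evaluation"), "\n";
# prints: &auml; NetMachanic&#xAE;technical evaluation
```

The same helper could in principle be applied again to the serialized output of a later step, since it works on plain strings and does not depend on any parser.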
My problem is that in the third step, when parsing its input,
XML::Parser seems to resolve those references that contain the hex
Unicode number; the "special name" references are not resolved.
My input looks somewhat like this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TEI.2 SYSTEM "E:/TEI.dtd">
<TEI.2>
<w:t>
&auml; NetMachanic&#xAE;technical evaluation
</w:t>
<w:t>
&#xE2;and LinkPopularity are two tools for organisation.
</w:t>
<w:t> &#xE2;&#xE2;&#xE2;&#xE2; </w:t>
<w:t> &#xAE;&#xAE;&#xAE;&#xAE; </w:t>
</TEI.2>
I tried the option NoExpand and also implemented a default handler,
which "will be called when an entity reference is seen in text"
(http://www.socsci.umn.edu/ssrf/doc/xml/enno-xml-docs/users.erols.com/enno/xml/XML/Parser/Expat.html),
so I have:
--------------------
#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;
my $infile = "INFILE.xml";
my $dom_parser = new XML::DOM::Parser(
    NoExpand => 1,
    Handlers => {
        Default => \&handle_default,
        Char    => \&handle_char,
    });
my $TREE = $dom_parser->parsefile($infile);
# here transform $TREE with XML::DOM
open OUT, "> OUTFILE.xml" or die "cannot write to OUT file";
print OUT $TREE->toString();
close OUT;
sub handle_char {
    my ($parser, $string) = @_;
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);
    open LOG, ">>log.txt";    # append to the log
    print LOG "\n--\ncall of handle_char()\n";
    print LOG "[$string||$rec//$esc]\n";
}
sub handle_default {
    my ($parser, $string) = @_;
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);
    open LOG, ">>log.txt";    # append to the log
    print LOG "\n--\ncall of handle_default()\n";
    print LOG "[$string||$rec//$esc]\n";
}
--------------------
Now, my problems:
First, handle_default() is not called for the entity references &#xAE;
and &#xE2;, but only for &auml;.
&#xAE; and &#xE2; trigger handle_char() instead.
Second, the NoExpand option does not do what I thought it would, namely
prevent the entity references from being expanded.
Finally, the unresolved string in handle_char() can be seen in $rec and
$esc; the resolved one is in $string.
But how can I get this out to $TREE? All the textbook examples of
handlers I saw just printed out some message.
Another strange thing occurs in the last two <w:t> elements:
the first contains four references to the small letter a with circumflex; the
second contains four references to the REGISTERED TRADEMARK SIGN. What I get
(when I don't set the Default and Char handlers) is:
<t> 㢃â </t> for the first and
<t> ®®®® </t> (four (R) signs) for the second.
In the first case, resolving the reference &#xE2; seems to "eat" some
of the following characters (this also occurs when it is followed by normal
character text).
Could anyone please give advice? Thanks,
Alois
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs