Re: how prevent XML::Parser from resolving entity references?
by Alois Heuboeck
Sep 16 2005 8:28AM
Perl XML project
& XSLT (I'd like to...)
> > 1- take an XML file
> > 2- in one script, replace everything above Unicode #x7F (end of ASCII)
> > with entity references (which can either have "special" names, like
> > &auml; or be based on the Unicode nb. like &#xAE;)
> > 3- then in another script, do some more transformations using XML::DOM
> > and
> > 4- print out resulting XML
> >
> > My problem is that in the third step, when parsing its input, the
> > XML::Parser seems to resolve those references that contain the HEX
> > Unicode nb.; the "special name" references are not resolved.
>
>
>
> Strings like &#xAE; are character references rather than entity
> references. A character reference is just an alternative way to express
> a character code point. Parsers make no difference between a character
> encoded with a specific encoding (such as utf-8) and the character
> reference. Your step 2 doesn't make much sense to me as XML works well
> with Unicode. What is the reason for it?
Petr
& other Perlers,
thanks for your reply.
I'm working on a linguistic corpus project. Some of the tools that the
corpus texts should be usable with are not Unicode-aware.
Basically, that means little more than that they cannot display such characters.
My thought was that by re-coding as a character reference, although we
still couldn't display it, at least the information would be retrievable
by having a look at the underlying XML file (look up the code point in a
Unicode table).
Does this make sense to you?
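For illustration, the re-coding described above could be sketched roughly as follows. This is only a minimal sketch, not the actual step-2 script: the function name to_char_refs is hypothetical, and it assumes the text is already a decoded character string (not raw UTF-8 bytes) and that hexadecimal character references are wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F with a
# hexadecimal character reference. Assumes $text is a decoded
# character string, so ord() returns the Unicode code point.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

my $sample = "dur\x{E9}e";              # "durée"
print to_char_refs($sample), "\n";      # prints dur&#xE9;e
```

ASCII characters, including existing markup, pass through unchanged, so the result stays readable by tools that are not Unicode-aware.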
But then, I also encountered another problem when I skipped the phase of
re-coding:
I still have the script of step 2, which prints out the file after some
transformations:
-----------------------------------
#!/usr/bin/perl
use strict;
use warnings;
use encoding 'utf-8';
my $infile = "file1.xml";
my $outfile = "file2.xml";
print "OUT =:\n$outfile\n\n";
open IN, "$infile" or die "\ncannot read specified infile\n$infile\n";
my $text = join "", <IN> ;
close IN;
# etc. etc.
# finally print it out
open OUT, "> $outfile" or die "cannot create out file";
# Alternatively, I tried this but it
# seems to make no difference:
# open OUT, "> :encoding(utf-8)", $outfile or die "cannot create out file";
print OUT $text;
close OUT;
-----------------------------------
Here's a snippet of the output I'm getting (here all text, no mark-up):
...to which Henri Bergson referred as "durée"; the way in which...
... which is OK.
Then, I open the file with the next script, parse it and print it out:
-----------------------------------
#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;
my $infile = "file2.xml";
my $outfile = "file3.xml";
my $dom_parser = new XML::DOM::Parser();
my $TREE = $dom_parser->parsefile($infile);
# no adjacent text nodes
$TREE->normalize();
open OUT, "> $outfile" or die "could not open outfile";
print OUT $TREE->toString();
close OUT;
-----------------------------------
The line from above now looks like this:
...to which Henri Bergson referred as "dur㩥; the way in which...
I suspect that the parser interprets the input stream in some wrong
encoding. But I really can't see how; I thought that both were UTF-8?
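One thing that might be worth checking (a guess, not a confirmed diagnosis) is the output side rather than the input side: when a Perl character string containing "é" is printed to a filehandle that has no encoding layer, the character is written as the single byte E9 instead of the UTF-8 sequence C3 A9. The sketch below shows the difference using in-memory files; the helper name bytes_written is hypothetical and XML::DOM is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper: write $string to an in-memory file opened with
# the given mode and return the raw bytes that ended up in it.
sub bytes_written {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

# Render a byte string as space-separated hex values.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $text = "dur\x{E9}e";    # character string containing U+00E9

print "no layer:    ", hex_dump(bytes_written('>', $text)), "\n";
print "UTF-8 layer: ", hex_dump(bytes_written('>:encoding(UTF-8)', $text)), "\n";
# no layer:    64 75 72 E9 65
# UTF-8 layer: 64 75 72 C3 A9 65
```

If this is what happens in the second script, opening the output file with an explicit ">:encoding(UTF-8)" mode before printing the result of toString() may be enough, but that is an assumption to verify against the actual files.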
Best,
Alois
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
Thread:
Alois Heuboeck
Petr Cimprich
Alois Heuboeck