perl-xml
Re: how prevent XML::Parser from resolving entity references?
by Alois Heuboeck
Sep 16 2005 8:28AM
(I'd like to...)
> > 1- take an XML file
> > 2- in one script, replace everything above Unicode #x7F (end of ASCII) 
> > with entity references (which can either have "special" names, like 
> > &auml; or be based on the Unicode nb. like &#xAE;)
> > 3- then in another script, do some more transformations using XML::DOM 
> > and
> > 4- print out resulting XML
> >
> > My problem is that in the third step, when parsing its input, the 
> > XML::Parser seems to resolve those references that contain the HEX 
> > Unicode nb.; the "special name" references are not resolved.
>  
>  
>  
>  Strings like &#xAE; are character references rather than entity 
>  references. A character reference is just an alternative way to express 
>  a character code point. Parsers make no difference between a character 
>  encoded with a specific encoding (such as utf-8) and the character 
>  reference. Your step 2 doesn't make much sense to me as XML works well 
>  with Unicode. What is the reason for it?
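
The point about character references can be checked with a small experiment. The following is only a sketch, assuming XML::Parser is installed; `text_of` is a hypothetical helper name, and the sample documents are made up for illustration. It parses the same word once with a character reference and once with the raw UTF-8 bytes, then compares the text the parser reports.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: collect all character data the parser reports.
sub text_of {
    my ($xml) = @_;
    my $text = '';
    my $parser = XML::Parser->new(
        Handlers => {
            # The Char handler receives the Expat object and a text chunk.
            Char => sub { my ($expat, $str) = @_; $text .= $str },
        },
    );
    $parser->parse($xml);
    return $text;
}

my $as_reference = text_of('<p>dur&#xE9;e</p>');      # character reference
my $as_utf8      = text_of("<p>dur\xC3\xA9e</p>");    # raw UTF-8 bytes for U+00E9

print $as_reference eq $as_utf8 ? "same text\n" : "different text\n";
```

If both calls return the same string, the application behind the parser has no way of telling which form the source file used.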


Petr & other Perlers,

thanks for your reply.
I'm working on a linguistic corpus project. Some of the tools that need 
to be able to use the corpus texts are not Unicode-aware.
Basically, that means little more than that they cannot display such 
characters.
My thought was that if each such character were re-coded as a character 
reference, we still couldn't display it, but at least the information 
would be retrievable by looking at the underlying XML file (and looking 
up the code point in a Unicode table).
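
The re-coding step described above can be sketched roughly as follows. This is an illustration under assumptions, not the script actually used in the project: `to_char_refs` is a hypothetical helper name, and it expects text that has already been decoded into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F with a
# hexadecimal numeric character reference such as &#xE9;.
sub to_char_refs {
    my ($text) = @_;    # expects a decoded character string, not raw bytes
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

print to_char_refs("dur\x{E9}e"), "\n";    # prints dur&#xE9;e
```

The output is pure ASCII, so tools that are not Unicode-aware can at least pass it through unchanged.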

Does this make sense to you?


But then I ran into another problem, even when I skipped the re-coding 
phase.
I still use the script from step 2, which prints out the file after some 
transformations:

-----------------------------------
	#!/usr/bin/perl

	use strict;
	use warnings;
	use encoding 'utf-8';

	my $infile = "file1.xml";
	my $outfile = "file2.xml";

	print "OUT =:\n$outfile\n\n";

	open IN, "$infile" or die "\ncannot read specified infile\n$infile\n";
	my $text = join "", <IN> ;
	close IN;

	# etc. etc.

	# finally print it out

	open OUT, "> $outfile" or die "cannot create out file";

	# Alternatively, I tried this but it
	# seems to make no difference:
	# open OUT, "> :encoding(utf-8)", $outfile or die "cannot create out file";

	print OUT $text;
	close OUT;
-----------------------------------
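
For comparison, here is a sketch of the same read-and-write round trip with the encoding stated explicitly on both handles, so that neither depends on a pragma or a default. It assumes three-argument open with PerlIO layers; `slurp_utf8` and `spew_utf8` are hypothetical helper names, and a temporary file stands in for the real input and output files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as decoded UTF-8 characters.
sub slurp_utf8 {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path or die "cannot read $path: $!";
    local $/;    # slurp mode: read the whole file at once
    my $text = <$in>;
    close $in;
    return $text;
}

# Hypothetical helper: write a character string out as UTF-8 bytes.
sub spew_utf8 {
    my ($path, $text) = @_;
    open my $out, '>:encoding(UTF-8)', $path or die "cannot create $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";
    return;
}

# Round trip through a temporary file.
my (undef, $tmp) = tempfile(UNLINK => 1);
spew_utf8($tmp, "dur\x{E9}e");
my $back = slurp_utf8($tmp);
print length($back), " characters, ", -s $tmp, " bytes on disk\n";
```

With both layers in place, the character count and the byte count differ by one, because the single character U+00E9 is stored as two bytes.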


Here's a snippet of the output I'm getting (here all text, no mark-up):

	...to which Henri Bergson referred as "durée"; the way in which...

... which is OK.
Then, I open the file with the next script, parse it and print it out:

-----------------------------------
	#!/usr/bin/perl
	use strict;
	use XML::DOM;
	use warnings;

	my $infile = "file2.xml";
	my $outfile = "file3.xml";

	my $dom_parser = XML::DOM::Parser->new();
	my $TREE = $dom_parser->parsefile($infile);

	# no adjacent text nodes
	$TREE->normalize();

	open OUT, "> $outfile" or die "could not open outfile";
	print OUT $TREE->toString();
	close OUT;
-----------------------------------


The line from above now looks like this:

	...to which Henri Bergson referred as "dur&#14949;; the way in which...

I suspect that the parser reads its input in the wrong encoding, but I 
really can't see how: I thought that both files were UTF-8?
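
One way to test that suspicion is to check whether the bytes in the intermediate file really are well-formed UTF-8. The following is a minimal sketch, assuming the core Encode module; `is_valid_utf8` is a hypothetical helper name, and the two sample byte strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # With FB_CROAK, decode() dies on malformed input; eval traps that.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print is_valid_utf8("dur\xC3\xA9e") ? "valid\n" : "invalid\n";    # UTF-8 bytes for U+00E9
print is_valid_utf8("dur\xE9e")     ? "valid\n" : "invalid\n";    # Latin-1 byte for U+00E9
```

Reading file2.xml in raw mode and passing its contents to such a helper would show whether the accented character was stored as the two bytes C3 A9 or as the single Latin-1 byte E9.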

Best,
Alois
