perl-xml
Re: utf-8 (or not) encoding question
by Joshua Santelli
Dec 10 2004 8:42PM
OK, another quick question.  It looks like my IO was
the problem.  LibXML knew it was UTF-8 (at least
$source_xml->encoding said so) but this character came
in as UTF-8 and went out as Latin-1 here:

  print $fh $source_xml->toString();

When I used LibXML's toFH:

  my $rc = $source_xml->toFH($fh);

that got it right (or maybe I got lucky).  I opened
the file handle with:

  my $fh = new FileHandle "> $xmlFile";

Do I really need to specify the UTF-8 encoding for
each file handle something like this?

  my $fh = new FileHandle "> :encoding(utf-8) $xmlFile";

Can I trust that toFH() will do the right thing?  What
about XML::LibXML's toFile()?  I don't see much about
this in the perldoc.
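
The symptom described above (UTF-8 in, Latin-1 out) is what plain Perl does when a character string is printed to a handle that has no encoding layer. The following is a minimal sketch of that behaviour only, using core Perl and in-memory file handles; it does not call XML::LibXML, and the helper name bytes_written is made up for illustration.

```perl
use strict;
use warnings;

# A Perl character string containing U+00F3 (o with acute).
my $text = "Calder\x{F3}n";

# Hypothetical helper: write the string through a handle opened with the
# given mode and return the bytes that reach the in-memory file, as hex.
sub bytes_written {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

# No layer: the character goes out as the single Latin-1 byte F3.
my $raw_hex = bytes_written('>', $text);

# With an encoding layer: the character goes out as UTF-8 (two bytes).
my $utf8_hex = bytes_written('>:encoding(UTF-8)', $text);

print "no layer:         $raw_hex\n";
print ":encoding(UTF-8): $utf8_hex\n";
```

Note that the layer is given with the three-argument form of open. Whether a layer is wanted for XML::LibXML output depends on what the installed version returns: as far as I know, current XML::LibXML documentation describes toString() on a document as returning bytes already in the document's encoding, and toFH() and toFile() as expecting a handle without encoding layers, in which case adding :encoding(UTF-8) would encode twice. That is worth checking against the perldoc for the version in use.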

Thanks again,
Josh


--- Martin Leese <geomatics@[...].com>  wrote:

>  > Subject: utf-8 (or not) encoding question
>  > From: Joshua Santelli <santellij@[...].com>
>  > Date: Thu, 9 Dec 2004 10:21:22 -0800 (PST)
>  > To: perl-xml@[...].com
>  >
>  >Hello,
>  >
>  >I'm using XML::LibXML to parse a file that I have.
>  >The character in question looks like one byte (F3)
>  >when I `less` the file on UNIX:
>  >
>  >analysis and algebraic topology, such as
>  >Calder<F3>n-Zygmund theory
>  >
>  >This is the error I get when I parse_file() the file:
>  >
>  ...
>  
>  >Is LibXML correct in thinking that this is not
>  >UTF-8?
>  >
>  Yes.
>  
>  >Is there an easy way for me to tell if this
>  >(or any file) is properly encoded as UTF-8?
>  >
>  I believe you have found such a way.
>  
>  >What's wrong with F3 (&#243;)?
>  >
>  Nothing.  It simply isn't UTF-8 encoded.
>  
>  It is the ISO-8859-1 (Latin-1) encoding for a small letter
>  o with acute.  This is Unicode code point U+00F3.
>  
>  To see how to encode this codepoint as UTF-8, visit:
>  http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404
>  and look at Table 3-5.
>  
>  I calculate that the correct UTF-8 encoding for this codepoint
>  would be the pair of bytes C3 B3.
>  
>  Regards,
>  Martin
>  
>  
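
The calculation quoted above can be reproduced in a few lines. This is a minimal sketch, assuming the standard two-byte UTF-8 pattern (110yyyyy 10xxxxxx) for code points U+0080 through U+07FF; the function name two_byte_utf8 is made up for illustration, and the result is cross-checked against Perl's core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a code point in the range U+0080..U+07FF
# by hand, using the two-byte pattern 110yyyyy 10xxxxxx.
sub two_byte_utf8 {
    my ($code_point) = @_;
    die "code point outside the two-byte range\n"
        if $code_point < 0x80 || $code_point > 0x7FF;
    my $lead  = 0xC0 | ($code_point >> 6);      # 110yyyyy: bits above the low six
    my $trail = 0x80 | ($code_point & 0x3F);    # 10xxxxxx: low six bits
    return sprintf '%02X %02X', $lead, $trail;
}

my $by_hand = two_byte_utf8(0xF3);

# Cross-check: let Encode produce the UTF-8 bytes for the same character.
my $by_encode = join ' ',
    map { sprintf '%02X', ord($_) } split //, encode('UTF-8', chr(0xF3));

print "by hand: $by_hand\n";
print "Encode:  $by_encode\n";
```

Both lines should show the same byte pair, which makes this a quick way to confirm what a given character ought to look like in a hex dump of a correctly encoded file.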

