Re: utf-8 (or not) encoding question
by Joshua Santelli
Dec 10 2004 8:42PM
OK, another quick question. It looks like my I/O was
the problem. LibXML knew it was UTF-8 (at least
$source_xml->encoding said so), but this character came
in as UTF-8 and went out as Latin-1 here:
print $fh $source_xml->toString();
When I used LibXML's toFH:
my $rc = $source_xml->toFH($fh);
that got it right (or maybe I got lucky). I opened
the file handle with:
my $fh = new FileHandle "> $xmlFile";
Do I really need to specify the UTF-8 encoding for
each file handle, with something like this?
my $fh = new FileHandle "> :encoding(utf-8) $xmlFile";
Can I trust that toFH() will do the right thing? What
about XML::LibXML's toFile()? I don't see much about
this in the perldoc.
Thanks again,
Josh
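
The symptom described above can be reproduced with core Perl alone. The following is a minimal sketch, not a statement about what XML::LibXML does internally: it assumes the serialized text reaches print as a Perl character string containing U+00F3, and it uses an in-memory file handle in place of $xmlFile. The helper name bytes_written is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through a handle opened with $mode
# (writing to an in-memory buffer) and return the raw bytes produced.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $text = "Calder\x{F3}n";    # character string containing U+00F3

# No encoding layer: the character goes out as the single byte F3.
my $plain = bytes_written('>', $text);

# Explicit layer: the character goes out as the two bytes C3 B3.
my $utf8 = bytes_written('>:encoding(UTF-8)', $text);

print "no layer:    ", hex_dump($plain), "\n";
print "UTF-8 layer: ", hex_dump($utf8),  "\n";
```

With a real file, the same layer can be given in the three-argument form of open, for example open(my $fh, '>:encoding(UTF-8)', $xmlFile). Whether toFH() and toFile() need such a layer is a separate question that this sketch does not settle.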
--- Martin Leese <geomatics@[...].com> wrote:
> > Subject: utf-8 (or not) encoding question
> > From: Joshua Santelli <santellij@[...].com>
> > Date: Thu, 9 Dec 2004 10:21:22 -0800 (PST)
> > To: perl-xml@[...].com
> >
> >Hello,
> >
> >I'm using XML::LibXML to parse a file that I have.
> >The character in question looks like one byte (F3)
> >when I `less` the file on UNIX:
> >
> >analysis and algebraic topology, such as
> >Calder<F3>n-Zygmund theory
> >
> >This is the error I get when I parse_file() the file:
> >
> >
> ...
>
> >Is LibXML correct in thinking that this is not
> >UTF-8?
> >
> Yes.
>
> >Is there an easy way for me to tell if this
> >(or any file) is properly encoded as UTF-8?
> >
> >
> I believe you have found such a way.
>
> >What's wrong with F3 (&#243;)?
> >
> >
> Nothing. It simply isn't UTF-8 encoded.
>
> It is the ISO-8859-1 (Latin-1) encoding for a small letter
> o with acute. This is Unicode code point U+00F3.
>
> To see how to encode this codepoint as UTF-8, visit:
>
> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404
> and look at Table 3-5.
>
> I calculate that the correct UTF-8 encoding for this
> code point would be the pair of bytes C3 B3.
>
> Regards,
> Martin
>
>
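
The calculation in the quoted reply, and the earlier question about checking whether a file is properly encoded, can both be sketched with the core Encode module. This is a minimal illustration under stated assumptions: the helper name is_valid_utf8 is hypothetical, and the check works on a byte string already read from the file in raw mode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00F3 is binary 00011 110011 (11 bits). The two-byte UTF-8 pattern
# 110xxxxx 10yyyyyy therefore gives 11000011 10110011, i.e. C3 B3.
my $bytes = encode('UTF-8', "\x{F3}");
printf "U+00F3 as UTF-8: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Hypothetical helper: returns 1 if $octets decodes as strict UTF-8,
# 0 otherwise. FB_CROAK makes decode die on malformed input, and
# LEAVE_SRC keeps decode from modifying the caller's string.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $latin1_line = "Calder\xF3n-Zygmund theory";      # lone F3 byte
my $utf8_line   = "Calder\xC3\xB3n-Zygmund theory";  # C3 B3 pair

print "Latin-1 bytes valid UTF-8? ", is_valid_utf8($latin1_line), "\n";
print "UTF-8 bytes valid UTF-8?   ", is_valid_utf8($utf8_line),   "\n";
```

A lone F3 byte followed by an ASCII letter is rejected because F3 announces a four-byte sequence whose continuation bytes never arrive, which is consistent with the parser refusing the file.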