perl-xml
Re: [ANNOUNCE] XML::LibXML 1.63
by Petr Pajas
May 4 2007 12:53AM
On Friday 04 May 2007, Bruce Miller wrote:
>  Petr Pajas wrote:
>  > On Thursday 03 May 2007 17:55, Bruce Miller wrote:
>  >> Petr Pajas wrote:
>  >>> On Thursday 03 May 2007, Bruce Miller wrote:
>  >>>> Petr Pajas wrote:
>  >>>>> Hi,
>  >>>>
>  >>>> Hi Petr;
>  >>>>   First off, thanks for maintaining the library, but...
>  >>>>
>  >>>> I was slow in catching the impact here:
>  >>>>> Changes over 1.62:
>  >>>>> 1.63
>  >>>>
>  >>>> ...
>  >>>>
>  >>>>>    - $doc->toString always returns octets
>  >>>>
>  >>>> Is this a good thing?
>  >>>> Don't I want a String of Characters??
>  >>>> (Or why would I think that I would...)
>  >>>> Doesn't this mean I need to wrap
>  >>>>   Encode::decode("utf-8",$doc->toString)
>  >>>> I can't quite get $doc->setEncoding("utf-8");
>  >>>> to get the old behavior back...
>  >>>> I'm missing a $doc->setActualEncoding("utf-8");
>  >>>
>  >>> Read Aristotle's explanation as well.
>  >>
>  >> Yes; I've responded to that message, too.
>  >>
>  >>> These now only affect the encoding of the resulting data, not
>  >>> the chars  vs. bytes semantics. This will just ensure that the
>  >>> data returned by $doc->toString or saved by $doc->toFH/toFile
>  >>> will be encoded in UTF-8 with the corresponding encoding
>  >>> declaration in <?xml ...?> (just as for any other of the many
>  >>> encodings supported by libxml2).
>  >>>
>  >>>> Or, do I just (still, again, ...) misunderstand
>  >>>> Perl's unicode handling?
>  >>>
>  >>> seems many people do:-)
>  >>>
>  >>> Consider this:
>  >>>
>  >>> #!/usr/bin/perl
>  >>> use XML::LibXML;
>  >>> print XML::LibXML->new->parse_file(shift)->toString(1);
>  >>>
>  >>> This script is *broken* with <= 1.62!
>  >>>
>  >>> Even this most elementary usage of the API won't work with 1.62
>  >>> and it is quite tricky to work around! What's wrong with it?
>  >>>
>  >>> If the document $ARGV[0] is in UTF-8, you'll just get the "wide
>  >>> characters on output" warning; if lucky, the output itself will
>  >>> be ok. You may try to fix the script with e.g. binmode STDOUT,
>  >>> ':utf8' to get rid of the warning, then UTF-8 encoded documents
>  >>> will work, but you'll get into big troubles with other
>  >>> encodings then.
>  >>
>  >> Indeed; I stumbled onto this shortly after posting.
>  >> Unfortunately, all my xml scripts are now littered with
>  >> binmode hacks which break instead of fix the code...
>  >> There's a little bit of dilemma since I've typically
>  >> set the binmode on _both_ STDOUT and STDERR
>  >> so that both the "good output", and error messages
>  >> would come out right; I suspect it will sort itself out.
>  >
>  > ok, I take it as you agree this change is indeed an improvement;-)
> 
>  Perhaps; or at least it's making the best of a bad
>  situation, namely perl's confusion of strings with
>  arrays of bytes.  If I had my druthers, that binmode
>  stuff would only be needed when you do _not_ use utf8,
>  rather than when you do.
> 
>  But if your
>    print $doc->toString
>  example is important, then, wouldn't it be more consistent
>  to have the default encoding (for an xml doc w/o an
>  encoding declaration) to correspond to the locale,
>  rather than utf8 ? 

I don't think so. In fact, libxml2 currently serializes documents without an
encoding declaration as ASCII, with all non-ASCII characters written as
character references, and it doesn't add an encoding declaration where there
is none. If you want locale-encoded XML, encode it yourself. Not all
use cases are the same, and the 'print' in my example is just an
example; I might as well send the result to a socket.
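
For illustration, "encode it yourself" could look like the minimal sketch below. The sample document and the target encoding (iso-8859-1) are my assumptions, chosen only for the example; substitute whatever encoding your environment needs.

```perl
use strict;
use warnings;
use XML::LibXML;

# A document without an encoding declaration. The literal contains the
# UTF-8 bytes for e-acute (\xC3\xA9); UTF-8 is the XML default encoding.
my $doc = XML::LibXML->new->parse_string(
    qq{<?xml version="1.0"?><a>\xC3\xA9</a>}
);

# Pick the output encoding explicitly. iso-8859-1 is only an example.
$doc->setEncoding('iso-8859-1');

# toString returns octets in that encoding, with a matching declaration.
my $bytes = $doc->toString;

# Write the octets unchanged: no PerlIO encoding layer on the handle.
binmode STDOUT;
print $bytes;
```

The same octets could just as well be written to a file or a socket; the point is only that the handle must not re-encode them.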

>  After all, we've already established 
>  that the result of toString isn't intended to be
>  a string of perl characters, but rather just binary.

Yes, a serialized document is a binary stream which declares its own encoding;
if it does not, it must be UTF-8 or UTF-16 (with a BOM). Libxml2 produces ASCII
in that case, which is a subset of UTF-8.

>  Also, this change brings up a bunch of other questions.
>  I expect that they're all done consistently, but the
>  documentation doesn't (always) say explicitly.
>  What type of data do ->textContent, ->nodeValue return?

Character strings, of course (don't confuse serialization with the data model:
the content of a DOM is textual, not binary!).

>  What type of data should be passed to ->appendText (and variants)?

character strings (as above)

The API can also handle data encoded in the document's original encoding, but
that is rather a relic from old times and I do not promote using it. Every time
you think of using it, think twice and don't;-)!

>  If it's a string-of-perl-utf8-chars will it be converted
>  to the document's encoding ?

no need to, you answered yourself:-), details follow

>  [Hmm, but doesn't libxml2 store all data internally
>  as utf8? So, maybe it doesn't matter...]

Exactly, libxml2 keeps all data internally in UTF-8, so (thanks to the lucky
coincidence that Perl uses UTF-8 for strings too) no conversion is needed when
passing strings to or pulling them from the DOM. (The only exception is when you
pass binary data in the document encoding, which XML::LibXML has to re-encode
to UTF-8 before handing it down to libxml2.)

The conversion from the internal representation of strings to the document
encoding is done only when serializing, e.g. with $doc->toString.
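
To make the split between data model and serialization concrete, here is a small sketch. The element name and sample text are made up for the example, and I assume a document created with UTF-8 as its encoding.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('a');
$doc->setDocumentElement($root);

# Data model: pass in and get back Perl character strings.
$root->appendText("\x{e9}\x{20ac}");   # two characters: e-acute, euro sign
my $text = $root->textContent;
my $nchars = length $text;             # counted in characters: 2

# Serialization: octets in the document encoding (UTF-8 here).
my $bytes = $doc->toString;

binmode STDOUT;                        # the octets go out unchanged
print "characters in text node: $nchars\n";
print $bytes;
```

In the serialized output the two characters take five octets (two for e-acute, three for the euro sign), which is exactly the difference between what ->textContent and ->toString hand back.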

-- Petr
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs