ASPN ActiveState Programmer Network
perl-xml
Re: A whitespace issue in XML::LibXML
by Petr Pajas
Jul 20 2007 9:26AM
On Friday 20 July 2007 17:34, Birgit Kellner wrote:
>  Here's a follow-up to my question from yesterday about parsing
>  footnotes with XML::LibXML.
> 
>  This time, I'm parsing verse lines, so it's a different
>  situation, but the replies to yesterday's query helped me to
>  build a string out of the text nodes in a mixed-content node that
>  is part of a larger structure.

First of all, why do you do that "by hand"? To get all text nodes 
from a subtree nicely concatenated, you can use e.g.

$text = $node->findvalue('string(.)');
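
As a minimal sketch of that suggestion, assuming a cut-down version of the markup quoted below (the sample text, the <hi> element and the variable names are made up for illustration), each <l> element can be turned into one string without any hand-written recursion:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data, modelled on the structure quoted in this thread.
my $xml = <<'XML';
<lg><note><span><l>
<seg n="a">first <hi>half</hi></seg>
<seg n="b">second half</seg></l><l>
<seg n="c">third</seg>
<seg n="d">fourth</seg>
</l></span></note>
</lg>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# One string per verse line: string(.) concatenates every descendant
# text node of the context node in document order.
my @lines;
for my $l ($doc->findnodes('//l')) {
    my $text = $l->findvalue('string(.)');
    $text =~ s/\s+/ /g;      # collapse newlines and runs of blanks
    $text =~ s/^ | $//g;     # trim one leading and one trailing blank
    push @lines, $text;
}

print "$_\n" for @lines;
```

Because the expression is evaluated per <l> element, whitespace-only text nodes outside the verse lines (such as a newline before </lg>) never enter the result. For element nodes, $node->textContent should give the same string as findvalue('string(.)').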

>  I'm recursively parsing from a higher level all the way down,
>  always passing a node to a subroutine. If it is a text node, the
>  subroutine appends its data content to a scalar and ends; if the
>  node has children, the subroutine is called again.
> 
>  This is the code (minus a few attributes):
> 
>  <lg><note><span><l>
>  <seg n="a">......</seg>
>  <seg n="b">......</seg></l><l>
>  <seg n="c">......</seg>
>  <seg n="d">......</seg>
>  </l></span><app>
>  ...
>  ...
>  </app></note></lg>
> 
>  The text nodes are inside the <seg>-elements, which may also
>  contain further <note>-elements, and whatnot.
> 
>  The <l>-elements are verse lines, and their data content needs to
>  be isolated.
>  This is done in two different ways: for the <l>-element
>  containing <seg>s "a" and "b", the trigger is the beginning of
>  <seg> "c". The scalar content is then copied and emptied out.
> 
>  For the <l>-element containing <seg>s "c" and "d", the routine
>  checks if the last <seg> parsed was "d", if the node in question
>  is a text node and has no next sibling. This means it's the last
>  text node in the last segment of the verse, and thus the second
>  verse line is complete.
> 
>  This last check runs into problems, however, when there is
>  additional whitespace before the closing </lg>-tag:
> 
>  <lg><note><span><l>
>  <seg n="a">......</seg>
>  <seg n="b">......</seg></l><l>
>  <seg n="c">......</seg>
>  <seg n="d">......</seg>
>  </l></span><app>
>  ...
>  ...
>  </app></note>
>  </lg>
> 
>  The newline after the closing </note> tag results in a text node
>  with whitespace as its content. When the script arrives at that
>  node, it logically determines that this is the final text node,
>  has no siblings, and that the last <seg> parsed was "d".
> 
>  I can get around this problem by testing, in addition to checking
>  on the "n"-attribute of the last parsed <seg> and the absence of
>  a next sibling, whether the text node is actually inside <seg>
>  ($el->findnodes('ancestor::seg') - need not necessarily be a
>  parent).
> 
>  But still, I'm intrigued that this actually happens, and was
>  wondering whether there is any way to make XML::LibXML ignore
>  this whitespace.
> 
>  Birgit
> 
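
The extra check described in the quoted message might look roughly like the following sketch. The helper name inside_seg and the sample data are assumptions made for illustration, not code from the original script:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample with a newline between </note> and </lg>.
my $xml = qq{<lg><note><span><l>\n<seg n="c">third</seg>\n}
        . qq{<seg n="d">fourth</seg>\n</l></span></note>\n</lg>};
my $doc = XML::LibXML->new->parse_string($xml);

# True when the node has a <seg> somewhere among its ancestors,
# not necessarily as its direct parent.
sub inside_seg {
    my ($node) = @_;
    my @ancestors = $node->findnodes('ancestor::seg');
    return scalar @ancestors;
}

# Sort every text node in the document into "verse text" and "the rest".
my (@kept, @skipped);
for my $t ($doc->findnodes('//text()')) {
    if (inside_seg($t)) {
        push @kept, $t->data;
    }
    else {
        push @skipped, $t;   # whitespace between elements ends up here
    }
}

print join(' ', @kept), "\n";
```

With this test in place, the whitespace-only node before </lg> is simply skipped, so it can no longer be mistaken for the last text node of the verse.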

Without a DTD or schema, there is no reasonable way to distinguish 
ignorable from non-ignorable whitespace in XML. You can try calling 
$parser->keep_blanks(0) before you parse the XML into memory and see 
if it solves your problem.
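
A small sketch of that experiment, assuming a reduced, made-up document with a newline between </note> and </lg>. It compares the last child of <lg> under the default settings and with keep_blanks(0):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: note the newline between </note> and </lg>.
my $xml = qq{<lg><note><l><seg n="d">text</seg></l></note>\n</lg>};

# Default parser settings: blank text nodes are kept in the tree.
my $default_parser = XML::LibXML->new;
my $doc_kept       = $default_parser->parse_string($xml);
my $last_kept      = $doc_kept->documentElement->lastChild;

# keep_blanks(0) has to be set on the parser before parsing.
my $parser = XML::LibXML->new;
$parser->keep_blanks(0);
my $doc_stripped  = $parser->parse_string($xml);
my $last_stripped = $doc_stripped->documentElement->lastChild;

# nodeType 3 is a text node, nodeType 1 is an element node.
printf "default:        type %d (%s)\n",
    $last_kept->nodeType, $last_kept->nodeName;
printf "keep_blanks(0): type %d (%s)\n",
    $last_stripped->nodeType, $last_stripped->nodeName;
```

With the default settings the last child of <lg> should be the whitespace text node; with keep_blanks(0) it should be the <note> element. Since the parser has no DTD to consult, it decides heuristically which blank nodes to drop, so it is worth checking that whitespace you care about inside mixed content survives.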

-- Petr