perl-xml
A whitespace issue in XML::LibXML
by Birgit Kellner
Jul 20 2007 8:36AM

Here's a follow-up to my question from yesterday about parsing footnotes 
with XML::LibXML.

This time, I'm parsing verse lines, so it's a different situation, but 
the replies to yesterday's query helped me to build a string out of the 
text nodes in a mixed-content node that is part of a larger structure.

I'm recursively parsing from a higher level all the way down, always 
passing a node to a subroutine. If it is a text node, the subroutine 
appends its data content to a scalar and ends; if the node has children, 
the subroutine is called again.
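
A minimal sketch of the recursive traversal described above, assuming a much smaller input than the real document. The sample XML and the subroutine name collect_text() are illustrative, not taken from the original script.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical miniature input; the real document has more levels.
my $xml = '<l><seg n="a">foo <note>bar</note></seg><seg n="b"> baz</seg></l>';
my $doc = XML::LibXML->new->parse_string($xml);

my $text = '';    # accumulates the data of every text node visited

# collect_text() is an illustrative name, not from the original script.
sub collect_text {
    my ($node) = @_;
    if ( $node->nodeType == XML_TEXT_NODE ) {
        $text .= $node->data;    # text node: append its data and stop
        return;
    }
    collect_text($_) for $node->childNodes;    # otherwise recurse into children
}

collect_text( $doc->documentElement );
print "$text\n";    # prints "foo bar baz"
```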

This is the XML structure (minus a few attributes):

<lg> <note><span><l>
<seg n="a"> ......</seg>
<seg n="b"> ......</seg></l><l>
<seg n="c"> ......</seg>
<seg n="d"> ......</seg>
</l> </span><app>
...
...
</app> </note></lg>

The text nodes are inside the <seg> elements, which may also contain 
further <note> elements, and whatnot.

The <l> elements are verse lines, and their data content needs to be 
isolated.
This is done in two different ways: for the <l> element containing 
<seg>s "a" and "b", the trigger is the beginning of <seg> "c". The 
accumulated scalar content is then copied and emptied out.

For the <l> element containing <seg>s "c" and "d", the routine checks 
whether the last <seg> parsed was "d" and whether the node in question 
is a text node with no next sibling. If so, it's the last text node in 
the last segment of the verse, and thus the second verse line is complete.

This last check runs into problems, however, when there is additional 
whitespace before the closing </lg> tag:

<lg> <note><span><l>
<seg n="a"> ......</seg>
<seg n="b"> ......</seg></l><l>
<seg n="c"> ......</seg>
<seg n="d"> ......</seg>
</l> </span><app>
...
...
</app> </note>
</lg> 

The newline after the closing </note> tag results in a text node with 
whitespace as its content. When the script arrives at that node, it 
logically determines that this is the final text node, that it has no 
next sibling, and that the last <seg> parsed was "d".
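
The effect can be reproduced with a small sketch, assuming the parser's default settings (blank text nodes are kept). The sample XML is a cut-down, hypothetical version of the structure above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample: a newline sits between </note> and </lg>.
my $xml = qq{<lg><note><l><seg n="d">text</seg></l></note>\n</lg>};
my $doc = XML::LibXML->new->parse_string($xml);

# With default parser settings the newline becomes the last child of <lg>.
my $last     = $doc->documentElement->lastChild;
my $is_text  = $last->nodeType == XML_TEXT_NODE ? 1 : 0;
my $is_blank = $last->data !~ /\S/              ? 1 : 0;
my $has_next = defined $last->nextSibling       ? 1 : 0;

print "text=$is_text blank=$is_blank next=$has_next\n";    # text=1 blank=1 next=0
```

The stray node passes both the "is a text node" and "has no next sibling" tests, which is why the end-of-verse check fires on it.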

I can get around this problem by testing, in addition to checking the 
"n" attribute of the last parsed <seg> and the absence of a next 
sibling, whether the text node is actually inside a <seg> 
($el->findnodes('ancestor::seg'); the <seg> need not be the direct parent).
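
A sketch of that ancestor test, assuming a hypothetical helper named in_seg() and a cut-down sample document. It shows that text nested more deeply inside a <seg> still passes, while the stray whitespace node under <lg> does not.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = qq{<lg><note><l><seg n="d">last <note>n</note>words</seg></l></note>\n</lg>};
my $doc = XML::LibXML->new->parse_string($xml);

# in_seg() is an illustrative helper: true if the node has a <seg> ancestor.
sub in_seg {
    my ($node) = @_;
    my @seg = $node->findnodes('ancestor::seg');
    return @seg ? 1 : 0;
}

my ($direct) = $doc->findnodes('//seg/text()[last()]');    # "words", a child of <seg>
my ($nested) = $doc->findnodes('//seg/note/text()');       # "n", a grandchild of <seg>
my $stray    = $doc->documentElement->lastChild;           # newline before </lg>

print in_seg($direct), in_seg($nested), in_seg($stray), "\n";    # prints "110"
```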

But still, I'm intrigued that this actually happens, and was wondering 
whether there is any way to make XML::LibXML ignore this whitespace.
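
Two possible approaches, sketched under the assumption of the same cut-down sample as above. The first uses the parser's keep_blanks setting; libxml2 decides heuristically which blank nodes count as ignorable, so the result should be checked against real data, especially with mixed content. The second keeps the parser defaults and simply skips whitespace-only text nodes during processing.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = qq{<lg><note><l><seg n="d">text</seg></l></note>\n</lg>};

# Option 1: ask the parser to drop blank text nodes it considers ignorable.
my $parser = XML::LibXML->new;
$parser->keep_blanks(0);
my $stripped = $parser->parse_string($xml);
print "last child of <lg>: ", $stripped->documentElement->lastChild->nodeName, "\n";

# Option 2: parse with defaults and filter out whitespace-only text nodes.
my $default = XML::LibXML->new->parse_string($xml);
my @texts   = grep { $_->data =~ /\S/ } $default->findnodes('//text()');
print scalar(@texts), " non-blank text node(s)\n";    # 1 non-blank text node(s)
```

Option 2 leaves the tree untouched, which may be preferable where whitespace inside <seg> is significant.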



Birgit









