A whitespace issue in XML::LibXML
by Birgit Kellner
Jul 20 2007 8:36AM
Here's a follow-up to my question from yesterday about parsing footnotes
with XML::LibXML.
This time, I'm parsing verse lines, so it's a different situation, but
the replies to yesterday's query helped me to build a string out of the
text nodes in a mixed-content node that is part of a larger structure.
I'm recursively parsing from a higher level all the way down, always
passing a node to a subroutine. If it is a text node, the subroutine
appends its data content to a scalar and ends; if the node has children,
the subroutine is called again.
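
A minimal sketch of the recursive walk just described, under stated
assumptions: the sample XML, the scalar $text and the subroutine name
collect_text are made up for illustration and are not taken from the
actual script.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input; element names follow the post, the text is invented.
my $xml = '<l><seg n="a">first <note>aside</note></seg><seg n="b">second</seg></l>';
my $doc = XML::LibXML->load_xml(string => $xml);

my $text = '';   # scalar that accumulates the data of every text node

# A text node appends its data and the call ends; any other node
# passes each of its children back to the same subroutine.
sub collect_text {
    my ($node) = @_;
    if ($node->nodeType == XML_TEXT_NODE) {
        $text .= $node->data;
        return;
    }
    collect_text($_) for $node->childNodes;
}

collect_text($doc->documentElement);
print "$text\n";
```

Because the walk descends into every child, text inside nested elements
such as <note> ends up in the same scalar, in document order.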
This is the markup (minus a few attributes):
<lg> <note><span><l>
<seg n="a"> ......</seg>
<seg n="b"> ......</seg></l><l>
<seg n="c"> ......</seg>
<seg n="d"> ......</seg>
</l> </span><app>
...
...
</app> </note></lg>
The text nodes are inside the <seg> elements, which may also contain
further <note> elements, and whatnot.
The <l> elements are verse lines, and their data content needs to be
isolated.
This is done in two different ways: for the <l> element containing
<seg>s "a" and "b", the trigger is the beginning of <seg> "c". The
scalar content is then copied and emptied out.
For the <l> element containing <seg>s "c" and "d", the routine checks
that the last <seg> parsed was "d", that the node in question is a text
node, and that it has no next sibling. This means it's the last text
node in the last segment of the verse, and thus the second verse line is
complete.
This last check runs into problems, however, when there is additional
whitespace before the closing </lg> tag:
<lg> <note><span><l>
<seg n="a"> ......</seg>
<seg n="b"> ......</seg></l><l>
<seg n="c"> ......</seg>
<seg n="d"> ......</seg>
</l> </span><app>
...
...
</app> </note>
</lg>
The newline after the closing </note> tag results in a text node with
whitespace as its content. When the script arrives at that node, it
finds a text node with no next sibling while the last <seg> parsed was
"d", so the end-of-verse check succeeds for it, even though it lies
outside the verse.
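
The effect can be reproduced in isolation. This is only a sketch on a
cut-down, hypothetical version of the structure above (the variable
names are mine), parsed with the parser's default settings:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, cut-down input with a newline between </note> and </lg>.
my $xml = qq{<lg><note><l><seg n="d">text</seg></l></note>\n</lg>};
my $doc = XML::LibXML->load_xml(string => $xml);

# The last child of <lg> is the node created by that newline.
my $last = $doc->documentElement->lastChild;

my $is_text     = $last->nodeType == XML_TEXT_NODE ? 1 : 0;
my $is_blank    = $last->data =~ /\A\s+\z/         ? 1 : 0;
my $has_sibling = defined $last->nextSibling       ? 1 : 0;

print "text node: $is_text, whitespace only: $is_blank, ",
      "next sibling: $has_sibling\n";
```

The last child of <lg> comes back as a whitespace-only text node
without a next sibling, which is exactly the shape the end-of-verse
test is looking for.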
I can get around this problem by testing, in addition to checking the
"n" attribute of the last parsed <seg> and the absence of a next
sibling, whether the text node is actually inside a <seg>
($el->findnodes('ancestor::seg'); the <seg> need not be the direct
parent).
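
Roughly, the combined test could look like the sketch below. The
subroutine name line_complete, the $last_seg argument, the <hi> element
and the sample XML are assumptions for illustration only; the real
script keeps track of the last parsed <seg> in its own way.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: a text node nested below <seg n="d">, plus the
# stray newline before </lg>.
my $xml = qq{<lg><note><l><seg n="d">verse <hi>text</hi></seg></l></note>\n</lg>};
my $doc = XML::LibXML->load_xml(string => $xml);

# Returns 1 only if all four conditions hold for the node $el.
sub line_complete {
    my ($el, $last_seg) = @_;
    return 0 unless $last_seg eq 'd';                # last <seg> parsed was "d"
    return 0 unless $el->nodeType == XML_TEXT_NODE;  # node is a text node
    return 0 if defined $el->nextSibling;            # no next sibling
    my @seg = $el->findnodes('ancestor::seg');       # <seg> anywhere above it
    return @seg ? 1 : 0;
}

my ($nested) = $doc->findnodes('//hi/text()');       # text below <seg>, not a direct child
my $trailing = $doc->documentElement->lastChild;     # whitespace before </lg>

print "nested text:   ", line_complete($nested,   'd'), "\n";
print "trailing text: ", line_complete($trailing, 'd'), "\n";
```

Using the ancestor axis rather than looking at the parent means a text
node wrapped in further markup inside the <seg> still counts, while the
whitespace node directly under <lg> is rejected.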
But still, I'm intrigued that this actually happens, and was wondering
whether there is any way to make XML::LibXML ignore this whitespace.
Birgit
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs