Re: A whitespace issue in XML::LibXML
by Petr Pajas
Jul 20 2007 9:26AM
On Friday 20 July 2007 17:34, Birgit Kellner wrote:
> Here's a follow-up to my question from yesterday about parsing
> footnotes with XML::LibXML.
>
> This time, I'm parsing verse lines, so it's a different
> situation, but the replies to yesterday's query helped me to
> build a string out of the text nodes in a mixed-content node that
> is part of a larger structure.
First of all, why do you do that "by hand"? To get all text nodes
from a subtree nicely concatenated, you can use e.g.

  $text = $node->findvalue('string(.)');

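As a minimal sketch of that suggestion: the sample XML and the variable names below are made up for illustration. Note that string(.) also picks up text inside nested <note> elements, so if those must be left out, one option (an assumption about what is wanted here, not something stated in the original question) is to select the text nodes with an XPath filter and join them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: one verse line with a nested <note> inside a <seg>.
my $xml = '<l><seg n="a">foo <note>x</note></seg><seg n="b">bar</seg></l>';
my $l   = XML::LibXML->new->parse_string($xml)->documentElement;

# All descendant text nodes concatenated in document order.
my $all_text = $l->findvalue('string(.)');

# textContent gives the same concatenation without going through XPath.
my $same = $l->textContent;

# Variant: skip any text that sits somewhere inside a <note>.
my $no_notes = join '',
    map { $_->nodeValue } $l->findnodes('.//text()[not(ancestor::note)]');

print "$all_text\n$no_notes\n";
```

With this approach there is no need to track which text node is the last one in a verse line; each <l> element can be processed on its own.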
> I'm recursively parsing from a higher level all the way down,
> always passing a node to a subroutine. If it is a text node, the
> subroutine appends its data content to a scalar and ends; if the
> node has children, the subroutine is called again.
>
> This is the code (minus a few attributes):
>
> <lg><note><span><l>
> <seg n="a">......</seg>
> <seg n="b">......</seg></l><l>
> <seg n="c">......</seg>
> <seg n="d">......</seg>
> </l></span><app>
> ...
> ...
> </app></note></lg>
>
> The text nodes are inside the <seg>-elements, which may also
> contain further <note>-elements, and whatnot.
>
> The <l>-elements are verse lines, and their data content needs to
> be isolated.
> This is done in two different ways: for the <l>-element
> containing <seg>s "a" and "b", the trigger is the beginning of
> <seg> "c". The scalar content is then copied and emptied out.
>
> For the <l>-element containing <seg>s "c" and "d", the routine
> checks if the last <seg> parsed was "d", if the node in question
> is a text node and has no next sibling. This means it's the last
> text node in the last segment of the verse, and thus the second
> verse line is complete.
>
> This last check runs into problems, however, when there is
> additional whitespace before the closing </lg>-tag:
>
> <lg><note><span><l>
> <seg n="a">......</seg>
> <seg n="b">......</seg></l><l>
> <seg n="c">......</seg>
> <seg n="d">......</seg>
> </l></span><app>
> ...
> ...
> </app></note>
> </lg>
>
> The newline after the closing </note> tag results in a text node
> with whitespace as its content. When the script arrives at that
> node, it logically determines that this is the final text node,
> has no siblings, and that the last <seg> parsed was "d".
>
> I can get around this problem by testing, in addition to checking
> on the "n"-attribute of the last parsed <seg> and the absence of
> a next sibling, whether the text node is actually inside <seg>
> ($el->findnodes('ancestor::seg') - need not necessarily be a
> parent).
>
> But still, I'm intrigued that this actually happens, and was
> wondering whether there is any way to make XML::LibXML ignore
> this whitespace.
>
> Birgit
>
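The workaround described in the quoted message might look roughly like the sketch below. The sample document and the helper name inside_seg are hypothetical; only the ancestor::seg test comes from the message itself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample with a newline before </l> and before </lg>.
my $xml = qq{<lg><note><span><l><seg n="c">foo</seg>\n}
        . qq{<seg n="d">bar</seg>\n</l></span></note>\n</lg>};
my $doc = XML::LibXML->new->parse_string($xml);

# True if the node has a <seg> ancestor (not necessarily its parent).
sub inside_seg {
    my ($node) = @_;
    my @ancestors = $node->findnodes('ancestor::seg');
    return scalar @ancestors;
}

# Text nodes without a next sibling, split by whether they are inside a <seg>.
my (@in_seg, @stray);
for my $t ($doc->findnodes('//text()')) {
    next if $t->nextSibling;
    if (inside_seg($t)) { push @in_seg, $t } else { push @stray, $t }
}

printf "inside seg: %d, stray: %d\n", scalar @in_seg, scalar @stray;
```

The stray nodes here are the whitespace-only text nodes that sit between closing tags; they would otherwise pass a bare "text node with no next sibling" check.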
Without a DTD or schema, there is no reasonable way to distinguish
ignorable whitespace from non-ignorable whitespace in XML. You can
try setting $parser->keep_blanks(0) before you parse the XML into
memory and see if it solves your problem.
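A small sketch of the difference, using a made-up document shaped like the one in the question (the XML string and variable names are illustrative only). It parses the same input twice, once with the default settings and once with keep_blanks(0), and compares the children of <lg>.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: a newline sits between </note> and </lg>.
my $xml = qq{<lg><note><l><seg n="d">foo</seg></l></note>\n</lg>};

# Default parser: whitespace-only text nodes are kept.
my $keep    = XML::LibXML->new;
my @default = $keep->parse_string($xml)->documentElement->childNodes;

# keep_blanks(0) must be set before parsing.
my $strip = XML::LibXML->new;
$strip->keep_blanks(0);
my @stripped = $strip->parse_string($xml)->documentElement->childNodes;

printf "default: %d child node(s), keep_blanks(0): %d child node(s)\n",
    scalar @default, scalar @stripped;
```

Since the parser has no DTD to consult, it has to guess which blank nodes are ignorable, so it is worth checking that whitespace inside mixed content (for example between two inline elements within a <seg>) still comes out the way you need.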
-- Petr
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs