RE: A whitespace issue in XML::LibXML
by Mark - BLS CTR Thomas
Jul 20 2007 9:23AM
Can you explain what you want a little more clearly? "Needs to be
isolated" isn't much to go on. Example desired output for the given
input would be nice.
- Mark.
> -----Original Message-----
> From: perl-xml-bounces@[...].com [mailto:perl-xml-
> bounces@[...].com] On Behalf Of Birgit Kellner
> Sent: Friday, July 20, 2007 11:35 AM
> To: perl-xml@[...].com
> Subject: A whitespace issue in XML::LibXML
>
> Here's a follow-up to my question from yesterday about parsing footnotes
> with XML::LibXML.
>
> This time, I'm parsing verse lines, so it's a different situation, but
> the replies to yesterday's query helped me to build a string out of the
> text nodes in a mixed-content node that is part of a larger structure.
>
> I'm recursively parsing from a higher level all the way down, always
> passing a node to a subroutine. If it is a text node, the subroutine
> appends its data content to a scalar and returns; if the node has
> children, the subroutine is called again for each of them.
>
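The recursive walk described above might look roughly like the sketch below. The subroutine name collect_text and the sample markup are made up for illustration, not taken from the original script.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: append the data of every text node at or below
# $node to the scalar that $buf refers to.
sub collect_text {
    my ($node, $buf) = @_;
    if ($node->nodeType == XML_TEXT_NODE) {
        $$buf .= $node->data;    # text node: append its content and stop
        return;
    }
    # Any other node: recurse into each child in document order.
    collect_text($_, $buf) for $node->childNodes;
}

# Assumed sample input: a verse line with mixed content inside <seg>.
my $doc = XML::LibXML->load_xml(
    string => '<l><seg n="a">one <note>x</note></seg><seg n="b">two</seg></l>',
);
my $text = '';
collect_text($doc->documentElement, \$text);
print "$text\n";
```

Because the walk descends into every child, text inside a nested <note> ends up in the scalar too, unless the subroutine skips such elements.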
> This is the markup (minus a few attributes):
>
> <lg><note><span><l>
> <seg n="a">......</seg>
> <seg n="b">......</seg></l><l>
> <seg n="c">......</seg>
> <seg n="d">......</seg>
> </l></span><app>
> ...
> ...
> </app></note></lg>
>
> The text nodes are inside the <seg>-elements, which may also contain
> further <note>-elements, and whatnot.
>
> The <l>-elements are verse lines, and their data content needs to be
> isolated.
> This is done in two different ways: for the <l>-element containing
> <seg>s "a" and "b", the trigger is the beginning of <seg> "c". The
> scalar content is then copied and emptied out.
>
> For the <l>-element containing <seg>s "c" and "d", the routine checks
> whether the last <seg> parsed was "d" and whether the node in question
> is a text node with no next sibling. If so, it is the last text node in
> the last segment of the verse, and thus the second verse line is
> complete.
>
> This last check runs into problems, however, when there is additional
> whitespace before the closing </lg>-tag:
>
> <lg><note><span><l>
> <seg n="a">......</seg>
> <seg n="b">......</seg></l><l>
> <seg n="c">......</seg>
> <seg n="d">......</seg>
> </l></span><app>
> ...
> ...
> </app></note>
> </lg>
>
> The newline after the closing </note> tag results in a text node with
> whitespace as its content. When the script arrives at that node, it
> determines, correctly by its own logic, that this is a text node with
> no next sibling and that the last <seg> parsed was "d".
>
> I can get around this problem by testing, in addition to checking the
> "n"-attribute of the last parsed <seg> and the absence of a next
> sibling, whether the text node is actually inside a <seg>
> ($el->findnodes('ancestor::seg'); it need not be a direct parent).
>
> But still, I'm intrigued that this actually happens, and was wondering
> whether there is any way to make XML::LibXML ignore this whitespace.
>
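A small sketch of the effect and of two ways around it, assuming a cut-down version of the markup above. By default the parser keeps whitespace-only text nodes; the no_blanks parser option (or the older $parser->keep_blanks(0) accessor) asks libxml2 to drop the ones it considers ignorable. That decision is a heuristic, so it is worth checking against real data, especially with mixed content.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed, cut-down sample: note the newline before </lg>.
my $xml = qq{<lg><note><l><seg n="d">text</seg></l></note>\n</lg>};

# Default parse: the newline survives as a whitespace-only text node.
my $doc  = XML::LibXML->load_xml(string => $xml);
my $last = $doc->documentElement->lastChild;
my $is_blank_text =
    ($last->nodeType == XML_TEXT_NODE && $last->data !~ /\S/) ? 1 : 0;

# The ancestor test from the message: this node is not inside any <seg>.
my @seg_ancestors = $last->findnodes('ancestor::seg');
my $inside_seg    = @seg_ancestors ? 1 : 0;
print "blank text node: $is_blank_text, inside <seg>: $inside_seg\n";

# Parse again, asking libxml2 to discard ignorable blank nodes.
# Which nodes count as ignorable is decided heuristically by libxml2.
my $stripped  = XML::LibXML->load_xml(string => $xml, no_blanks => 1);
my $last_name = $stripped->documentElement->lastChild->nodeName;
print "last child of <lg> with no_blanks: $last_name\n";
```

The ancestor test has the advantage of not depending on parser settings, so it also holds up when the whitespace has to be preserved for other reasons.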
> Birgit
>
> _______________________________________________
> Perl-XML mailing list
> Perl-XML@[...].com
> To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs