Re: Keeping numeric entities intact when parsing and serializing?
by Petr Pajas
Oct 9 2009 4:22PM
2009/10/10 Nicolas Mendoza <mendoza@[...].no>:
> On Fri, 09 Oct 2009 23:58:53 +0200, Petr Pajas <pajas@[...].cz>
> wrote:
>
> > 2009/10/9 Nicolas Mendoza <mendoza@[...].no>:
> >>
> >> On Fri, 09 Oct 2009 22:51:32 +0200, Aristotle Pagaltzis
> >> <pagaltzis@[...].de>
> >> wrote:
> >>
> >>> * Nicolas Mendoza <mendoza@[...].no> [2009-10-09 17:55]:
> >>>>
> >>>> Is there some way to keep the entities intact when parsing and
> >>>> serializing numeric entities?
> >>>
> >>> Why would you want such a thing?
> >>>
> >>
> >> Because I'm feeding it data and I want it to come out the same way? Just
> >> like &amp; does. (I want to distinguish an incoming ' and "&#39;". So I
> >> don't want it to alter my valid XML data, basically.)
> >>
> >> Actually it's a bit surprising that libxml2 does that.
> >
> > Huh? There is absolutely nothing surprising about that, it is an XML
> > parser? Programming XML would be hell if XML parsers didn't do this.
> > From the XML point of view, ' and &#39; are the same thing! Read the
> > XML spec.
> >
> > To be fair, there are a few cases when particular formatting matters,
> > but those are (supposed to be) dealt with by XML C14N (and possibly
> > XML Encryption and XML Signature).
> >
>
> I think I can sympathize with your sentiments, but I'm not sure I can agree
> 100% that altering in-data is the optimal way of functioning.
The data is the characters, not the way they are serialized in XML.
What you want is of the same nature as asking the parser to remember
the original whitespace within XML tags, e.g. to distinguish
<foo bar="baz"/>
from
<foo
bar="baz"
/>
No widely adopted XML API can preserve this distinction, nor can it
preserve the distinction between a character and the corresponding
numeric entity. XML APIs typically exchange content, not
representation.
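
A minimal sketch of that point, using Python's standard
xml.etree.ElementTree purely as an illustration (it is not the libxml2
binding discussed in this thread, and the sample documents are made up):
both spellings of the element, and both spellings of the apostrophe,
reach the application as identical content.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in whitespace inside the tag.
compact = ET.fromstring('<foo bar="baz"/>')
spread = ET.fromstring('<foo\nbar="baz"\n/>')

# A literal apostrophe versus its numeric character reference.
literal = ET.fromstring("<a>'</a>")
numeric = ET.fromstring("<a>&#39;</a>")

# After parsing, nothing is left to tell the variants apart.
same_attributes = compact.attrib == spread.attrib
same_text = literal.text == numeric.text

print(same_attributes, same_text)
```

Once the tree is built, the API has no field that records which spelling
the input used, so a serializer has nothing to reproduce it from.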
> No matter if they are the same in theory.
What theory would that be? I'm not talking about theories, I'm talking
about the XML 1.0 spec.
> Why would it be hell if XML parsers didn't convert numeric entities to
> ASCII, UTF-8 (or whatever charset is possible/available at the time) on
> serialization?
It would be hell because it would be like having no parser at all.
Also note that parsers don't "convert" anything on serialization, but
already during parse. In fact, parsers don't serialize (serializers
do); they parse, i.e. read the input and decode the content into some
structured form, passing it via some API to a handler (possibly a
serializer or an application). Content, not the way it is encoded
(although, to be fair, some parsers do send the original as well, e.g.
via offsets to the source stream; from the implementation point of
view this may be costly, from the practical point of view it is seldom
useful).
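
As a rough illustration of both halves of that paragraph, here is a
sketch using Python's standard expat binding (an assumption made for the
example; it is not the parser discussed in this thread, and the input is
made up). The handler receives already-decoded characters, and the only
link back to the original spelling is a byte offset into the source.

```python
import xml.parsers.expat

source = b"<a>x&#39;y</a>"  # made-up sample input

chunks = []   # decoded character data as delivered to the handler
offsets = []  # byte offset into `source` where each event started

parser = xml.parsers.expat.ParserCreate()

def on_text(data):
    # The handler sees content only; the offset is the sole pointer
    # back to how that content was written in the input.
    chunks.append(data)
    offsets.append(parser.CurrentByteIndex)

parser.CharacterDataHandler = on_text
parser.Parse(source, True)

decoded = "".join(chunks)
print(decoded)
print(offsets)
```

An application that really needs the original spelling would have to
keep the source bytes around and slice them using such offsets, which is
the extra cost mentioned above.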
So to sum up: if you don't want a parser to parse your input, then
don't use one! Just process the XML as text (e.g. using regexps, a
lexer, a tokenizer, or a parser that gives you real low-level access),
since apparently you are not interested in the content of the XML
document but one particular textual representation of the content in
XML.
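
For instance, a text-level scan along these lines could find the numeric
references without decoding anything. This is only a sketch in Python
with a made-up input string and a hypothetical pattern name; it treats
the document as plain text and so ignores CDATA sections and comments.

```python
import re

raw = "<a>It&#39;s &amp; it's &#x27;quoted&#x27;</a>"  # made-up sample

# Decimal (&#39;) or hexadecimal (&#x27;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[0-9]+|x[0-9A-Fa-f]+);")

# Each match is the reference exactly as spelled in the input.
refs = NUMERIC_REF.findall(raw)
print(refs)
```

The named reference &amp; and the literal apostrophe are left alone,
because the pattern looks only at the textual form, not at the content.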
-- Petr
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs