Re: libxml and (X)HTML documents
by Christian Glahn
Jul 11 2002 8:48AM
On Wed, Jul 10, 2002 at 11:50:47PM -0400, Aaron Straup Cope wrote:
> Hi all,
>
> Can someone help me understand how exactly libxml deals with HTML file
> and, more specifically, XHTML files?
libxml2 has a special parser that is able to deal with HTML tags that
are never closed. this parser differs from the standard parser by
recognizing tags such as <p> and <img> that are commonly opened but
never closed. XML::LibXML provides an interface to this parser
through the parse_html_* functions.
XHTML is a bit different, since it is XML with a language binding.
because of that you should parse XHTML with the common parse_*
functions, so that errors in the document are reported. because of the
nature of XHTML, libxml2 does not provide any special interface to parse
those files (as far as i know).
so basically libxml2 uses the same parser for XML and HTML data, where
the HTML parser is just a special case of the more general implementation
of the XML parser.
> I can understand treating HTML files as "special" but it appears that
> XHTML files are lumped in with the bad apples even though there isn't any
> reason for them to be.
to parse XHTML files, strings, handles ... use XML::LibXML's parse_file(),
parse_string() or parse_fh() instead of their parse_html_* relatives.
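
a minimal sketch of the difference, assuming XML::LibXML is installed. the
sample documents are made up for illustration: well-formed XHTML goes
through the regular XML parser, while tag soup with unclosed <p> and <br>
is rejected by it and only accepted by the HTML parser.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Well-formed XHTML (illustrative sample): use the regular XML parser.
my $xhtml = <<'EOT';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>test</title></head>
  <body><p>hello<br/>world</p></body>
</html>
EOT
my $doc = $parser->parse_string($xhtml);
print "XML parser: root element is <", $doc->documentElement->nodeName, ">\n";

# Tag soup (illustrative sample): <p> and <br> are opened but never closed.
my $html = '<html><body><p>hello<br>world</body></html>';

# The XML parser dies on input that is not well-formed, so trap the error.
my $ok = eval { $parser->parse_string($html); 1 };
print "XML parser on tag soup: ", ($ok ? "accepted" : "rejected"), "\n";

# The HTML parser tolerates the unclosed tags.
my $hdoc = $parser->parse_html_string($html);
print "HTML parser: root element is <", $hdoc->documentElement->nodeName, ">\n";
```

parsing XHTML with parse_string() means a broken document is reported as an
error instead of being silently repaired.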
> If it's just another thing on the 'to-do' list then I can deal. But, I've
> had to jump through all kinds of hoops (see below) to get all the widgets
> used by, and including, XML::Filter::XSLT to munge one XHTML document into
> another in a SAX context.
from what i can see you do a bit too much work, but see below :)
> It's done so I'm happy enough but it seems completely nuts to have to go
> these lengths.
>
> Thanks,
>
> # in package Aaron::XML::Filter::XSLT
>
> sub end_document {
> my $self = shift;
>
> # because "IMA" XML::Filter::XSLT so calling
> # SUPER would make bad things happen
>
> my $dom = $self->XML::LibXML::SAX::Builder::end_document(@_);
ok, you get an XML::LibXML::Document here.
>
> # Gah! In a plain old XML::LibXSLT situation I can
> # call parse_html_file, but since ::SAX::Builder calls
> # $obj->createDocument() there doesn't seem to be anything
> # else but to do the following...
>
> my $parser = XML::LibXML->new();
because you already have a document, the parse below is unnecessary.
the different document types XML_DOCUMENT_NODE and XML_HTML_DOCUMENT_NODE
basically only matter for data output. i assume you don't really need
a separate parse step here.
*IMPORTANT*
i think this extra parse causes your headaches. so try to avoid it.
> $dom = $parser->parse_html_string($dom->toString());
>
> my $xslt = XML::LibXSLT->new();
> my $stylesheet = $xslt->parse_stylesheet($self->{StylesheetDOM});
for XSLT params you should remember to quote them for XSLT, but you may
have done so already.
> my $results = $stylesheet->transform($dom,((ref($self->{'__params'})
> eq "ARRAY") ? @{$self->{'__params'}} : ()));
>
> # see earlier note to list on same subject [1]
> # this subclass basically does the following :
> # "You say HTML_DOCUMENT, I say X(HT)ML_DOCUMENT"
the node type of the result document depends on the output type of the
XSL itself. in their core document structure the two types don't differ, so
you don't need to bother. especially since the current version of libxslt
doesn't support the output type xhtml.
> my $parser = Aaron::XML::LibXML::SAX::Parser->new(%$self);
if you use the following function as shipped by XML::LibXML, this should
work with XML as well as HTML documents. therefore the SAX generation should
work fine.
> $parser->generate($results);
> }
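
a rough sketch of the transform step without the extra parse, assuming
XML::LibXML and XML::LibXSLT are installed. the sample document, the
stylesheet and the "greeting" parameter are hypothetical stand-ins, not your
real data. string parameters are quoted with XML::LibXSLT::xpath_to_string()
so that XSLT treats them as string literals.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();

# Stand-in for the XML::LibXML::Document that the SAX builder returns.
my $dom = $parser->parse_string(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hi</p></body></html>'
);

# Hypothetical stylesheet that just echoes a string parameter.
my $style_dom = $parser->parse_string(<<'EOT');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="greeting"/>
  <xsl:output method="xml"/>
  <xsl:template match="/">
    <out><xsl:value-of select="$greeting"/></out>
  </xsl:template>
</xsl:stylesheet>
EOT

my $xslt       = XML::LibXSLT->new();
my $stylesheet = $xslt->parse_stylesheet($style_dom);

# Quote plain strings so they are passed as XPath string literals.
my %params = XML::LibXSLT::xpath_to_string(greeting => 'hello world');

# Transform the existing DOM directly; no toString()/re-parse round trip.
my $results = $stylesheet->transform($dom, %params);

print $stylesheet->output_string($results);
```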
i hope this helps you a bit.
christian