ASPN ActiveState Programmer Network  
ActiveState, a division of Sophos
Title: Using the SAX2 LexicalHandler Interface
Submitter: Jürgen Hermann (other recipes)
Last Updated: 2001/10/31
Version no: 1.0
Category: XML

 

Description:

This code shows how to use the little-known LexicalHandler interface,
an extension to the standard SAX2 interfaces such as ContentHandler
(we assume you already have some SAX2 know-how).

Source:

# echoxml.py

import sys
from xml.sax import sax2exts, saxutils, handler
from xml.sax import SAXNotSupportedException, SAXNotRecognizedException

class EchoGenerator(saxutils.XMLGenerator):

    def __init__(self, out=None, encoding="iso-8859-1"):
        saxutils.XMLGenerator.__init__(self, out, encoding)
        self._in_entity = 0
        self._in_cdata = 0

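    def characters(self, content):
        if self._in_entity:
            # The entity reference itself was already written by
            # startEntity(), so drop its replacement text.
            return
        elif self._in_cdata:
            # Inside a CDATA section, write the text without escaping.
            self._out.write(content)
        else:
            saxutils.XMLGenerator.characters(self, content)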

    # -- LexicalHandler interface

    def comment(self, content):
        self._out.write('<!--%s-->' % content)

    def startDTD(self, name, public_id, system_id):
        self._out.write('<!DOCTYPE %s' % name)
        if public_id:
            self._out.write(' PUBLIC %s %s' % (
                saxutils.quoteattr(public_id),
                saxutils.quoteattr(system_id)))
        elif system_id:
            self._out.write(' SYSTEM %s' % saxutils.quoteattr(system_id))

    def endDTD(self):
        self._out.write('>\n')

    def startEntity(self, name):
        self._out.write('&%s;' % name)
        self._in_entity = 1

    def endEntity(self, name):
        self._in_entity = 0

    def startCDATA(self):
        self._out.write('<![CDATA[')
        self._in_cdata = 1

    def endCDATA(self):
        self._out.write(']]>')
        self._in_cdata = 0


def test(xmlfile):
    parser = sax2exts.make_parser([
        'pirxx',
        'xml.sax.drivers2.drv_xmlproc',
        'xml.sax.drivers2.drv_pyexpat',
    ])
    print >>sys.stderr, "*** Using", parser

    try:
        parser.setFeature(handler.feature_namespaces, 1)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        pass
    try:
        parser.setFeature(handler.feature_validation, 0)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        pass

    saxhandler = EchoGenerator()
    parser.setContentHandler(saxhandler)
    parser.setProperty(handler.property_lexical_handler, saxhandler)
    parser.parse(xmlfile)


if __name__ == "__main__":
    test('books.xml')

Discussion:

In addition to the standard SAX2 events, a LexicalHandler receives
events for things in an XML document that are not usually reported by a
SAX2 parser: comments, DTD information, entities and CDATA sections.
Thus, you can get at information otherwise hidden from you, which means
a read/modify/write application can reproduce a document much more
closely to its original representation than is possible with plain
SAX2. The code does just that: it parses a file and does its best to
echo it unchanged to standard output.
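
The recipe targets Python 2.1 with PyXML. As a rough sketch only, the same
idea can be tried on Python 3 with nothing but the standard library. The
names EchoGenerator3 and echo are made up for this illustration, the output
stream is assumed to be a text stream, and the default (expat-based) parser
may not report every lexical event, entity boundaries in particular.

```python
import io
import xml.sax
from xml.sax import handler, saxutils


class EchoGenerator3(saxutils.XMLGenerator):
    """Hypothetical Python 3 port of the recipe; `out` must be a text stream."""

    def __init__(self, out, encoding="utf-8"):
        super().__init__(out, encoding)
        self._echo = out        # own reference, so no XMLGenerator internals are used
        self._in_entity = False
        self._in_cdata = False

    def characters(self, content):
        if self._in_entity:
            return                          # the reference was already written
        if self._in_cdata:
            self._echo.write(content)       # CDATA text is written unescaped
        else:
            super().characters(content)

    # -- LexicalHandler callbacks

    def comment(self, content):
        self._echo.write("<!--%s-->" % content)

    def startDTD(self, name, public_id, system_id):
        self._echo.write("<!DOCTYPE %s" % name)
        if public_id:
            self._echo.write(" PUBLIC %s %s" % (
                saxutils.quoteattr(public_id),
                saxutils.quoteattr(system_id)))
        elif system_id:
            self._echo.write(" SYSTEM %s" % saxutils.quoteattr(system_id))

    def endDTD(self):
        self._echo.write(">\n")

    def startEntity(self, name):            # may never be called by expat
        self._echo.write("&%s;" % name)
        self._in_entity = True

    def endEntity(self, name):
        self._in_entity = False

    def startCDATA(self):
        self._echo.write("<![CDATA[")
        self._in_cdata = True

    def endCDATA(self):
        self._echo.write("]]>")
        self._in_cdata = False


def echo(xml_text):
    """Parse xml_text and return the echoed document as a string."""
    out = io.StringIO()
    gen = EchoGenerator3(out)
    parser = xml.sax.make_parser()
    parser.setContentHandler(gen)
    parser.setProperty(handler.property_lexical_handler, gen)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

Calling echo('<doc><!-- note --><![CDATA[a < b]]></doc>') should give back
the comment and the CDATA section as written, preceded by the XML
declaration that XMLGenerator always emits.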

You pass a LexicalHandler instance to the parser by setting the
"http://xml.org/sax/properties/lexical-handler" property, which the
code above refers to as handler.property_lexical_handler.
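
A parser that does not know this property raises an exception from
setProperty. The recipe guards its setFeature calls but not the
setProperty call, so here is a small sketch of a guarded version, written
for Python 3 and the standard library. The helper name
attach_lexical_handler is an assumption made for this example.

```python
import xml.sax
from xml.sax import handler
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException


def attach_lexical_handler(parser, lexical_handler):
    """Try to register lexical_handler; return True if the parser accepted it."""
    try:
        parser.setProperty(handler.property_lexical_handler, lexical_handler)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        # The driver has no lexical-handler support; the caller can still
        # parse, it just will not see comments, CDATA or DTD events.
        return False
    return True
```

A caller can use the return value to decide whether to fall back to a
plain ContentHandler-only echo.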

Still, you lose some things, especially in the document leader (the part
of the document before the root element, also known as the prolog). A
possible improvement is thus to copy the document leader literally from
the source file to the output. This can be done by using a SAX2 locator,
which tells you, when the startElement event for the root element
arrives, where that element is located in the source (depending on the
parser, the position reported is the start or the end of the start tag).
Using that information, you can copy the document leader verbatim, and
then append the document proper.
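
The locator idea can be sketched as follows, for Python 3 and the
standard library. RootLocator and split_leader are hypothetical names
introduced here. The sketch assumes the parser reports the position of
the start of the root tag with a 0-based column, which is what the
expat-based default parser appears to do, and that lines end in "\n" or
"\r\n". Other drivers may need an adjustment.

```python
import xml.sax
from xml.sax import handler


class RootLocator(handler.ContentHandler):
    """Remember where the first (root) element starts."""

    def __init__(self):
        super().__init__()
        self._locator = None
        self.root_pos = None            # (line, column) once known

    def setDocumentLocator(self, locator):
        self._locator = locator

    def startElement(self, name, attrs):
        # Only the first startElement event belongs to the root element.
        if self.root_pos is None and self._locator is not None:
            self.root_pos = (self._locator.getLineNumber(),
                             self._locator.getColumnNumber())


def split_leader(text):
    """Return (leader, rest), where rest begins at the root element."""
    finder = RootLocator()
    xml.sax.parseString(text, finder)
    line, col = finder.root_pos
    lines = text.split("\n")
    # Character offset of the root tag: all earlier lines plus their
    # newline characters, plus the column within the root's line.
    offset = sum(len(l) + 1 for l in lines[:line - 1]) + col
    return text[:offset], text[offset:]
```

The leader returned by split_leader can be written to the output as is,
followed by the echo of the document proper with its own XML declaration
and DOCTYPE output suppressed.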

My tests using Python 2.1, PyXML 0.7 (from CVS) and PIRXX 1.2 indicate
that PIRXX (i.e. Xerces/C) reports all events; xmlproc leaves out the
start/end entity events; and pyexpat misses those too, as well as the
start/end DTD events.
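
To repeat this kind of comparison with whatever parser you have at hand,
a small probe that counts the lexical callbacks can help. This is a
sketch for Python 3 and the standard library. LexicalProbe and
probe_events are names invented for the example, and the counts you get
depend entirely on the parser and version in use.

```python
import io
import xml.sax
from collections import Counter
from xml.sax import handler


class LexicalProbe:
    """Count how often each LexicalHandler callback is invoked."""

    def __init__(self):
        self.seen = Counter()

    def comment(self, content):
        self.seen["comment"] += 1

    def startDTD(self, name, public_id, system_id):
        self.seen["startDTD"] += 1

    def endDTD(self):
        self.seen["endDTD"] += 1

    def startEntity(self, name):
        self.seen["startEntity"] += 1

    def endEntity(self, name):
        self.seen["endEntity"] += 1

    def startCDATA(self):
        self.seen["startCDATA"] += 1

    def endCDATA(self):
        self.seen["endCDATA"] += 1


def probe_events(xml_bytes, parser=None):
    """Parse xml_bytes and return a Counter of the lexical events reported."""
    if parser is None:
        parser = xml.sax.make_parser()      # default driver, normally expat
    probe = LexicalProbe()
    parser.setContentHandler(handler.ContentHandler())
    parser.setProperty(handler.property_lexical_handler, probe)
    parser.parse(io.BytesIO(xml_bytes))
    return probe.seen
```

Feeding it a document that contains a DOCTYPE with an internal entity, a
comment, a CDATA section and an entity reference shows at a glance which
of the seven callbacks a given driver actually fires.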



© 2006 ActiveState Software Inc. All rights reserved.