Description:
This code shows how to use the little-known LexicalHandler
interface, which is an extension to the standard SAX2 interfaces such as
ContentHandler (we assume you already have some SAX2 know-how).
Source:
import sys
from xml.sax import sax2exts, saxutils, handler
from xml.sax import SAXNotSupportedException, SAXNotRecognizedException


class EchoGenerator(saxutils.XMLGenerator):
    """Echoes a document, acting as both ContentHandler and LexicalHandler."""

    def __init__(self, out=None, encoding="iso-8859-1"):
        saxutils.XMLGenerator.__init__(self, out, encoding)
        self._in_entity = 0
        self._in_cdata = 0

    def characters(self, content):
        if self._in_entity:
            # The entity reference itself was already written by
            # startEntity, so its expansion must not be echoed again.
            return
        elif self._in_cdata:
            # Inside a CDATA section, text is written without escaping.
            self._out.write(content)
        else:
            saxutils.XMLGenerator.characters(self, content)

    # -- LexicalHandler interface

    def comment(self, content):
        self._out.write('<!--%s-->' % content)

    def startDTD(self, name, public_id, system_id):
        self._out.write('<!DOCTYPE %s' % name)
        if public_id:
            self._out.write(' PUBLIC %s %s' % (
                saxutils.quoteattr(public_id),
                saxutils.quoteattr(system_id)))
        elif system_id:
            self._out.write(' SYSTEM %s' % saxutils.quoteattr(system_id))

    def endDTD(self):
        self._out.write('>\n')

    def startEntity(self, name):
        self._out.write('&%s;' % name)
        self._in_entity = 1

    def endEntity(self, name):
        self._in_entity = 0

    def startCDATA(self):
        self._out.write('<![CDATA[')
        self._in_cdata = 1

    def endCDATA(self):
        self._out.write(']]>')
        self._in_cdata = 0


def test(xmlfile):
    # Try the drivers in order of preference; the first one available wins.
    parser = sax2exts.make_parser([
        'pirxx',
        'xml.sax.drivers2.drv_xmlproc',
        'xml.sax.drivers2.drv_pyexpat',
    ])
    print >>sys.stderr, "*** Using", parser

    # Not every driver knows these features, so ignore refusals.
    try:
        parser.setFeature(handler.feature_namespaces, 1)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        pass
    try:
        parser.setFeature(handler.feature_validation, 0)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        pass

    # The same object serves as content handler and lexical handler.
    saxhandler = EchoGenerator()
    parser.setContentHandler(saxhandler)
    parser.setProperty(handler.property_lexical_handler, saxhandler)
    parser.parse(xmlfile)


if __name__ == "__main__":
    test('books.xml')
Discussion:
In addition to the standard SAX2 events, a LexicalHandler receives
events for things in an XML document that are not usually reported by a
SAX2 parser: comments, DTD information, entities and CDATA sections.
Thus, you can get at information otherwise hidden from you, which means
a read/modify/write application can reproduce a document much more
closely to its original representation than otherwise possible with
plain SAX2. The code does just that: it parses a file and does its best
to echo it unchanged to standard output.
You can pass a LexicalHandler instance to the parser by setting the
"http://xml.org/sax/properties/lexical-handler" property, which the
handler module exposes as handler.property_lexical_handler.
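As a small illustration of receiving one of these extra events, here is a
sketch that collects only the comments of a document. It is not part of the
recipe above: it assumes a current Python whose standard library provides
xml.sax with the expat driver, and the names CommentCollector and
collect_comments are made up for this example. The handler is registered
through the lexical-handler property (handler.property_lexical_handler), and
the unused callbacks are defined as no-ops because a driver may look them up
when the handler is installed.

```python
import io
import xml.sax
from xml.sax import handler
from xml.sax import SAXNotSupportedException, SAXNotRecognizedException


class CommentCollector(handler.ContentHandler):
    """Hypothetical handler that keeps the text of every comment."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.comments = []

    def comment(self, content):
        self.comments.append(content)

    # The remaining LexicalHandler callbacks are no-ops in this sketch.
    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass

    def startEntity(self, name):
        pass

    def endEntity(self, name):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass


def collect_comments(data):
    """Parse XML given as bytes and return the list of comment strings."""
    parser = xml.sax.make_parser()
    collector = CommentCollector()
    parser.setContentHandler(collector)
    try:
        parser.setProperty(handler.property_lexical_handler, collector)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        # A driver without lexical support simply reports no comments.
        pass
    parser.parse(io.BytesIO(data))
    return collector.comments
```

Calling collect_comments(b"<!-- one --><r><!-- two --></r>") should hand
back both comments in document order, including a comment that appears
before the root element.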
Still, you lose some things, especially in the document leader (the
prolog, i.e. the part of the document before the root element), such as
the whitespace between its items and the declarations inside an internal
DTD subset. A possible improvement is thus to copy the document leader
literally from the source file to the output. This can be done by using
a SAX2 locator, which tells you, when the first startElement event
arrives, where the root element is located in the source (how precisely
depends on the parser). Using that information, you can copy the
document leader verbatim, and then append the document proper.
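The leader-copying idea can be sketched as follows. This is only an
illustration under several assumptions: it uses the standard library's
xml.sax with the expat driver, it relies on that driver reporting the
position where the start tag begins (other drivers may report the end of the
tag instead), and it treats the column as a byte offset, which only holds if
the line containing the root start tag has nothing but ASCII before it. The
names RootFinder and split_leader are hypothetical.

```python
import io
import xml.sax
from xml.sax import handler


class RootFinder(handler.ContentHandler):
    """Hypothetical helper: remembers where the first start tag was reported."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self._locator = None
        self.root_position = None  # (line, column) once the root is seen

    def setDocumentLocator(self, locator):
        self._locator = locator

    def startElement(self, name, attrs):
        # Only the first start tag (the root element) is of interest.
        if self.root_position is None and self._locator is not None:
            self.root_position = (self._locator.getLineNumber(),
                                  self._locator.getColumnNumber())


def split_leader(data):
    """Split XML bytes into (leader, document proper) at the root element."""
    finder = RootFinder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(finder)
    parser.parse(io.BytesIO(data))
    if finder.root_position is None:
        raise ValueError("parser reported no position for the root element")
    line, column = finder.root_position
    # Lines are 1-based; the column is assumed to be a 0-based byte offset.
    lines = data.splitlines(True)
    offset = sum(len(chunk) for chunk in lines[:line - 1]) + column
    return data[:offset], data[offset:]
```

The first part returned by split_leader can be written to the output as is,
and an echoing handler then only needs to start producing output at the root
element.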
My tests using Python 2.1, PyXML 0.7 (from CVS) and PIRXX 1.2 indicate
that PIRXX (i.e. Xerces/C) reports all events, xmlproc leaves out the
start/end entity ones, and pyexpat misses those too, in addition to the
start/end DTD events.
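To check what a given parser reports, a probe along the following lines may
help. It is a sketch, not the test setup used for the results above: it
assumes the standard library's xml.sax (by default the expat driver), and the
names LexicalProbe, SAMPLE and probe_lexical_events are invented for the
example. The probe parses a tiny document that contains a DTD, a comment, an
entity reference and a CDATA section, and returns the names of the lexical
events that actually arrived.

```python
import io
import xml.sax
from xml.sax import handler

# Sample document touching every kind of lexical event.
SAMPLE = (b'<!DOCTYPE r [<!ENTITY e "x">]>\n'
          b'<!-- c -->\n'
          b'<r>&e;<![CDATA[<]]></r>')


class LexicalProbe(handler.ContentHandler):
    """Hypothetical probe: notes which lexical callbacks get called."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.seen = set()

    def comment(self, content):
        self.seen.add("comment")

    def startDTD(self, name, public_id, system_id):
        self.seen.add("startDTD")

    def endDTD(self):
        self.seen.add("endDTD")

    def startEntity(self, name):
        self.seen.add("startEntity")

    def endEntity(self, name):
        self.seen.add("endEntity")

    def startCDATA(self):
        self.seen.add("startCDATA")

    def endCDATA(self):
        self.seen.add("endCDATA")


def probe_lexical_events(parser=None):
    """Return the set of lexical event names the parser delivered."""
    if parser is None:
        parser = xml.sax.make_parser()
    probe = LexicalProbe()
    parser.setContentHandler(probe)
    parser.setProperty(handler.property_lexical_handler, probe)
    parser.parse(io.BytesIO(SAMPLE))
    return probe.seen
```

Printing sorted(probe_lexical_events()) for each parser you care about shows
at a glance which events are missing. The outcome depends on the driver and
its version, so it need not match the results quoted above.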