This simple function constructs a Python data structure from XML in one step. Data is accessed using the Pythonic "object.attribute" notation. See the discussion below for usage examples.
<pre>
import re
import xml.sax.handler

def xml2obj(src):
    """
    A simple function to convert XML data into a native Python object.
    """
    # text types accepted as in-memory XML
    try:
        string_types = (basestring,)    # Python 2
    except NameError:
        string_types = (str, bytes)     # Python 3

    non_id_char = re.compile('[^_0-9a-zA-Z]')

    def _name_mangle(name):
        return non_id_char.sub('_', name)

    class DataNode(object):
        def __init__(self):
            self._attrs = {}    # XML attributes and child elements
            self.data = None    # child text data

        def __len__(self):
            # treat single element as a list of 1
            return 1

        def __getitem__(self, key):
            if isinstance(key, string_types):
                return self._attrs.get(key, None)
            else:
                return [self][key]

        def __contains__(self, name):
            return name in self._attrs

        def __nonzero__(self):
            return bool(self._attrs or self.data)

        __bool__ = __nonzero__  # Python 3 name for the same hook

        def __getattr__(self, name):
            if name.startswith('__'):
                # need to do this for Python special methods???
                raise AttributeError(name)
            return self._attrs.get(name, None)

        def _add_xml_attr(self, name, value):
            if name in self._attrs:
                # multiple attributes of the same name are represented by a list
                children = self._attrs[name]
                if not isinstance(children, list):
                    children = [children]
                    self._attrs[name] = children
                children.append(value)
            else:
                self._attrs[name] = value

        def __str__(self):
            return self.data or ''

        def __repr__(self):
            items = sorted(self._attrs.items())
            if self.data:
                items.append(('data', self.data))
            return u'{%s}' % ', '.join([u'%s:%s' % (k, repr(v)) for k, v in items])

    class TreeBuilder(xml.sax.handler.ContentHandler):
        def __init__(self):
            self.stack = []
            self.root = DataNode()
            self.current = self.root
            self.text_parts = []

        def startElement(self, name, attrs):
            self.stack.append((self.current, self.text_parts))
            self.current = DataNode()
            self.text_parts = []
            # xml attributes --> python attributes
            for k, v in attrs.items():
                self.current._add_xml_attr(_name_mangle(k), v)

        def endElement(self, name):
            text = ''.join(self.text_parts).strip()
            if text:
                self.current.data = text
            if self.current._attrs:
                obj = self.current
            else:
                # a text only node is simply represented by the string
                obj = text or ''
            self.current, self.text_parts = self.stack.pop()
            self.current._add_xml_attr(_name_mangle(name), obj)

        def characters(self, content):
            self.text_parts.append(content)

    builder = TreeBuilder()
    if isinstance(src, string_types):
        xml.sax.parseString(src, builder)
    else:
        xml.sax.parse(src, builder)
    # the root DataNode holds exactly one entry: the document element
    return list(builder.root._attrs.values())[0]
</pre>
Discussion
XML is a popular means of encoding data to share between systems. Despite its ubiquity, there is no straightforward way to translate XML into a Python data structure. Traditional APIs like DOM and SAX often require an undue amount of work to access the simplest piece of data.
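To make the comparison concrete, here is a rough sketch of reading a few values with the standard library's `xml.dom.minidom`. The cut-down sample document and the variable names are illustrative assumptions, not part of the recipe.

```python
import xml.dom.minidom

# Illustrative sample: a cut-down address book.
SAMPLE = """<address_book>
  <person gender='m'>
    <name>fred</name>
    <phone type='home'>54321</phone>
  </person>
</address_book>"""

doc = xml.dom.minidom.parseString(SAMPLE)

# Every lookup goes through element lists, attribute getters and text nodes.
person = doc.getElementsByTagName('person')[0]
gender = person.getAttribute('gender')
name = person.getElementsByTagName('name')[0].firstChild.data
phone = person.getElementsByTagName('phone')[0]
phone_type = phone.getAttribute('type')
phone_number = phone.firstChild.data

print(gender, name, phone_type, phone_number)
```

Each value takes a tag lookup, an index and a text-node or attribute access, which is the overhead the recipe aims to remove.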
This recipe converts XML data into a natural Pythonic data structure. For example:
<pre>
>>> SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
... <address_book>
...   <person gender='m'>
...     <name>fred</name>
...     <phone type='home'>54321</phone>
...     <phone type='cell'>12345</phone>
...     <note>"A<!-- comment --><![CDATA[ <note>]]>"</note>
...   </person>
... </address_book>
... """
>>> address_book = xml2obj(SAMPLE_XML)
>>> person = address_book.person
</pre>
To access its data, you can do the following:
<pre>
person.gender        -> 'm'     # an attribute
person['gender']     -> 'm'     # alternative dictionary syntax
person.name          -> 'fred'  # shortcut to a text node
person.phone[0].type -> 'home'  # multiple elements become a list
person.phone[0].data -> '54321' # use .data to get the text value
str(person.phone[0]) -> '54321' # alternative syntax for the text value
person[0]            -> person  # if there is only one <person>, it can still
                                # be used as if it were a list of 1 element
'address' in person  -> False   # test for existence of an attr or child
person.address       -> None    # a nonexistent element returns None
bool(person.address) -> False   # has any 'address' data (attr, child or text)
person.note          -> '"A <note>"'
</pre>
This function is inspired by David Mertz's Gnosis objectify utilities. The motivation for writing this recipe is its simplicity. With just 100 lines of code packaged into a single function, it can easily be embedded in other code for ease of distribution.


Comments
known issues.
A small nit. It should be noted that if your XML data has an attribute which is a Python keyword, this isn't going to work. For example, using "print" as an attribute is not going to work out well.
You could fix this with a little work, say, by wrapping attributes in an XMLAttr class or something similar. Or you could simply map names like "print" to Python attributes like "_print". Or you can simply accept that this is a limitation of this recipe. :-)
Overall, I think the second and third solutions are better than the first.
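The second suggestion could look roughly like this. It is a sketch only: `safe_name` is a hypothetical helper, not part of the recipe, and it combines the recipe's character substitution with a check against the standard `keyword` module. Which names count as keywords depends on the Python version (`print` is a keyword in Python 2 but not in Python 3), so the example uses `class`.

```python
import keyword
import re

_NON_ID_CHAR = re.compile('[^_0-9a-zA-Z]')

def safe_name(name):
    """Hypothetical helper: turn an XML name into a usable Python attribute name."""
    mangled = _NON_ID_CHAR.sub('_', name)   # same substitution the recipe uses
    if keyword.iskeyword(mangled):
        mangled = '_' + mangled             # e.g. 'class' -> '_class'
    return mangled

print(safe_name('class'))      # _class
print(safe_name('xml:lang'))   # xml_lang
print(safe_name('phone'))      # phone
```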
use dictionary syntax.
Support iteration. Fixed __getitem__() to better support iteration
Multiple items. One thing about this that I find concerning is the possibility of having a schema (just in the abstract sense -- some structure in mind) where some element can have multiple children of the same name, but where that number could just as easily be one. It seems like in this situation, any code that uses this recipe will have to check whether or not the value is a list every time it accesses such a structure.
Like, in your example -- the <phone> tag. If I were using this to insert into a database, I'd always want to get the phone numbers as a list, even if there were only one. (And it seems pretty silly to assume that everyone will have at least two.) Also, what about the reverse -- you're only expecting one value for some element, but it's an improperly constructed file that gives multiple. I suppose you could solve both of these with isinstance() idioms on a case-by-case basis, but it seems like that would get tedious.
Can you think of an elegant, Pythonic solution to this? Because I actually encounter this problem all the time parsing similar data structures (GET query-strings, INI-style configuration files, etc.) and I have yet to find a solution I'm completely happy with.
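One common way to avoid repeating isinstance() checks at every access is to normalize the value once at the point of use. A minimal sketch, assuming a hypothetical helper `as_list` that is not part of the recipe:

```python
def as_list(value):
    """Hypothetical helper: always return a list, whatever the recipe produced."""
    if value is None:               # element or attribute was missing
        return []
    if isinstance(value, list):     # repeated element: already a list
        return value
    return [value]                  # single node or plain text string

# Stand-in values shaped like what the recipe would return.
print(as_list(None))                  # []
print(as_list('54321'))               # ['54321']
print(as_list(['54321', '12345']))    # ['54321', '12345']
```

Code that expects exactly one value can check `len(as_list(value)) == 1` and reject anything else.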
It becomes a list of 1. Hi Adam. I hear you. That's why it has some magic to treat a single element as a list of 1. For example, there is only one person in this XML message. But you can do:
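A minimal sketch of that list-of-1 behavior, using a simplified stand-in for the recipe's DataNode. The class name `Node` and its constructor are illustrative assumptions, not the recipe's actual code; only `__len__` and `__getitem__` mirror it.

```python
class Node(object):
    """Simplified stand-in for the recipe's DataNode (illustrative only)."""

    def __init__(self, **attrs):
        self._attrs = attrs

    def __len__(self):
        return 1                    # a single element acts as a list of 1

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._attrs.get(key)
        return [self][key]          # index 0 is the node; others raise IndexError

person = Node(name='fred')

print(len(person))                      # 1
print(person[0] is person)              # True
print([p['name'] for p in person])      # ['fred']
```

Iteration works because Python falls back to calling `__getitem__` with 0, 1, 2, ... until an IndexError is raised. Note that in the recipe a text-only element is returned as a plain string, so this only applies to elements that come back as nodes.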
If you get the error: TypeError: 'DataNode' object does not support item assignment. A simple fix...
In rare cases, you may want to set an item back into the data structure.
This worked for me (add to DataNode and fix indentation problems)
def __setitem__(self, key, value):
    self._attrs[key] = value
BTW, this is one of the best xml to object mapping snippets I have found. The array handling is particularly nice.
If you are a Perl programmer looking for a Python equivalent of XML::Simple, this is the closest I have seen.