Re: [Tutor] parsing--is this right?
by Danny Yoo
Jun 11 2002 9:06PM
> Okay, so from here how would you change this text? I would want it to
> look like this:
>
> <footnote> <emph>an italicized word</emph> <emph>maybe another
> italicized word</emph> text </footnote>
Ah! Let's say that the parsing went perfectly, and somehow we have a
function that transforms the string:
###
>>> EXAMPLE_TEXT = r"""{\footnote {\i an italicized word} {\i
... maybe another italicized word} text }""".replace('\n', ' ')
>>> parsed_doc = parse(tokenize(EXAMPLE_TEXT))
>>> parsed_doc
['\\footnote', ['\\i', 'an', 'italicized', 'word'], ['\\i', 'maybe',
'another', 'italicized', 'word'], 'text']
###
In our parsed document, each command is the first element in a list.
This representation is nice because it mirrors the nested nature of the
tags: lists can contain inner lists.
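For anyone who wants something to run, here is a minimal sketch of what tokenize() and parse() could look like. These are stand-ins written for illustration in Python 3, not necessarily the versions from earlier in this thread; they assume braces are always balanced and that the document is a single brace group.

```python
import re

def tokenize(text):
    # Assumption: braces are tokens of their own, and everything else
    # is split on whitespace (so whitespace itself is discarded).
    return re.findall(r"[{}]|[^\s{}]+", text)

def parse(tokens):
    # Recursive-descent sketch: each "{ ... }" group becomes a list,
    # and nested groups become nested lists.
    def parse_group(pos):
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "{":
                inner, pos = parse_group(pos + 1)
                items.append(inner)
            elif tok == "}":
                return items, pos + 1
            else:
                items.append(tok)
                pos += 1
        return items, pos

    items, _ = parse_group(0)
    # A document that is one brace group is returned as that group.
    if len(items) == 1 and isinstance(items[0], list):
        return items[0]
    return items

EXAMPLE_TEXT = r"""{\footnote {\i an italicized word} {\i
maybe another italicized word} text }""".replace('\n', ' ')
print(parse(tokenize(EXAMPLE_TEXT)))
```

Because this tokenizer throws whitespace away, the words inside each group come back as separate list elements with no spacing information attached.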
How do we get from this parsed form of EXAMPLE_TEXT to a marked-up
string? We might want to add XML tags whenever we see a list. Here's an
initial attempt to do this:
###
def toXML(structure):
    if type(structure) == type([]):
        tag = structure[0][1:]  ## slice removes the leading '\'
        text_pieces = [str(s) for s in structure[1:]]
        return "<%(tag)s>%(text)s</%(tag)s>" % { 'tag' : tag,
                                                 'text' : ''.join(text_pieces) }
    else:
        return str(structure)
###
Let's see how it works:
###
>>> print(toXML(parsed_doc))
<footnote>['\\i', 'an', 'italicized', 'word']['\\i', 'maybe', 'another', 'italicized', 'word']text</footnote>
###
Almost. The only problem is that it transformed only the very outer layer
of this onion, but we want to permeate the whole structure with tags.
Although we hate to hurt its feelings, we have to tell the truth: toXML()
is "shallow".
The section that transformed the stuff in between the tags was the
statement:
    text_pieces = [str(s) for s in structure[1:]]
And this is the source of the shallowness: we're just calling str(). But
instead of directly calling str() on each piece in between, the trick is
to apply the transformation itself again to each inner piece! That way,
we guarantee that the inner lists are also transformed properly, however
deeply they nest:
###
>>> def toXMLDeeply(structure):
...     if type(structure) == type([]):
...         tag = structure[0][1:]  ## slice removes the leading '\'
...         text_pieces = [toXMLDeeply(s) for s in structure[1:]]
...         return "<%(tag)s>%(text)s</%(tag)s>" % {
...             'tag' : tag,
...             'text' : ''.join(text_pieces) }
...     else:
...         return str(structure)
...
>>> print(toXMLDeeply(parsed_doc))
<footnote><i>anitalicizedword</i><i>maybeanotheritalicizedword</i>text</footnote>
###
The transformation wasn't perfect, because my parsing step had wiped out
whitespace in my parsed_doc, so that needs to be fixed. Still, it almost
works. *grin*
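Until the parser keeps its whitespace, one possible workaround is to put a single space back between sibling pieces while building the string. The sketch below (Python 3) does that, and also renames \i to emph so the result looks more like the output you asked for. The names to_xml_spaced and TAG_NAMES are made up for this example, and the one-space-between-siblings rule is an assumption: it does not recover the original spacing, it only guesses at it.

```python
# Hypothetical mapping from command names to XML tag names.
TAG_NAMES = {"i": "emph"}

def to_xml_spaced(structure):
    if isinstance(structure, list):
        command = structure[0][1:]             # drop the leading backslash
        tag = TAG_NAMES.get(command, command)  # unknown commands keep their name
        pieces = [to_xml_spaced(s) for s in structure[1:]]
        # Assumption: siblings were separated by exactly one space.
        return "<%s>%s</%s>" % (tag, " ".join(pieces), tag)
    return str(structure)

parsed_doc = ['\\footnote',
              ['\\i', 'an', 'italicized', 'word'],
              ['\\i', 'maybe', 'another', 'italicized', 'word'],
              'text']
print(to_xml_spaced(parsed_doc))
# <footnote><emph>an italicized word</emph> <emph>maybe another italicized word</emph> text</footnote>
```

This differs from your target only in the spaces just inside the footnote tags. A tokenizer that preserves whitespace would be the more faithful fix.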
This is another example of a recursive function, so it might seem like a
weird little function at first.
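If the recursion feels slippery, it can help to feed both approaches a structure that nests one level deeper than the footnote example. The sketch below (Python 3) uses a made-up input with a \b group containing an \i group; to_xml_shallow and to_xml_deep are illustrative names for the two strategies discussed above, not functions from this thread.

```python
def to_xml_shallow(structure):
    # Only the outermost list is tagged; inner lists are left to str().
    if isinstance(structure, list):
        tag = structure[0][1:]
        return "<%s>%s</%s>" % (tag, "".join(str(s) for s in structure[1:]), tag)
    return str(structure)

def to_xml_deep(structure):
    # The function calls itself on every piece, so each inner list is
    # tagged too, no matter how deeply it is nested.
    if isinstance(structure, list):
        tag = structure[0][1:]
        return "<%s>%s</%s>" % (tag, "".join(to_xml_deep(s) for s in structure[1:]), tag)
    return str(structure)

# Hypothetical input: a group nested inside another nested group.
nested = ['\\footnote', ['\\b', 'bold', ['\\i', 'and', 'italic']], 'text']

print(to_xml_shallow(nested))  # inner lists show up as raw Python lists
print(to_xml_deep(nested))
# <footnote><b>bold<i>anditalic</i></b>text</footnote>
```

The plain strings are the base case: they come back unchanged, which is what lets the recursion stop.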
I have to go at the moment, but I'll try looking at your other question
later. Talk to you soon!