Recipe 65128: Extract text from XML document


People often ask how to extract the text from an XML document. This small program does it.

Python
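import sys
import xml.sax
from xml.sax.handler import ContentHandler

class TextHandler(ContentHandler):
    """Write every run of character data to stdout, ignoring all markup."""
    def characters(self, ch):
        # ch is already a text string. Encoding it to Latin-1 fails for any
        # character outside Latin-1, and on Python 3 stdout expects str.
        sys.stdout.write(ch)

parser = xml.sax.make_parser()
parser.setContentHandler(TextHandler())
parser.parse("test.xml")  # path of the XML document to read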

Discussion

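Stripping the tags from an XML document is useful when you want to re-key it or run it through a spell checker. The program works with any well-formed XML document. It is also efficient: SAX streams the input and hands each run of text to the handler as it is read, so no document tree is built in memory. If the document isn't well-formed, the parser raises an error; in that case you could try a solution based on the XML lexer described in another recipe called "XML lexing".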
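
For uses such as spell checking, it is often handier to get the text back as a string than to print it. The following is a minimal sketch of that variation, assuming Python 3.5 or later; the names `TextCollector` and `extract_text` are illustrative and not part of the original recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Accumulate character data instead of printing it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so collect them all.
        self.chunks.append(content)


def extract_text(xml_source):
    """Return the concatenated text of a well-formed XML string or bytes object."""
    handler = TextCollector()
    # parseString raises xml.sax.SAXParseException if the input is not well-formed.
    xml.sax.parseString(xml_source, handler)
    return "".join(handler.chunks)


if __name__ == "__main__":
    print(extract_text("<doc><p>Hello, </p><p>world</p></doc>"))
```

Joining the chunks once at the end avoids repeated string concatenation, which matters for large documents.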

Comments

  1. At 9:25 a.m. on 21 mar 2004, Bill Bell said:
  2. At 9:46 a.m. on 21 mar 2004, Bill Bell said:

    Another way.

    from sgmllib import SGMLParser

    class XMLJustText(SGMLParser):
        def handle_data(self, data):
            print data

    XMLJustText().feed("text 1text 2")
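
`sgmllib` was removed in Python 3, so the snippet above only runs on Python 2. A rough Python 3 sketch of the same idea can use `html.parser.HTMLParser`, which also exposes `handle_data`. This is a lenient HTML parser, not an XML parser, so it does no well-formedness checking; the class name `JustText` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Collect text between tags (illustrative name, not from the comment above)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.parts.append(data)


parser = JustText()
parser.feed("<a>text 1<b>text 2</b></a>")  # sample markup invented for this sketch
parser.close()  # flush any text still buffered by the parser
print("".join(parser.parts))
```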
