ASPN ActiveState Programmer Network  
Title: OpenOffice to xml and/or text (oo2txt)
Submitter: Dirk Holtwick
Last Updated: 2004/08/30
Version no: 1.1
Category: Text

 

Description:

OpenOffice is very popular, and some people may want to index the contents of the documents they write with it. Here is a very simple way to get the XML and the plain text out of an OpenOffice document.

Source:

# -*- coding: Latin-1 -*-

"""
Convert OpenOffice documents to XML and text

USAGE:
ooconvert [filename]
"""

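import re
import sys
import zipfile

# Matches any XML tag; [^>] also matches newlines, so no flags are needed.
rx_stripxml = re.compile(r"<[^>]*>")

class ReadOO:
    """Read the document body (content.xml) of an OpenOffice file."""

    def __init__(self, filename):
        # An OpenOffice document is a ZIP archive; content.xml is UTF-8 XML.
        with zipfile.ZipFile(filename, "r") as zf:
            self.data = zf.read("content.xml").decode("utf-8")

    def getXML(self):
        """Return the raw content.xml as a string."""
        return self.data

    def getData(self, collapse=True):
        """Return the text with every XML tag replaced by a blank."""
        text = rx_stripxml.sub(" ", self.data)
        if collapse:
            # Collapse runs of whitespace into single blanks to save space.
            text = " ".join(text.split())
        return text

if __name__ == "__main__":
    if len(sys.argv) > 1:
        oo = ReadOO(sys.argv[1])
        print(oo.getXML())
        print(oo.getData())
    else:
        print(__doc__.strip())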

Discussion:

OpenOffice files are ZIP archives, and they always contain a file called "content.xml" that holds the document body. We extract this one. In the method getData we throw away the XML markup by replacing every tag with a blank, then split the result on whitespace and join the pieces again with single blanks to save space. Note that character entities such as &amp; are left as they are. This part could be done in a better way using an XML parser, but parsers often don't do what we expect them to do, so some help would be appreciated ;-)
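
As one possible answer to the XML parser question, here is a minimal sketch that uses xml.etree.ElementTree from the Python 3 standard library. The function name oo_text is made up for this example and is not part of the recipe. It assumes that content.xml is well-formed and that collecting the text of every element in document order is good enough for indexing.

```python
import xml.etree.ElementTree as ET
import zipfile


def oo_text(source):
    """Return the plain text of an OpenOffice document.

    `source` is a filename or a file-like object. The name oo_text is
    hypothetical; it is not part of the original recipe.
    """
    with zipfile.ZipFile(source, "r") as zf:
        # The parser reads the encoding from the XML declaration itself.
        root = ET.fromstring(zf.read("content.xml"))
    # itertext() yields the text of every element in document order;
    # entities such as &amp; have already been decoded by the parser.
    raw = " ".join(root.itertext())
    # Collapse runs of whitespace into single blanks, as getData does.
    return " ".join(raw.split())
```

Like the regular expression version, this puts a blank wherever a tag used to be, so a word that is split across two formatting spans comes out as two words. Unlike the regular expression version, it returns entities in decoded form.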


