
Some file processing tasks are quite time-consuming, especially when COM is involved. Unnecessary repetitions then become unbearable. Here is a module that helps avoid them.

# *** Put this in "fcache.py" ***

import sys, os
try:
   import cPickle as pickle
except ImportError:
   import pickle

######################################################################
## 
## File Utilities
## 
######################################################################

def ftime(filename, default=-sys.maxint):
   # modification time of a file; a missing file gets a very old default time,
   # so a missing cache file always counts as out of date
   if os.path.isfile(filename):
      return os.path.getmtime(filename)
   return default

def fisnewer(path, thenpath):
   return ftime(path) > ftime(thenpath)

def fstem(filename):
   return os.path.splitext(os.path.basename(filename))[0]

def fpickle(filename, content):
   if __debug__: print '==> pickling content to %s' % fstem(filename)
   file = open(filename, 'wb')
   try:
      try:
         pickle.dump(content, file, True)
      finally:
         file.close()
   except:
      # never leave a partially written cache file behind
      os.remove(filename)
      raise

def funpickle(filename):
   if __debug__: print '==> unpickling content from %s' % fstem(filename)
   file = open(filename, 'rb')

   try:
      return pickle.load(file)
   finally:
      file.close()

######################################################################
## 
## Introspection Helpers
## 
######################################################################


def definingModuleFile(func):
   return func.func_code.co_filename

def definingModuleName(func):
   return fstem(definingModuleFile(func))

def qualifiedIdentifier(func):
   m = definingModuleName(func)
   try:
      c = func.im_class
   except AttributeError:
      return '%s.%s' % (m, func.__name__)
   else:
      return '%s.%s.%s' % (m, c.__name__, func.__name__)

def defaultCacheDirectory():
   return os.path.join(os.path.dirname(__file__), '@cache')

######################################################################
## 
## Encoding Functions used to generate a cache file name
## 
######################################################################

def hashhex(s):
   return hex(hash(s))[2:]

def md5hex(s):
   import md5
   return md5.new(s).hexdigest()

######################################################################
## 
## Cache Handling
## 
######################################################################

def cacheFileStem(func, args, encode=hashhex):
   id = encode(repr(args))
   return r'%s-%s' % (qualifiedIdentifier(func), id)

def shouldRebuild(target, sources):
   for each in sources:
      if fisnewer(each, target): return True
   return False

class CacheManager:
   def __init__(self, cacheDir=None, cacheFileExt='.cache', encode=hashhex):
      self._dir = cacheDir or defaultCacheDirectory()
      if not os.path.isdir(self._dir): os.makedirs(self._dir)
      self._ext = cacheFileExt
      self._encode = encode
      self._cachefiles = []
   def cacheFilePath(self, func, *args):
      filename = cacheFileStem(func, args, encode=self._encode)
      return os.path.join(self._dir, filename) + self._ext
   def deleteCacheFiles(self):
      for each in self._cachefiles[:]:
         if os.path.isfile(each):
            os.remove(each)
            self._cachefiles.remove(each)
   def wrap(self, func):
      def call(*sources):
         sources = map(os.path.abspath, sources)
         cachefile = self.cacheFilePath(func, *sources)
         if shouldRebuild(cachefile, sources):
            # cache file missing or older than a source: recompute and store
            result = func(*sources)
            fpickle(cachefile, result)
            self._cachefiles.append(cachefile)
         else:
            # cache file is up to date: recycle the stored result
            result = funpickle(cachefile)
         return result
      return call


# --------------------------------------------------------------------
# *** Put this in a separate file, say "fcache_test.py" ***

if __name__ == '__main__':
   import os
   import fcache

   # create source files
   for i in range(1, 5): open('file%d.txt' % i, 'wt').close()

   # define a processing function
   def processFiles(file1, file2):
      s = 'processing files %r and %r ...' % (file1, file2)
      print s
      return s

   # define a processing method
   class FileProcessor:
      def processFiles(self, file1, file2):
         return processFiles(file1, file2)

   processor = FileProcessor()

   # create a cache manager
   cm = fcache.CacheManager()

   # let's have a look at the generated cache file names
   print
   print fcache.cacheFileStem(processFiles, ('file1.txt', 'file2.txt'), encode=fcache.hashhex)
   print fcache.cacheFileStem(processor.processFiles, ('file1.txt', 'file2.txt'), encode=fcache.hashhex)

   # wrap the processing function
   f = cm.wrap(processFiles)

   # see what happens when the function is called repeatedly with the same arguments
   print
   result = f('file1.txt', 'file2.txt'); print 'result:', result
   result = f('file3.txt', 'file4.txt'); print 'result:', result
   result = f('file1.txt', 'file2.txt'); print 'result:', result
   result = f('file1.txt', 'file2.txt'); print 'result:', result

   # delete this session's cache files if you don't need them later
   # (normally, you would leave them for later recycling)
   cm.deleteCacheFiles()

   # delete test source files
   for i in range(1, 5): os.remove('file%d.txt' % i)

# --------------------------------------------------------------------

The fcache module is concerned with avoiding unnecessary file processing operations (like parsing a file). It is assumed that such operations are implemented as functions or methods with the following signature: f(filename1, filename2, ...) -> object, or more generally f(*filenames) -> object. The basic assumption made here is that the operation is time-invariant (stateless); in other words, the result of f depends only on the input (source) files, not on any kind of internal state.
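
For illustration, an operation that fits this signature might look like the following (a purely hypothetical example, not part of the module):

# a hypothetical operation: the result depends only on the contents
# of the named files
def countLines(*filenames):
   total = 0
   for name in filenames:
      f = open(name)
      try:
         total += len(f.readlines())
      finally:
         f.close()
   return total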

The mechanism used is caching. Each result is stored (pickled) into a file whose name is derived from the function/method name and the arguments. If any of the files denoted by 'filenames' is newer than the cache file, or the cache file does not exist, the function is called and the result is stored. Otherwise, the result is recycled from the cache without invoking the function. (The approach partly mimics the strategy of build tools like make or ant.)
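
The decision itself can be tried out in isolation with the helpers defined in the module (the file names below are hypothetical):

import fcache

print fcache.ftime('source.txt')                     # mtime, or -sys.maxint if the file is missing
print fcache.fisnewer('source.txt', 'result.cache')  # True if source.txt is newer than result.cache
print fcache.shouldRebuild('result.cache', ['source.txt', 'other.txt'])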

This, of course, only makes sense where function execution is significantly slower than the unpickling process itself. That is usually the case with:

  • complex text file parsing, involving for example many regular expression operations
  • processing of files in proprietary formats via COM (COM introduces a run-time overhead that is especially striking when many inter-process COM invocations are made, as with processing MS-Office files)
  • ...

This recipe has been used to speed up an automated testing process that involved analyzing office documents via COM to retrieve data for the tests.

About the implementation:

The class CacheManager is responsible for wrapping functions in cache handling code. This is what its primary method, wrap, does. It also keeps track of the created cache files internally, so that they can be deleted via deleteCacheFiles.
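
Methods are handled just like plain functions. Reusing the cache manager cm and the FileProcessor instance from the test script above, a wrapped bound method could be used like this:

m = cm.wrap(processor.processFiles)    # wrap a bound method
result = m('file1.txt', 'file2.txt')   # cached under fcache_test.FileProcessor.processFiles-<key>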

Upon instantiation, a cache file folder, a cache file name extension and an encoding function other than the defaults may be specified.
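
For example, a customized manager could be created like this (the folder name is made up; it will be created if it does not exist):

import fcache

cm = fcache.CacheManager(cacheDir='mycache',
                         cacheFileExt='.pkl',
                         encode=fcache.md5hex)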

The default cache folder is the subfolder "@cache" of the directory where the fcache.py module resides. This is so that, when the module is distributed with different applications (which usually go into different folders), each application automatically gets a cache folder of its own.

The default cache file extension is ".cache" (not very inventive, I know).

The encoding function is used to produce a unique key from the function arguments and defaults to fcache.hashhex. There is a second encoding function, fcache.md5hex, which is slower but has a much lower probability of generating the same key for different arguments.
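
To compare the two, apply them to the same argument representation:

import fcache

args = ('file1.txt', 'file2.txt')
print fcache.hashhex(repr(args))   # short key derived from the built-in hash()
print fcache.md5hex(repr(args))    # 32-character MD5 digest, far fewer collisions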

The key is used to produce a unique cache file name for every combination of a function and its arguments. The cache file name is generated as follows: First, all source file names are converted to absolute paths via sources = map(os.path.abspath, sources).

Then the encoding function is invoked with repr(sources), and the generated key is prefixed with the function's qualified identifier, which is ModuleName.FunctionName for functions and ModuleName.ClassName.MethodName for methods. (In the future, this could be extended to include package names as well.)
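
Putting it together: inside the test script above (where processFiles and the cache manager cm are already defined), the resulting cache file path can be inspected like this; the hex key in the comment is made up:

print cm.cacheFilePath(processFiles,
                       os.path.abspath('file1.txt'),
                       os.path.abspath('file2.txt'))
# -> something like ...\@cache\fcache_test.processFiles-1a2b3c4d.cache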

A word of caution:

The module only compares the time stamps of the source files against that of the cache file to determine whether a cached result is out of date. However, a saved result may also become invalid when the implementation itself changes. While it would be easy to check whether the operation's defining module is newer than the cache file, the result may also depend on arbitrary other modules. A complete module dependency analysis would be required for a rigorous check, but that is beyond the scope of this recipe. Without it, my advice is to clear the cache repository (delete all files) after EVERY code change.
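
One cheap partial safeguard, sketched here with the existing helpers (it is not a real dependency analysis), is to treat the operation's own defining module as an additional source file:

import fcache

def shouldRebuildIncludingCode(target, sources, func):
   # hypothetical helper: the cache also counts as stale when the module
   # defining func was modified after the cache file was written
   return fcache.shouldRebuild(target, list(sources) + [fcache.definingModuleFile(func)])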

My current feature wish list (feel free to extend it):

  • Limitation of the repository size with auto-remove of least recently used entries
  • Build a checksum over the result and use it when checking the cache, to make the decision about result validity more reliable.
  • Include enclosing packages in the qualified identifier (a possible approach is sketched below).
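
For the last item, one possible approach (a sketch only) is to take the module part from func.__module__, which already carries the enclosing package names:

def qualifiedIdentifier2(func):
   # hypothetical variant of fcache.qualifiedIdentifier
   m = func.__module__                # e.g. 'mypackage.parsers' (made-up name)
   try:
      c = func.im_class
   except AttributeError:
      return '%s.%s' % (m, func.__name__)
   else:
      return '%s.%s.%s' % (m, c.__name__, func.__name__)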

Cheers and happy caching!

[See Also]

Recipe "Memoizing (cacheing) function return values" by Paul Moore, and especially the coment on closures by Hannu Kankaanpää. It is here: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52201