|
Description:
This recipe can be used to sort big files (much bigger than the available RAM) according to a key. The sort is guaranteed to be stable on Python 2.3.
Source: Text Source
from heapq import heapify, heappop, heappush
from itertools import islice, cycle
from tempfile import gettempdir
import os
def merge(chunks,key=None):
if key is None:
key = lambda x : x
values = []
for index, chunk in enumerate(chunks):
try:
iterator = iter(chunk)
value = iterator.next()
except StopIteration:
try:
chunk.close()
os.remove(chunk.name)
chunks.remove(chunk)
except:
pass
else:
heappush(values,((key(value),index,value,iterator,chunk)))
while values:
k, index, value, iterator, chunk = heappop(values)
yield value
try:
value = iterator.next()
except StopIteration:
try:
chunk.close()
os.remove(chunk.name)
chunks.remove(chunk)
except:
pass
else:
heappush(values,(key(value),index,value,iterator,chunk))
def batch_sort(input,output,key=None,buffer_size=32000,tempdirs=[]):
if not tempdirs:
tempdirs.append(gettempdir())
input_file = file(input,'rb',64*1024)
try:
input_iterator = iter(input_file)
chunks = []
try:
for tempdir in cycle(tempdirs):
current_chunk = list(islice(input_iterator,buffer_size))
if current_chunk:
current_chunk.sort(key=key)
output_chunk = file(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
output_chunk.writelines(current_chunk)
output_chunk.flush()
output_chunk.seek(0)
chunks.append(output_chunk)
else:
break
except:
for chunk in chunks:
try:
chunk.close()
os.remove(chunk.name)
except:
pass
if output_chunk not in chunks:
try:
output_chunk.close()
os.remove(output_chunk.name)
except:
pass
return
finally:
input_file.close()
output_file = file(output,'wb',64*1024)
try:
output_file.writelines(merge(chunks,key))
finally:
for chunk in chunks:
try:
chunk.close()
os.remove(chunk.name)
except:
pass
output_file.close()
if __name__ == '__main__':
import optparse
parser = optparse.OptionParser()
parser.add_option(
'-b','--buffer',
dest='buffer_size',
type='int',default=32000,
help='''Size of the line buffer. The file to sort is
divided into chunks of that many lines. Default : 32,000 lines.'''
)
parser.add_option(
'-k','--key',
dest='key',
help='''Python expression used to compute the key for each
line, "lambda line:" is prepended.\n
Example : -k "line[5:10]". By default, the whole line is the key.'''
)
parser.add_option(
'-t','--tempdir',
dest='tempdirs',
action='append',
default=[],
help='''Temporary directory to use. You might get performance
improvements if the temporary directory is not on the same physical
disk than the input and output directories. You can even try
providing multiples directories on differents physical disks.
Use multiple -t options to do that.'''
)
parser.add_option(
'-p','--psyco',
dest='psyco',
action='store_true',
default=False,
help='''Use Psyco.'''
)
options,args = parser.parse_args()
if options.key:
options.key = eval('lambda line : (%s)'%options.key)
if options.psyco:
import psyco
psyco.full()
batch_sort(args[0],args[1],options.key,options.buffer_size,options.tempdirs)
Discussion:
The memory used is at most buffer_size * maximum length of line + a 64 kb buffer per chunk (there are number of lines in the file / buffer_size chunks). In practice one should try to provide big buffer_size numbers since the limiting factor is often the maximum numbers of file a process can open simultaneously. The merge process does a n-way merge between all the chunks, I should try to perform partial mergings to work around this limitation.
The Python Cookbook already contains a big file sort implementation (look for "big file sorting"), but I feel that this version is more terse and Pythonic.
Usage example :
sort.py -k "(line[5:10],int(line[10:16])" input.txt output.txt
This sort lexicographically on the text found at columns 5 to 9 inclusive, then numerically on the number found at column 10 to 15 inclusive.
Performance :
See by yourself :) I find this script quite competitive thanks to :
1) the awesome speed of Python's sort
2) the not less awesome speed of Python heapq
3) usage of file.writelines
4) a trick I've found to implement the merge sort, which is a variant of the Schwartzian transform wherein the iterator is included in the transformed tuple.
Update (1.1) : integrated changes by Ian Bicking.
Update (1.2) : added the possibility of specifying temporary directories, which can give a performance boost if they reside on different physical disks than the input or output files.
Update (1.3,1.4) : fixed typos.
Update (1.5) : This version now does its best to cleanup the temporary files even when an error is raised during the sort.
|
|
Add comment
|
|
Number of comments: 4
tweaks and times, Ian Bicking, 2006/01/17
You can do x.sort(key=None), so you don't have to test for "key is not None". Testing "if len(current_chunk)>0" is a long way of saying "if current_chunk". And the variable "position" in merge doesn't seem necessary -- the stability of the first .sort() should be sufficient.
It would be interesting to benchmark this against GNU sort, which I hear is really good at sorting large files. Obviously GNU sort will win, but by how much? Well, I can figure that out for myself I suppose; with a file with a million random lines:
$ time sort test_big.txt > tmp.txt
real 0m4.017s
user 0m2.584s
sys 0m0.396s
$ time python large_sort.py test_big.txt tmp.txt
real 0m8.436s
user 0m7.454s
sys 0m0.322s
Three times slower than GNU sort... which is actually a lot better than I expected.
Add comment
Ed Blake, 2006/01/17
Yeah, and how much did Python's startup time contribute to the difference?
Add comment
Thanks for the tips, Nicolas Lehuen, 2006/01/17
I've modified the recipe according to your remarks. You're right, adding a position field to the merged tuples is unneeded since the iterators used by the merging function are returning items in a stable way. The index field is needed, though, otherwise for equal key values we would risk that the iterators be used in a different order than the original one.
I've tested this script against GNU sort (textutils) 2.1 running under Win32, and it was quite competitive, only two to three times slower.
The point is that you get much more flexibility in the way you can define keys ; I wouldn't know how to easily have a two-part key with different semantics (lexicographic and numeric) using GNU sort. The key definition in the command line is only a start, you can define much more complicated keys.
Add comment
gotchas, cleanup, and common usage, Maris Fogels, 2006/02/27
A very useful recipe! But there is one 'gotcha' in v1.4. If the input file does not end with a newline, then the output file will combine the last input line with the next line in the sort sequence. This is because file.writelines() does not add line separators.
In my case, I am working with text files that end with '\x1a' instead of a newline. GNU sort will put the '\x1a' character on it's own line at the very beginning of the file, but the python version combines the first and second lines. I hacked in a workaround by checking for end-of-file using input_file.tell(), but it added 20% to the processing time :( I figure there must be a more elegant solution.
Another problem I encountered is that chunk files are not cleaned up if the chunk creation process is interrupted. This could happen when:
- the 'key' function throws an exception
- KeyboardInterrupt (the user cancels the program; I do this all the time)
- MemoryError (buffer_size is too large)
- IOError (the disk is full)
I timed the (modified) function against our previous implementation, which uses os.system():
$ timeit.py -s 'import batchsort' \
'batchsort.batch_sort("test.TXT", "sorted.txt", lambda line: (line[210:223]))'
10 loops, best of 3: 2.74e+05 usec per loop
$ timeit.py -s 'import os' \
'os.system("sort -k 1.210,1.223 test.TXT > sorted2.txt")'
10 loops, best of 3: 3.95e+05 usec per loop
Not bad, 30% faster on an athlon 800Mhz running Python 2.3, against a 2Mb file with 2200 rows. The difference is probably from the overhead of starting and monitoring the external process. Could someone verify this?
I would consider replacements for existing os.system() calls to be a more common case than command-line usage, but the time advantage may disappear since you will probably be calling 'sort' once on a large file, not 10 times on some small files.
One advantage this recipe has over GNU sort is that it is predictable. On Linux Fedora Core 3 GNU sort does some funny and annoying things because of the default locale settings. I prefer to use predictable pure python rather than checking and setting environment variables in sort's calling shell.
Add comment
|
|
|