|
|
 |
|
Title: Lazy sorting
Submitter: Matteo Dell'Amico
(other recipes)
Last Updated: 2004/05/03
Version no: 1.1
Category:
Searching
|
|
2 vote(s)
|
|
|
|
Description:
This generator will sort only the requested part of a list. Useful for getting just the first few elements of a list and avoiding the overhead of sorting the whole thing.
Source: Text Source
import heapq
def isorted(iterable):
lst = list(iterable)
heapq.heapify(lst)
pop = heapq.heappop
while lst:
yield pop(lst)
Discussion:
This second version is by Raymond Hettiger, and is both simpler and faster than my original one, based on a generator-based quicksort.
The name isorted follows the convention from the itertools package and mimics the new python 2.4 "sorted" builtin.
Hope this is useful for someone, I needed more than once something like this.
|
|
Add comment
|
|
Number of comments: 6
generator-based quicksort, George Yoshida, 2004/04/28
I modified Matteo's original script so that isorted can take any iterable as an argument.
from itertools import ifilter, count
try:
from itertools import tee
except ImportError, err:
def tee(iterable):
def gen(next, data={}, cnt=[0]):
for i in count():
if i == cnt[0]:
item = data[i] = next()
cnt[0] += 1
else:
item = data.pop(i)
yield item
it = iter(iterable)
return (gen(it.next), gen(it.next))
def isorted(iterable):
iterable = iter(iterable)
try:
pivot = iterable.next()
except:
return
a, b = tee(iterable)
for x in isorted(ifilter(lambda item:item<pivot, a)):
yield x
yield pivot
for x in isorted(ifilter(lambda item:item>=pivot, b)):
yield x
Note : my generator-based quicksort does not construct a list.
I used itertools.tee, which will appear in Python 2.4.
Add comment
stable version, Daniel Harding, 2004/04/28
This version guarantees stability by selecting the first element as the pivot (this results in poor performance if the list is already sorted, but is no worse than the original version). Also, unlike the original version, it never modifies the list that is passed into it.
MIN_SIZE = 128
def isorted(lst):
if len(lst) <= MIN_SIZE:
lst = lst[:]
lst.sort()
for x in lst:
yield x
else:
pivot = lst[0]
lst = lst[1:]
lt = [x for x in lst if x < pivot]
for x in isorted(lt):
yield x
yield pivot
ge = [x for x in lst if x >= pivot]
for x in isorted(ge):
yield x
Add comment
Matteo Dell'Amico, 2004/04/29
George and Daniel, thanks for your comments. I did some test, and saw that the problem in all these versions is performance: python's builtin list.sort is so good that there are really few cases in which these solutions are better. The real performance bonus would be implementing this in C.
If someone who has "the right to speak" :-) thinks this is worth the itertools package, I'd be happy to study how to implement it.
Add comment
Simplification and speedup with heapq, Raymond Hettinger, 2004/05/01
In terms of use cases, lazy sorting is valuable only when you don't plan on consuming all of the output (otherwise, you might as well use a regular sort). Choosing the n smallest items out of a list would be a good application.
Keeping this in mind, a heap is a natural choice of underlying data structures:
import heapq
def isorted(iterable):
lst = list(iterable)
heapq.heapify(lst)
pop = heapq.heappop
while lst:
yield pop(lst)
isorted() is not an especially good candidate for itertools because it immediately manifests the whole input into memory and because most its use cases tend to be dominated by list.sort() and heapq.
Add comment
Matteo Dell'Amico, 2004/05/03
Raymond, you're absolutely right. Although, I argue that a well-written lazy sort could be almost as efficient as list.sort. Not that it would matter, since heapsort is just fine. I'm adopting your modification for version 2. :-)
Add comment
For 2.4 and later, consider using heapq's nsmallest or nlargest functions, Jim Baker, 2006/08/31
The timsort implementation used by sorted and sort is great. But laziness can still win out. In my experience, lazy sorting is significantly more efficient when a percentage of the query is being used as seen in top N type queries. Such queries very popular for user interfaces! Consequently I like the fact that 2.4 added this functionality in heapq.nlargest and heapq.nsmallest.
There should be no surprise that lazy sorting can be more efficient, we are comparing O(n + k logn) = O(n) -- if k is a constant as it is with top 50 -- vs O(n logn).
Of course that percentage will vary with the constant overhead you will see in your code. As usual, try it both ways if the sorting is in a key piece of code, as it was in mine.
Add comment
|
|
|
|
|
 |
|