|
|
 |
|
Title: bag collection class
Submitter: Raymond Hettinger
(other recipes)
Last Updated: 2004/11/30
Version no: 1.5
Category:
Algorithms
|
|
2 vote(s)
|
|
|
|
Description:
Implement Smalltalk's bag class (similar to MultiSets in C++ or bag's in Objective C or Haskell's Edison module).
Source: Text Source
from operator import itemgetter
from heapq import nlargest
class bag(object):
def __init__(self, iterable=()):
self._data = {}
self._len = 0
self.update(iterable)
def update(self, iterable):
if isinstance(iterable, dict):
for elem, n in iterable.iteritems():
self[elem] += n
else:
for elem in iterable:
self[elem] += 1
def __contains__(self, elem):
return elem in self._data
def __getitem__(self, elem):
return self._data.get(elem, 0)
def __setitem__(self, elem, n):
self._len += n - self[elem]
self._data[elem] = n
if n == 0:
del self._data[elem]
def __delitem__(self, elem):
self._len -= self[elem]
del self._data[elem]
def __len__(self):
assert self._len == sum(self._data.itervalues())
return self._len
def __eq__(self, other):
if not isinstance(other, bag):
return False
return self._data == other._data
def __ne__(self, other):
if not isinstance(other, bag):
return True
return self._data != other._data
def __hash__(self):
raise TypeError
def __repr__(self):
return 'bag(%r)' % self._data
def copy(self):
return self.__class__(self)
__copy__ = copy
def __deepcopy__(self, memo):
from copy import deepcopy
result = self.__class__()
memo[id(self)] = result
data = result._data
result._data = deepcopy(self._data, memo)
result._len = self._len
return result
def __getstate__(self):
return self._data.copy(), self._len
def __setstate__(self, data):
self._data = data[0].copy()
self._len = data[1]
def clear(self):
self._data.clear()
self._len = 0
def __iter__(self):
for elem, cnt in self._data.iteritems():
for i in xrange(cnt):
yield elem
def iterunique(self):
return self._data.iterkeys()
def itercounts(self):
return self._data.iteritems()
def mostcommon(self, n=None):
if n is None:
return sorted(self.itercounts(), key=itemgetter(1), reverse=True)
it = enumerate(self.itercounts())
nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
return [(elem, cnt) for cnt, i, elem in nl]
Discussion:
I'm proposing this API for inclusion in a Py2.5 collections module. Please comment if you see any way the API can be improved.
Example:
>>> from collections import bag
>>> b = bag('banana')
>>> b.mostcommon()
[('a', 3), ('n', 2), ('b', 1)]
>>> list(b)
['a', 'a', 'a', 'b', 'n', 'n']
>>> set(b.iterunique()) # faster than set(b)
set(['a', 'b', 'n'])
>>> b['a']
3
>>> b['a'] += 4
>>> b['a']
7
>>> del b['a']
>>> b['a'] += 1
>>> b['a']
1
Reposted after incorporating most of the suggestions below.
|
|
Add comment
|
|
Number of comments: 15
subscripting?, Matt R, 2004/01/16
Finding this recipe has just saved me an hour or so -- thanks!
API: How about using subscripting to access the counts?
>>> b = bag('banana')
>>> b['a']
3
Add comment
API, Not specified Not specified, 2004/01/17
I agree with previous comment from Matt R and prefer subscripting interface so i can:
>>> mybag = bag("abacab")
>>> mybag['d'] += 1 #automatically adds 'a' to bag if doesn't exist
>>> mybag['c'] -= 2 #clear (or throw exception) if less than 2 'c's in bag
>>> mybag['a'] = 0 #removes all 'a' from bag
>>> mybag['e'] #queries bag for the count of 'e'
0
This gets rid of count, add, remove methods.
Since I inheret my bag from dict, I redefine pop() to both return and remove all of a given item of bag. Also since I have to redefine dict.__setitem__() anyway, I use that opportunity to validate the dict(non-zero integer) values before insertion. This also makes it easy for subclasses to redefine the policy of how the bag should behave (like in example #2 above). (Alternatively, one could pass in a policy function or "filter" to the bag constructor.)
I also allow add/subtract arithmetic on bags:
>>> bag("word counts from file one".split()) + bag("word counts from another file".split())
bag({"word": 2, "counts": 2, "from": 2, "another": 1, "file": 2, "one": 1})
Mike
Add comment
addendum, Not specified Not specified, 2004/01/18
I should add that the suggested API allows me to easily create an weighted graph class as a dict of bags with a nice API for the graph:
>>> g = graph()
>>> g['here']['there'] #returns edge weight
>>> g['here']['newthere'] = 1 #add new vertex 'newthere' with weight=1
Add comment
Bags should do not have -ve quantities, Jon Blunt, 2005/06/19
If you decrement a value below 0 there is no logical meaning and you can break the validity of len(myBag).
>>> x =bag.bag()
>>> x['a']
0
>>> x['a']-=1
>>> x['a']
-1
>>> len (x)
Traceback (most recent call last):
File "", line 1, in -toplevel-
len (x)
ValueError: __len__() should return >= 0
Add comment
Inequality operator, Remy Blank, 2004/11/29
Shouldn't the __ne__() operator be:
def __ne__(self, other):
if not isinstance(other, bag):
return True
return self._data != other._data
I.e. if the other object is not a bag, it is *not* equal to the bag?
Add comment
__iter__, Steven Bethard, 2004/11/29
Why does __iter__ iterate over (item, count) pairs as opposed to just items like dict? That is, why not try to parallel the current Python objects more?
Add comment
remove, Steven Bethard, 2004/11/29
I think maybe remove needs renamed. If I called the remove method on a bag, I would expect the count to be zeroed, not decremented. Maybe 'decrement' is a better name? I would also like a remove method that really does zero the count -- this is useful for pruning a bag by count... Of course, a 'prune' method that removes all items with less than or equal to a given count would be even better for my purposes...
Add comment
Thinking in abstract vs. implementation, Andreas Kloss, 2004/11/30
If you think of the bag as a key:count dict, then yes, "remove" should remove the whole key, and "decrement" could decrement until 0 (then remove).
On the other hand, if you think of the bag as an unordered container for multiple possibly-equal things, then it is sensible for "remove" to remove just one element instead of all elements being equal to the one you happened to select.
Case in point: Suppose you have a bag of many black and white marbles. You remove a white marble. Would you expect the bag to not contain any white marble? The method to remove all elements of an equality class should be aptly named "removeAll" or something to this effect.
-- Andre
Providing high quality math zealotry since 1977 ;)
Add comment
Steven Bethard, 2004/11/30
My typical use for a bag class would be to hold word counts, not marbles. ;) So my abstraction *is* keys with counts. Of course, it's a moot point since the remove method's been removed in favor of __getitem__ and __setitem__.
Add comment
Still need add and remove, Martin v. Löwis, 2005/03/27
Users coming from Smalltalk or Objective C, or users familiar with C++ multiset will expect that a collections.bag is a container that is capable of holding the same thing (say, a string) multiple times. Currently, there appears to be no .add method, which I think there should be. Also, .remove should remove a single item, not all copies of the same item. For convenience, .addAll might be useful, to implement a frequency count.
Add comment
sortedbycount, Steven Bethard, 2004/11/29
I don't think sortedbycount should be a method. As your implementation shows, this functionality is already available with sorted(). I'm not sure that sortedbycount actually gains us much here...
Add comment
mostcommon, Steven Bethard, 2004/11/30
Thanks, I like mostcommon a lot better!
Add comment
specialcase update with another bag, Jim Jewett, 2004/12/02
I would expect to able to write
b1 = Bag()
b2 = Bag()
...
b2.update(b1)
As written, this turns out to be a fairly inefficient way of doing it. If b1[k] == 20, it will take 20 separate calls to the iterator, the length adjuster, and the dictionary's __setitem__
Add comment
itercounts is confusing, Steven Bethard, 2005/02/16
I was looking at this again today because I need to use something like this right now, and it strikes me that itercounts is kind of confusing. For my current task, I want just the counts (i.e. the equivalent of _data.itervalues()). To me, itercounts sounds like it does this, not iterate through the (item, count) pairs.
So I guess I have two suggestions:
(1) Make itercounts iterate just through the counts, and
(2) Rename itercounts to something like iteritems or iteritemcounts.
Thanks again!
Add comment
Union, intersection, and difference of two bags?, Dan Stromberg, 2007/11/27
This class would be more general if it supported union, intersection and difference of two bags:
http://www.brpreiss.com/books/opus5/html/page398.html
...making it more truly like "sets with duplicates".
I'll probably need a method that does a modified union, where b.minunion(c) would give a bag with elements e s.t. their repetition counts are max(|e| in b, |e| in c). But I'm not sure I should foist this weird operation on the rest of the python world. :)
Add comment
|
|
|
|
|
 |
|