ASPN ActiveState Programmer Network  
ActiveState, a division of Sophos
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups
Submit Recipe
My Recipes

All Recipes
All Cookbooks


View by Category

Title: sample with replacement
Submitter: Sean Ross (other recipes)
Last Updated: 2004/03/08
Version no: 1.2
Category: Algorithms

 

3 stars 1 vote(s)


Description:

For taking k random samples (with replacement) from a population, where k may be greater than len(population).

Source: Text Source

# credit author(s) of random.py
import random

def sample_wr(population, k):
    "Chooses k random elements (with replacement) from a population"
    n = len(population)
    _random, _int = random.random, int  # speed hack 
    result = [None] * k
    for i in xrange(k):
        j = _int(_random() * n)
        result[i] = population[j]
    return result

Discussion:

random.sample() lets you do random sampling without replacement. sample_wr() lets you sample with replacement.

Some simple examples:

tosses = sample_wr(('H', 'T'), 100) # simulate 100 coin tosses
rolls = sample_wr(range(1,7), 100) # simulate 100 dice rolls

# make a random string from 200 characters over a given alphabet
from string import letters as alphabet
rstr = ''.join(sample_wr(alphabet, 200))


NOTE:

You could use

from random import choice
sample = [choice(population) for i in xrange(k)]

but that is 2-4 times slower than sample_wr(population, k) for 10E3 <= k <= 10E6

from random import random
n = len(population)
sample = [population[int(random()*n)] for i in xrange(k)]

is better but still slower.



Add comment

Number of comments: 5

math.floor() beats int(), Raymond Hettinger, 2004/03/11
It is faster still to use _int=math.floor for this situation (rounding positive numbers downward).
Add comment

Nix the previous comment, Raymond Hettinger, 2004/03/11
The float result still needs to be converted back to an integer at some point.
Add comment

In Py2.4, pre-allocation no longer rules, Raymond Hettinger, 2004/03/14
In the next version of Python, list comprehensions have been super-optimized and cannot be beat by pre-allocating and using indices.

def sample_wr(population, k):
    "Chooses k random elements (with replacement) from a population"
    n = len(population)
    _random, _int = random.random, int  # speed hack 
    return [_int(_random() * n) for i in xrange(k)]

And you can go bit faster using itertools:
def sample_wr(population, k):
    "Chooses k random elements (with replacement) from a population"
    n = len(population)
    _random, _int = random.random, int  # speed hack 
    return [_int(_random() * n) for i in itertools.repeat(None, k)]


Add comment

correction, accolades, and propositions, Sean Ross, 2004/03/14
return [population[_int(_random() * n)] for i in ... ]

I've been following python-dev, so I'm aware of the optimizations you've been making. Congratulations on your results to date, and thank you for your time and efforts.

I wonder, do you suppose the developers would accept changing random.sample to allow for sampling with replacement?

# replacement=False by default (backwards compatible)
random.sample(population, k, replacement=True)
Add comment

Adding a replace=False option to random.sample, Raymond Hettinger, 2004/03/16
For several reasons, probably not.

The straight-forward list comp does the trick pretty well.  Anything that someone can bang out without much thought is rarely a good candidate for building into the library unless the pattern is very general and the use cases very common.  The reasoning is that it is typically easier to bang out a couple of lines than to learn and remember dozens of method variations.  Part of the justification for the inclusion of sampling without replacement is that it took a great deal of skill and time to implement correctly.
The other issue is that there are plenty of use cases that just do not need the whole sample all a once (those are best served by a simple for-loop).  In contrast, sampling without replacement requires some state memory between calls.

Also, adding a new method is typically preferred to adding a keyword switch (for example, see itertools ifilter() and ifilterfalse()).  However, for the reasons listed above, the inclusion of this as a separate method is unlikely.

I put all of this here because it is useful to a wider audience.  Feel free to email me for any further discussion.
Add comment



Highest rated recipes:

1. A simple XML-RPC server

2. Web service accessible ...

3. IPy Notify

4. Changing return value ...

5. Quantum Superposition

6. Pickle objects under ...

7. Generalized delegates ...

8. Reorder a sequence (uses ...

9. Setting Win32 System ...

10. ObjectMerger




Privacy Policy | Email Opt-out | Feedback | Syndication
© 2006 ActiveState Software Inc. All rights reserved.