ASPN ActiveState Programmer Network
ActiveState
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups


Recent Messages
List Archives
About the List
List Leaders
Subscription Options

View Subscriptions
Help

View by Topic
ActiveState
.NET Framework
Open Source
Perl
PHP
Python
Tcl
Web Services
XML & XSLT

View by Category
Database
General
SOAP
System Administration
Tools
User Interfaces
Web Programming
XML Programming


MyASPN >> Mail Archive >> numpy-discussion
numpy-discussion
Re: [Numpy-discussion] Histograms of extremely large data sets
by Rick White other posts by this author
Dec 13 2006 8:40PM messages near this date
[Numpy-discussion] Histograms of extremely large data sets | Re: [Numpy-discussion] Histograms of extremely large data sets
On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:

>  I'm trying to generate histograms of extremely large datasets.  I've
>  tried a few methods, listed below, all with their own shortcomings.
>  Mailing-list archive and google searches have not revealed any
>  solutions.

The numpy.histogram function can be modified to use memory much more  
efficiently when the input array is large, and the modification turns  
out to be faster even for smallish arrays (in my tests, anyway).   
Below is a modified version of the histogram function from  
function_base.py.  It is almost identical, but it avoids doing the  
sort of the entire input array simply by dividing it into blocks.   
(It would be even better to avoid the call to ravel too.)  The only  
other messy detail is that the builtin range function is shadowed by  
the 'range' parameter.

In my timing tests this is about the same speed for arrays about the  
same size as the block size and is faster than the current version by  
30-40% for large arrays.  The speed difference increases as the array  
size increases.

I haven't compared this to Eric's weave function, but this has the  
advantages of being pure Python and of being much simpler.  On my  
machine (MacBook Pro) it takes about 4 seconds for an array with 100  
million elements.  The time increases perfectly linearly with array  
size for arrays larger than a million elements.
					Rick

from numpy import *

lrange = range
def histogram(a, bins=10, range=None, normed=False):
     a = asarray(a).ravel()
     if not iterable(bins):
         if range is None:
             range = (a.min(), a.max())
         mn, mx = [mi+0.0 for mi in range]
         if mn == mx:
             mn -= 0.5
             mx += 0.5
         bins = linspace(mn, mx, bins, endpoint=False)

     # best block size probably depends on processor cache size
     block = 65536
     n = sort(a[:block]).searchsorted(bins)
     for i in lrange(block,len(a),block):
         n += sort(a[i:i+block]).searchsorted(bins)
     n = concatenate([n, [len(a)]])
     n = n[1:]-n[:-1]

     if normed:
         db = bins[1] - bins[0]
         return 1.0/(a.size*db) * n, bins
     else:
         return n, bins

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@[...].org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Thread:
Cameron Walsh
Rick White
Brian Granger
Eric Jones
Cameron Walsh
Rick White
Eric Jones
Cameron Walsh
Cameron Walsh
Eric Jones
Giorgio Luciano
Sven Schreiber
Christopher Barker
Cameron Walsh
Eric Jones

Privacy Policy | Email Opt-out | Feedback | Syndication
© ActiveState Software Inc. All rights reserved