ASPN ActiveState Programmer Network
ActiveState
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups


Recent Messages
List Archives
About the List
List Leaders
Subscription Options

View Subscriptions
Help

View by Topic
ActiveState
.NET Framework
Open Source
Perl
PHP
Python
Tcl
Web Services
XML & XSLT

View by Category
Database
General
SOAP
System Administration
Tools
User Interfaces
Web Programming
XML Programming


MyASPN >> Mail Archive >> numpy-discussion
numpy-discussion
Re: [Numpy-discussion] Histograms of extremely large data sets
by Eric Jones other posts by this author
Dec 13 2006 10:03PM messages near this date
Re: [Numpy-discussion] Histograms of extremely large data sets | Re: [Numpy-discussion] Histograms of extremely large data sets
Looks to me like Rick's version is simpler and faster.It looks like it 
offers a speed-up of about 1.6 on my machine over the weave version.  I 
believe this is because the sorting approach results in quite a few less 
compares than the algorithm I used. 

Very cool.  I vote that his version go into numpy.

eric



Rick White wrote:
>  On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
> 
>    
> > I'm trying to generate histograms of extremely large datasets.  I've
> > tried a few methods, listed below, all with their own shortcomings.
> > Mailing-list archive and google searches have not revealed any
> > solutions.
> >     
> 
>  The numpy.histogram function can be modified to use memory much more  
>  efficiently when the input array is large, and the modification turns  
>  out to be faster even for smallish arrays (in my tests, anyway).   
>  Below is a modified version of the histogram function from  
>  function_base.py.  It is almost identical, but it avoids doing the  
>  sort of the entire input array simply by dividing it into blocks.   
>  (It would be even better to avoid the call to ravel too.)  The only  
>  other messy detail is that the builtin range function is shadowed by  
>  the 'range' parameter.
> 
>  In my timing tests this is about the same speed for arrays about the  
>  same size as the block size and is faster than the current version by  
>  30-40% for large arrays.  The speed difference increases as the array  
>  size increases.
> 
>  I haven't compared this to Eric's weave function, but this has the  
>  advantages of being pure Python and of being much simpler.  On my  
>  machine (MacBook Pro) it takes about 4 seconds for an array with 100  
>  million elements.  The time increases perfectly linearly with array  
>  size for arrays larger than a million elements.
>  					Rick
> 
>  from numpy import *
> 
>  lrange = range
>  def histogram(a, bins=10, range=None, normed=False):
>       a = asarray(a).ravel()
>       if not iterable(bins):
>           if range is None:
>               range = (a.min(), a.max())
>           mn, mx = [mi+0.0 for mi in range]
>           if mn == mx:
>               mn -= 0.5
>               mx += 0.5
>           bins = linspace(mn, mx, bins, endpoint=False)
> 
>       # best block size probably depends on processor cache size
>       block = 65536
>       n = sort(a[:block]).searchsorted(bins)
>       for i in lrange(block,len(a),block):
>           n += sort(a[i:i+block]).searchsorted(bins)
>       n = concatenate([n, [len(a)]])
>       n = n[1:]-n[:-1]
> 
>       if normed:
>           db = bins[1] - bins[0]
>           return 1.0/(a.size*db) * n, bins
>       else:
>           return n, bins
> 
>  _______________________________________________
>  Numpy-discussion mailing list
>  Numpy-discussion@[...].org
>  http://projects.scipy.org/mailman/listinfo/numpy-discussion
> 
>    

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@[...].org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Thread:
Cameron Walsh
Rick White
Brian Granger
Eric Jones
Cameron Walsh
Rick White
Eric Jones
Cameron Walsh
Cameron Walsh
Eric Jones
Giorgio Luciano
Sven Schreiber
Christopher Barker
Cameron Walsh
Eric Jones

Privacy Policy | Email Opt-out | Feedback | Syndication
© 2004 ActiveState, a division of Sophos All rights reserved