ASPN ActiveState Programmer Network
ActiveState
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups


Recent Messages
List Archives
About the List
List Leaders
Subscription Options

View Subscriptions
Help

View by Topic
ActiveState
.NET Framework
Open Source
Perl
PHP
Python
Tcl
Web Services
XML & XSLT

View by Category
Database
General
SOAP
System Administration
Tools
User Interfaces
Web Programming
XML Programming


MyASPN >> Mail Archive >> numpy-discussion
numpy-discussion
Re: [Numpy-discussion] Histograms of extremely large data sets
by Cameron Walsh other posts by this author
Dec 13 2006 11:56PM messages near this date
Re: [Numpy-discussion] Histograms of extremely large data sets | Re: [Numpy-discussion] Histograms of extremely large data sets
Hi all,

Absolutely gorgeous, I confirm the 1.6x speed-up over the weave
version, i.e. a 25x speed-up over the existing version.

It would be good if the redefinition of the range function could be
changed in the numpy modules,  before it goes into subversion, to
avoid the need for Rick's line
lrange=range
before the new histogram function.  At some point I might try and test
different cache sizes for different data-set sizes and see what the
effect is.  For now, 65536 seems a good number and I would be happy to
see this replace the current numpy.histogram.

Thanks very much Eric and Rick, you've both taught me a lot, as well
as solving the original problem.  I'm sure this will be of use to
others in the future, so if there's anything I can do to assist in
getting this into the next numpy release, please let me know.

Best regards,

Cameron.


On 14/12/06, eric jones <eric@[...].com>  wrote:
>  Looks to me like Rick's version is simpler and faster.It looks like it
>  offers a speed-up of about 1.6 on my machine over the weave version.  I
>  believe this is because the sorting approach results in quite a few less
>  compares than the algorithm I used.
> 
>  Very cool.  I vote that his version go into numpy.
> 
>  eric
> 
> 
> 
>  Rick White wrote:
>  > On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
>  >
>  >
>  >> I'm trying to generate histograms of extremely large datasets.  I've
>  >> tried a few methods, listed below, all with their own shortcomings.
>  >> Mailing-list archive and google searches have not revealed any
>  >> solutions.
>  >>
>  >
>  > The numpy.histogram function can be modified to use memory much more
>  > efficiently when the input array is large, and the modification turns
>  > out to be faster even for smallish arrays (in my tests, anyway).
>  > Below is a modified version of the histogram function from
>  > function_base.py.  It is almost identical, but it avoids doing the
>  > sort of the entire input array simply by dividing it into blocks.
>  > (It would be even better to avoid the call to ravel too.)  The only
>  > other messy detail is that the builtin range function is shadowed by
>  > the 'range' parameter.
>  >
>  > In my timing tests this is about the same speed for arrays about the
>  > same size as the block size and is faster than the current version by
>  > 30-40% for large arrays.  The speed difference increases as the array
>  > size increases.
>  >
>  > I haven't compared this to Eric's weave function, but this has the
>  > advantages of being pure Python and of being much simpler.  On my
>  > machine (MacBook Pro) it takes about 4 seconds for an array with 100
>  > million elements.  The time increases perfectly linearly with array
>  > size for arrays larger than a million elements.
>  >                                       Rick
>  >
>  > from numpy import *
>  >
>  > lrange = range
>  > def histogram(a, bins=10, range=None, normed=False):
>  >      a = asarray(a).ravel()
>  >      if not iterable(bins):
>  >          if range is None:
>  >              range = (a.min(), a.max())
>  >          mn, mx = [mi+0.0 for mi in range]
>  >          if mn == mx:
>  >              mn -= 0.5
>  >              mx += 0.5
>  >          bins = linspace(mn, mx, bins, endpoint=False)
>  >
>  >      # best block size probably depends on processor cache size
>  >      block = 65536
>  >      n = sort(a[:block]).searchsorted(bins)
>  >      for i in lrange(block,len(a),block):
>  >          n += sort(a[i:i+block]).searchsorted(bins)
>  >      n = concatenate([n, [len(a)]])
>  >      n = n[1:]-n[:-1]
>  >
>  >      if normed:
>  >          db = bins[1] - bins[0]
>  >          return 1.0/(a.size*db) * n, bins
>  >      else:
>  >          return n, bins
>  >
>  > _______________________________________________
>  > Numpy-discussion mailing list
>  > Numpy-discussion@[...].org
>  > http://projects.scipy.org/mailman/listinfo/numpy-discussion
>  >
>  >
> 
>  _______________________________________________
>  Numpy-discussion mailing list
>  Numpy-discussion@[...].org
>  http://projects.scipy.org/mailman/listinfo/numpy-discussion
> 
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@[...].org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Thread:
Cameron Walsh
Rick White
Brian Granger
Eric Jones
Cameron Walsh
Rick White
Eric Jones
Cameron Walsh
Cameron Walsh
Eric Jones
Giorgio Luciano
Sven Schreiber
Christopher Barker
Cameron Walsh
Eric Jones

Privacy Policy | Email Opt-out | Feedback | Syndication
© 2004 ActiveState, a division of Sophos All rights reserved