numpy-discussion
[Numpy-discussion] Histograms of extremely large data sets
by Cameron Walsh
Dec 12 2006 7:27PM
Hi all,

I'm trying to generate histograms of extremely large datasets.  I've
tried a few methods, listed below, each with its own shortcomings.
Searching the mailing-list archive and Google has not revealed any
solutions.

Method 1:

import numpy
import pylab

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace the line above with actual data samples; the size and type
# are correct.

histogram = pylab.hist(data, bins=range(0,256))
pylab.xlim(0,256)
pylab.show()

The problem with this method is that it appears never to finish at
this size.  It is, however, extremely fast for smaller data sets: a
5x1000x1000 array takes 1-2 seconds, whereas 500x1000x1000 does not
complete.


Method 2:

import numpy
import pylab

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace the line above with actual data samples; the size and type
# are correct.

# Count each value by hand: one bin per possible uint8 value.
bins=numpy.zeros((256),dtype="uint32")
for val in data.flat:
    bins[val]+=1
barchart = pylab.bar(xrange(256),bins,align="center")
pylab.xlim(0,256)
pylab.show()

The problem with this method is that it is incredibly slow, taking up
to 30 seconds for a 1x1000x1000 sample.  I have neither the patience
nor the inclination to time a 500x1000x1000 sample.
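
The counting loop in Method 2 can also be written without a
Python-level loop over every element.  The following is only a sketch,
assuming a NumPy version whose numpy.bincount accepts a minlength
argument; the function name chunked_histogram and the small random
sample are hypothetical stand-ins for illustration, not the real data
or a timed result.  It counts one slab along the first axis at a time
so that temporary memory stays at roughly one slab:

```python
import numpy as np

def chunked_histogram(data, nbins=256):
    """Count occurrences of each integer value in [0, nbins), slab by slab."""
    counts = np.zeros(nbins, dtype=np.int64)
    # Iterating over the array walks the first axis, e.g. 489 slabs of
    # shape (1000, 1000) for the full data set.
    for slab in data:
        counts += np.bincount(slab.ravel(), minlength=nbins)
    return counts

# Small stand-in for the real (489, 1000, 1000) uint8 volume (assumption).
rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(4, 100, 100), dtype=np.uint8)
counts = chunked_histogram(sample)
```

The resulting counts array could then be passed to pylab.bar in place
of bins in Method 2.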


Method 3:

import numpy

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace the line above with actual data samples; the size and type
# are correct.

a=numpy.histogram(data,256)


The problem with this one is:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/local/lib/python2.5/site-packages/numpy/lib/function_base.py", line 96, in histogram
   n = sort(a).searchsorted(bins)
ValueError: dimensions too large.


It seems that iterating over the entire array and doing it manually is
the slowest possible method, but that the rest are not much better.
Is there a faster method available, or do I have to implement method 2
in C and submit the change as a patch?

Thanks and best regards,

Cameron.
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@[...].org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
