ASPN ActiveState Programmer Network
python-list
getting a submatrix of all true
by John Hunter
Jul 2 2003 7:16PM
I have a largish data set (1000 observations x 100 floating point
variables), and some of the data are missing.  I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution.  I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values.  I choose the threshold for dropping a row by eyeballing the
distribution of the number of missing variables per observation,
picking a number on the low end of the distribution, and dropping the
rows that exceed the threshold.
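
Rather than eyeballing the distribution, one option is to try every
possible row threshold and keep whichever gives the largest
numRows*numCols.  The sketch below is a hedged illustration of that
idea in plain Python, not the code used in this post; the function
names reduce_by_threshold and best_threshold are made up for the
example, and the input is assumed to be a list of rows of booleans (or
0/1) with True marking a missing value.

```python
def reduce_by_threshold(missing, max_missing):
    """Keep rows with at most max_missing True entries, then keep only
    the columns that have no True entry in the kept rows.
    Returns (kept_row_indices, kept_column_indices)."""
    n_cols = len(missing[0]) if missing else 0
    rows = [i for i, row in enumerate(missing) if sum(row) <= max_missing]
    cols = [j for j in range(n_cols)
            if not any(missing[i][j] for i in rows)]
    return rows, cols


def best_threshold(missing):
    """Try every row threshold from 0 to the number of columns and
    return (area, threshold, rows, cols) for the largest area found."""
    n_cols = len(missing[0]) if missing else 0
    best = (0, None, [], [])
    for threshold in range(n_cols + 1):
        rows, cols = reduce_by_threshold(missing, threshold)
        area = len(rows) * len(cols)
        if area > best[0]:  # strict comparison keeps the smallest winning threshold
            best = (area, threshold, rows, cols)
    return best
```

This still only searches the family of solutions reachable by a single
row threshold, so it is a heuristic and may miss the true optimum.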

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left?
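
For very small matrices the question can be answered by exhaustive
search, which is useful as a reference when checking a heuristic.  The
sketch below is an illustration under that assumption only: it
enumerates every subset of rows, so the cost grows as 2**numRows and it
is not usable on a 1000-row data set.  The function name
largest_clean_submatrix is hypothetical.  It relies on one observation:
once the set of kept rows is fixed, the best choice of columns is simply
every column with no True in those rows.

```python
from itertools import combinations


def largest_clean_submatrix(missing):
    """Exhaustively search row subsets of a small boolean matrix.

    Returns (area, kept_rows, kept_cols) where the kept rows and columns
    contain no True value and area = len(kept_rows) * len(kept_cols) is
    maximal.  Exponential in the number of rows."""
    n_rows = len(missing)
    n_cols = len(missing[0]) if missing else 0
    best_area, best_rows, best_cols = 0, (), ()
    for k in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), k):
            # For fixed rows, keep every column that is clean in all of them.
            cols = tuple(j for j in range(n_cols)
                         if not any(missing[i][j] for i in rows))
            area = len(rows) * len(cols)
            if area > best_area:
                best_area, best_rows, best_cols = area, rows, cols
    return best_area, best_rows, best_cols
```

Applied to the 5 x 5 example in this post, the search reports 16
elements (drop the first row and the middle column).  That is larger
than the 15-element candidate, because the kept columns do not have to
be adjacent.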


Example:
   
  0 0 0 
  0 0 0 candidate sub matrix has 12 elements
  0 0 0       
  0 0 0 

1 0 0 0 1
0 0 0 0 0    0 0 0 0 0    
0 0 0 0 0    0 0 0 0 0  candidate submatrix has 15 elements
0 0 0 0 0    0 0 0 0 0 
0 0 1 0 0   

      0 0
      0 0 candidate submatrix has 8 elements
      0 0 
      0 0 

I want to programmatically extract the 15 element matrix.

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

    from Numeric import nonzero, array, take, sum

    X = array([[1, 0, 0, 0, 1],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 1, 0, 0]])

    goodObsInd = nonzero(sum(X,1)<2)  # observations with < 2 missing variables
    X = take(X, goodObsInd)           # drop the bad

    goodVarInd = nonzero(sum(X)==0)   # variables with no missing data
    X = take(X, goodVarInd, 1 )       # drop the bad variables

    print X
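
For readers without the Numeric package, the same steps can be sketched
with NumPy.  This is a hedged translation rather than the code from the
post: it assumes NumPy is installed, and the names good_obs, good_vars,
X_rows and X_clean are chosen for the example.

```python
import numpy as np

# True/1 marks a missing value, as in the example above.
X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])

# Observations (rows) with fewer than 2 missing variables.
good_obs = np.flatnonzero(X.sum(axis=1) < 2)
X_rows = X[good_obs]                      # drop the bad observations

# Variables (columns) with no missing data among the kept rows.
good_vars = np.flatnonzero(X_rows.sum(axis=0) == 0)
X_clean = X_rows[:, good_vars]            # drop the bad variables

print(X_clean)
```

With a threshold of 2, only the first row is dropped, and then only the
middle column, so on this input the result is a 4 x 4 block of zeros
(16 elements).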


John Hunter
