perl-ai
Re: text categorization with SVM and NaiveBayes
by Ken Williams
Jan 8 2007 4:20AM
On Jan 5, 2007, at 7:10 AM, zgrim wrote:

>  So, back to my dilemmas. :) The results are puzzling, as many of the
>  research papers on the subject I've consulted say that SVM is
>  supposedly the best algorithm for this task. The radial kernel should
>  give the best results, for empirically found values of gamma and C.

This may be an issue with your corpus - I quite often find that when  
I don't have enough training data for the SVM to pick up on the  
"truth" patterns, or (somewhat equivalently) when there's a lot of  
noise in the data, a linear kernel will outperform a radial (RBF).  I  
tend to think that's because the RBF is more expressive, and it's  
overfitting the noise in the training set.


>  Ignoring the fact that SVM is much, much slower to train than NB, it
>  still has worse accuracy. What am I doing wrong ?

That may be an accident of your corpus too.  Are you using
cross-validation for these experiments?  If so, you should be able to get
some error bars to tell whether the difference is statistically
significant or not.  I'm guessing a 2% advantage may not be, in this
case.
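
To make the error-bar idea concrete, here is a minimal sketch in Python (not Perl, and not part of any of the modules discussed here). It assumes both classifiers were scored on the same cross-validation folds; the function name and the per-fold accuracies are made up for illustration. It reports the mean difference, its standard error, and a paired t statistic:

```python
import math
from statistics import mean, stdev

def paired_fold_comparison(acc_a, acc_b):
    """Compare two classifiers that were evaluated on the same CV folds.

    Returns (mean difference, standard error of the difference, t statistic).
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need the same folds (at least two) for both classifiers")
    # Pair the folds up: the fold-to-fold variation common to both cancels out.
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    # With zero spread the t statistic is undefined; report NaN rather than guess.
    t_stat = mean_diff / std_err if std_err > 0 else float("nan")
    return mean_diff, std_err, t_stat

# Hypothetical per-fold accuracies from a 5-fold run (illustration only).
nb_acc = [0.84, 0.81, 0.86, 0.80, 0.85]
svm_acc = [0.82, 0.80, 0.83, 0.81, 0.82]

mean_diff, std_err, t_stat = paired_fold_comparison(nb_acc, svm_acc)
print(f"mean difference {mean_diff:.3f} +/- {std_err:.3f}, t = {t_stat:.2f}")
```

With 5 folds there are 4 degrees of freedom, so the usual two-sided 5% threshold is |t| > 2.776. The made-up numbers above fall short of that, which is the kind of result that would mean a gap of a couple of percent can't be told apart from noise. Folds share training data, so a test like this tends to be optimistic; treat it as a rough check.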

>  I would happily ignore all this and use NB, but it has one major flaw.
>  "The winner takes it all", the first result returned is way too far
>  (as in distance :)) from the others, which isn't exactly accurate if
>  one cares about a balanced results pool. I don't know whether this is an
>  implementation problem - I poked around the rescale() function in
>  Util.pm with no real success - or a general algorithm problem. My goal
>  is to have an implementation that can say: this text is 60% cat X, 20%
>  cat Y, 18% cat Z and 2% other cats. Is this feasible ? If so, what
>  approach would you recommend (which algorithm, which implementation or
>  what path for implementing it ) ?

Unfortunately, neither NB nor SVMs can really tell you that.  SVMs  
are purely discriminative, so all they can tell you is "I think this  
new example is more like class A than class B in my training data".   
There's no probability involved at all.  That said, I believe there  
has been some research into how to translate SVM output scores into  
probabilities or confidence scores, but I'm not really familiar with it.

NB on the surface would seem to be a better option since it's  
directly based on probabilities, but again the algorithm was designed  
only to discriminate, so all those denominators that are thrown away  
(the "P(words)" terms in the A::NB documentation) mean that the  
notion of probabilities is lost.  The rescale() function is basically  
just a hack to return scores that are a little more convenient to  
work with than the raw output of the algorithm.  As you've seen, it  
tends to be a little arrogant, greatly exaggerating the score for the  
first category and giving tiny scores to the rest.  I'm sure there  
are better algorithms that could be used there, but in many cases  
either one doesn't really care about the actual scores, or one  
(*ahem*) does something ad hoc like taking the square root of all the  
scores, or the fifth root, or whatever, just to get some numbers that  
look better to end users.
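
Here is a minimal sketch of both effects in Python (not Perl, and not the actual rescale() code from Util.pm). It assumes the classifier hands back one unnormalized log score per category; the function names and the numbers are invented for illustration. The first function normalizes the scores so they sum to 1, and the second applies the ad hoc root trick and renormalizes:

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized per-category log scores into values that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating so exp() does not underflow.
    exps = {cat: math.exp(score - top) for cat, score in log_scores.items()}
    total = sum(exps.values())
    return {cat: value / total for cat, value in exps.items()}

def flatten_scores(scores, root=5):
    """Ad hoc smoothing: take the n-th root of each score, then renormalize."""
    rooted = {cat: score ** (1.0 / root) for cat, score in scores.items()}
    total = sum(rooted.values())
    return {cat: value / total for cat, value in rooted.items()}

# Hypothetical log scores for one document (illustration only).
log_scores = {"X": -120.0, "Y": -126.0, "Z": -127.0}

normalized = normalize_log_scores(log_scores)
flattened = flatten_scores(normalized, root=5)

for cat in log_scores:
    print(f"{cat}: normalized {normalized[cat]:.4f}, fifth root {flattened[cat]:.4f}")
```

Because the per-word log probabilities add up over the whole document, a gap of a few units between log scores already puts nearly all of the normalized mass on the top category. Taking a root shrinks that gap and keeps the ranking the same, but the choice of root is arbitrary, so the flattened numbers are for display and shouldn't be read as real probabilities.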

As for a better alternative, I'm not familiar with any that will be
as accessible from the Perl world, but you might want to look at some
language modeling papers - I really like the LDA papers from Michael
Jordan (no, not that Michael Jordan, this one:
http://citeseer.ist.psu.edu/541352.html), which are by no means
straightforward, but they will indeed let you describe each document
as generated by a mixture of categories.

  -Ken