ASPN ActiveState Programmer Network
ActiveState
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups


Recent Messages
List Archives
About the List
List Leaders
Subscription Options

View Subscriptions
Help

View by Topic
ActiveState
.NET Framework
Open Source
Perl
PHP
Python
Tcl
Web Services
XML & XSLT

View by Category
Database
General
SOAP
System Administration
Tools
User Interfaces
Web Programming
XML Programming


MyASPN >> Mail Archive >> perl-ai
perl-ai
Re: text categorization with SVM and NaiveBayes
by Tom Fawcett other posts by this author
Jan 8 2007 12:23PM messages near this date
Re: text categorization with SVM and NaiveBayes | Re: text categorization with SVM and NaiveBayes
On Jan 7, 2007, at 9:23 PM, Ken Williams wrote:
> > I would happily ignore all this and use NB, but it has one major  
> > flaw.
> > "The winner takes it all", the first result returned is way too far
> > (as in distance :)) from the others, which isn't exactly accurate if
> > one cares of a balanced results pool. I don't know whether this is an
> > implementation problem - I poked around the rescale() function in
> > Util.pm with no real success - or a general algorithm problem. My  
> > goal
> > is to have an implementation that can say: this text is 60% cat X,  
> > 20%
> > cat Y, 18% cat Z and 2% other cats. Is this feasible ? If so, what
> > approach would you recommend (which algorithm, which  
> > implementation or
> > what path for implementing it ) ?
> 
>  Unfortunately, neither NB nor SVMs can really tell you that.  SVMs  
>  are purely discriminative, so all they can tell you is "I think  
>  this new example is more like class A than class B in my training  
>  data".  There's no probability involved at all.  That said, I  
>  believe there has been some research into how to translate SVM  
>  output scores into probabilities or confidence scores, but I'm not  
>  really familiar with it.
> 
>  NB on the surface would seem to be a better option since it's  
>  directly based on probabilities, but again the algorithm was  
>  designed only to discriminate, so all those denominators that are  
>  thrown away (the "P(words)" terms in the A::NB documentation) mean  
>  that the notion of probabilities is lost.  The rescale() function  
>  is basically just a hack to return scores that are a little more  
>  convenient to work with than the raw output of the algorithm.  As  
>  you've seen, it tends to be a little arrogant, greatly exaggerating  
>  the score for the first category and giving tiny scores to the  
>  rest.  I'm sure there are better algorithms that could be used  
>  there, but in many cases either one doesn't really care about the  
>  actual scores, or one (*ahem*) does something ad hoc like taking  
>  the square root of all the scores, or the fifth root, or whatever,  
>  just to get some numbers that look better to end users.

Just to add a note here: Ken is correct -- both NB and SVMs are known  
to be rather poor at providing accurate probabilities.  Their scores  
tend to be too extreme.  Producing good probabilities from these  
scores is called calibrating the classifier, and it's more complex  
than just taking a root of the score.  There are several methods for  
calibrating scores.  The good news is that there's an effective one  
called isotonic regression (or Pool Adjacent Violators) which is  
pretty easy and fast.  The bad news is that there's no plug-in (ie,  
CPAN-ready) perl implementation of it (I've got a simple  
implementation which I should convert and contribute someday).

If you want to read about classifier calibration, google one of these  
titles:

"Transforming classifier scores into accurate multiclass probability  
estimates"
by Bianca Zadrozny and Charles Elkan

"Predicting Good Probabilities With Supervised Learning"
by A. Niculescu-Mizil and R. Caruana

Regards,
-Tom
Thread:
Zgrim
Ken Williams
Tom Fawcett
Ken Williams

Privacy Policy | Email Opt-out | Feedback | Syndication
© 2004 ActiveState, a division of Sophos All rights reserved