python-list
Re: HTMLParser rejects real-life tagsoup
by Thanos Vassilakis
Feb 18 2003 12:38AM
In the real world, most HTML templates or pages created by designers are
non-standard and will break strict parsers. That is why we at NYSE use our own
HTML parsers for templating. The first was based on scriptfoundry's tagparser,
which was very fast and easier to use than HTMLParser; now we use pso.parser,
which is fast, elegant and robust.

http://sourceforge.net/projects/pso/
and see docs at:
http://sourceforge.net/docman/?group_id=49265
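
To illustrate what a tolerant parser does with tag soup, here is a minimal
sketch. It does not use pso.parser (whose API is not shown in this message);
it assumes a current Python 3, where the standard library's html.parser no
longer rejects malformed markup. The names TagCollector, parse_tagsoup and
SOUP are made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags, end tags and text without checking nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs so the event list stays readable.
        text = data.strip()
        if text:
            self.events.append(("text", text))


def parse_tagsoup(markup):
    """Feed possibly malformed HTML and return the recorded events."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.events


# Hypothetical input: unclosed <p>, crossed <b>/<i>, stray <td>.
SOUP = "<p>Unclosed paragraph<b>bold <i>crossed</b> tags</i><br><td>stray cell"
events = parse_tagsoup(SOUP)

for event in events:
    print(event)
```

The parser only reports what it sees, in document order. Deciding how to
repair the crossed or unclosed tags is left to the calling code, which is
the part a templating parser or a tidy step would handle.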

thanos



                                                                                            
                                   
Rene Pijlman <reageer.in@[...].nieuwsgroep>
Sent by: python-list-admin@python.org
02/12/2003 05:09 PM

To:      python-list@[...].org
cc:
Subject: Re: HTMLParser rejects real-life tagsoup




Gerhard Häring:
> Rene Pijlman wrote:
> > I've been using the HTMLParser module to process external web
> > pages that I don't control. HTMLParser seems to be rather strict
> > [...]
> > Any suggestions on how to handle this? [...]
> 
> I'd try tidying up the HTML first:
> http://www.lemburg.com/files/python/mxTidy.html

Great idea, it works fine now. Thanks!

--
René Pijlman

What do you want to learn?  http://www.leren.nl
--
http://mail.python.org/mailman/listinfo/python-list






