Re: HTMLParser rejects real-life tagsoup
by Thanos Vassilakis
Feb 18 2003 12:38AM
In the real world, most HTML templates or pages created by designers are
non-standard and will break parsers. That is why we at NYSE used our own HTML
parsers for templating. The first was based on scriptfoundry's tagparser, which
was very fast and easier to use than HTMLParser; now we use pso.parser.
This is fast, elegant and robust.
http://sourceforge.net/projects/pso/
and see docs at:
http://sourceforge.net/docman/?group_id=49265
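For anyone reading this on a current Python 3, here is a minimal sketch of lenient tag-soup handling. It assumes the standard library's html.parser module (the successor to the HTMLParser module discussed in this thread), which accepts malformed markup rather than raising. It is not pso.parser or tagparser; the names TagSoupCollector and parse_soup are made up for illustration.

```python
from html.parser import HTMLParser


class TagSoupCollector(HTMLParser):
    """Hypothetical helper: record start tags and text, ignoring nesting errors."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []   # (tag, attrs) pairs in document order
        self.text = []   # non-blank text chunks, stripped

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def parse_soup(markup):
    """Feed markup to the collector and flush any buffered text."""
    parser = TagSoupCollector()
    parser.feed(markup)
    parser.close()
    return parser


# Unclosed <b>, <i> and <a>, mixed-case tags, an unquoted attribute value.
soup = "<P>Hello <b>world<i>!</P><br><a href=page.html>link"
result = parse_soup(soup)
print([tag for tag, _ in result.tags])
print(result.text)
```

The collector never checks that tags are balanced, so the parser only reports what it sees. Whether that is enough for templating depends on how much structure you need to recover from the page.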
thanos
From: Rene Pijlman <reageer.in@[...].nieuwsgroep>
Sent by: python-list-admin@python.org
Date: 02/12/2003 05:09 PM
To: python-list@[...].org
cc:
Subject: Re: HTMLParser rejects real-life tagsoup
Gerhard Häring:
> Rene Pijlman wrote:
> > I've been using the HTMLParser module to process external web
> > pages that I don't control. HTMLParser seems to be rather strict
> > [...]
> > Any suggestions on how to handle this? [...]
>
> I'd try tidying up the HTML first:
> http://www.lemburg.com/files/python/mxTidy.html
Great idea, it works fine now. Thanks!
--
René Pijlman
What do you want to learn? http://www.leren.nl
--
http://mail.python.org/mailman/listinfo/python-list