xml-dev
Re: [xml-dev] HTML parser
by Niels Peter Strandberg
Mar 4 2002 9:09PM
I forgot one (or two) more HTML parsers:

You have Anders Kristensen's HEX -
http://www-uk.hpl.hp.com/people/sth/java/hex.html. It is quite old. It
uses SAX 1 to build a DOM tree. I updated it to SAX2 and the Xerces DOM,
but then started on my own project.

HEX enforces well-formedness while it builds the DOM tree. I moved that
logic into an XMLFilter, which lets you enforce well-formedness on the
SAX stream itself. Anders is not at HP anymore; he works for another
company, so the mail address won't work!
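
To make the XMLFilter idea concrete, here is a minimal sketch of a filter
that balances start/end element events on a SAX stream. It is not HEX's
code or the filter from my own project: the class name BalancingFilter,
the demo() helper and the repair rules (close whatever is still open,
drop stray end tags) are all assumptions for illustration. It knows
nothing about HTML-specific rules such as empty or optional tags, and it
uses only the standard org.xml.sax classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: keeps start/end element events balanced. */
class BalancingFilter extends XMLFilterImpl {
    // Stack of the qualified names of the elements currently open.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        open.push(qName);
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (!open.contains(qName)) {
            return; // stray end tag with no matching start: drop it
        }
        // Close every element still open inside the one being ended,
        // then the element itself. Namespaces are ignored (assumption).
        String top;
        do {
            top = open.pop();
            super.endElement("", top, top);
        } while (!top.equals(qName));
    }

    @Override
    public void endDocument() throws SAXException {
        // Close whatever is still open before the document ends.
        while (!open.isEmpty()) {
            String top = open.pop();
            super.endElement("", top, top);
        }
        super.endDocument();
    }

    /** Pushes an unbalanced event sequence through the filter; returns it as text. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        BalancingFilter filter = new BalancingFilter();
        filter.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        Attributes none = new AttributesImpl();
        try {
            filter.startDocument();
            filter.startElement("", "p", "p", none);
            filter.startElement("", "b", "b", none);
            filter.characters("hi".toCharArray(), 0, 2);
            filter.endElement("", "p", "p");           // <b> was never closed
            filter.endElement("", "i", "i");           // stray end tag
            filter.startElement("", "br", "br", none); // never closed
            filter.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

In real use the filter would sit between a tolerant HTML tokenizer that
emits raw SAX events and whatever consumes them (a DOM builder, an XSLT
transformer), so nothing downstream sees an unbalanced stream.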

IBM has a "system", ANDES, which is (or was, I don't know) used to parse
HTML pages (and to do a lot of other interesting things). From the papers
I read, it sounded just like the tool I wanted, but I could not find any
information on IBM's sites. Does anyone have any info about ANDES?

Niels Peter


On Monday, March 4, 2002, at 08:12 PM, Niels Peter Strandberg wrote:

>  In Java you have JTidy - http://lempinen.net/sami/jtidy/ or 
>  http://sourceforge.net/projects/jtidy/
>  It builds its own W3C DOM tree, but you can traverse the tree to 
>  generate SAX events, or build a new Xerces or JDOM tree from the SAX 
>  events. But Tidy doesn't handle duplicate attributes, among other things.
> 
>  In C you have Tidy for all major platforms, and it is very fast. GUIs 
>  exist. It can be found here - http://tidy.sourceforge.net/
> 
>  In Java, Andy Clark, an IBM Xerces programmer, has made a "preview" of 
>  an HTML parser using the new Xerces XNI. He posted the source code to 
>  the Xerces mailing list. Andy Clark is a parser pro, so he knows what 
>  he is doing.
> 
>  I'm also working on an HTML parser, but it is too early to talk about. 
>  Parsing HTML documents is often about capturing information from a page, 
>  and I find myself using XSLT, XMLFilters etc. to extract data; it 
>  is powerful but not very simple.
> 
>  HTML parsing is not always about well-formedness, but about extracting 
>  information using regular expressions or simple text patterns. Then you 
>  have XPath and XSL, which require a well-formed (X)HTML document to 
>  work, and that requires building DOM trees, which is a memory and speed 
>  problem. Much more could be said on this...
> 
>  Digital (now Compaq) tried to make a "web language" that you can use to 
>  fetch pages from the web and extract data. Take a look at it at - 
>  http://www.research.compaq.com/SRC/WebL/. There are problems with Java 
>  1.3; you need to make some small changes to the source code (I'm running 
>  it on Java 1.3 on Mac OS X).
> 
>  Niels Peter
> 
> 
>  On Monday, March 4, 2002, at 06:24 PM, Alexey N. Shananin wrote:
> 
>  >Hi!
>  >I'm looking for a parser for HTML.
>  >I know that XML parsers can't correctly handle HTML tags because 
>  >these tags might be unclosed (I mean a <br> tag rather than <br/>, 
>  >for example...).
>  >I heard about the XHTML standard. It's supported by XML parsers, as far 
>  >as I
> 
> 
>  -----------------------------------------------------------------
>  The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>  initiative of OASIS <http://www.oasis-open.org>
> 
>  The list archives are at http://lists.xml.org/archives/xml-dev/
> 
>  To subscribe or unsubscribe from this list use the subscription
>  manager: <http://lists.xml.org/ob/adm.pl>
> 

