python-Tutor
[Tutor] Parsing HTML file
by Chris Heisel
Dec 11 2003 9:12PM
Hi,

I'm working on a Python script that will go through a series of 
directories and parse some HTML files.

I'd like to be able to read the HTML and extract certain components and 
put them into a MySQL database.

For instance, in these files there will be a document title like this:
<h2 class="header"> This is the documents header</h2>

There would be content marked like this:
<!--START CONTENT--> 
<p> Some content</p>
<p> Some more content</p>
<h4> A sub head</h4>
<p> Again</p>
<!--END CONTENT--> 

I'm wondering what the best way to approach this problem is.

I was reading up on htmllib and HTMLParser. Should I use them or do some 
regexp searches of the files for "<h2 class="header"> *</h2>"?
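
A minimal sketch of the regexp route, assuming the markup always looks exactly like the samples above (same attribute order, same quoting, same comment text). The names SAMPLE, TITLE_RE, CONTENT_RE and extract_with_regex are made up for illustration, and the patterns will miss any file that deviates from that shape:

```python
import re

# Hypothetical sample in the shape described in this post.
SAMPLE = """<h2 class="header"> This is the documents header</h2>
<!--START CONTENT-->
<p> Some content</p>
<!--END CONTENT-->"""

# Non-greedy groups; DOTALL lets the captured body span several lines.
TITLE_RE = re.compile(r'<h2 class="header">(.*?)</h2>', re.DOTALL)
CONTENT_RE = re.compile(r'<!--START CONTENT-->(.*?)<!--END CONTENT-->', re.DOTALL)

def extract_with_regex(html):
    """Return (title, content); either is None when its pattern is absent."""
    title = TITLE_RE.search(html)
    content = CONTENT_RE.search(html)
    return (
        title.group(1).strip() if title else None,
        content.group(1).strip() if content else None,
    )

title, content = extract_with_regex(SAMPLE)
```

This works only as long as every file follows the same template; a parser-based approach tolerates more variation in whitespace and attributes.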

If I should use htmllib and HTMLParser, are there any suggestions on their use?

I gather that I can set event handlers for, say, an <h2> tag, but can I 
set event handlers for classes, like <h2 class="header">, or for blocks 
of comments like <!--START CONTENT--> and <!--END CONTENT-->?
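
One possible shape for this, sketched with the HTMLParser class as it exists in the Python 3 standard library (html.parser), which is an assumption about the Python version in use. There is no per-class handler; instead the start-tag handler receives the attributes and can check the class itself, and comments arrive through handle_comment. The names DocExtractor and extract_document are hypothetical:

```python
from html.parser import HTMLParser

class DocExtractor(HTMLParser):
    """Collect the <h2 class="header"> text and the marked content block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.content_parts = []
        self._in_title = False
        self._in_content = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; filter on the class here.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "header" in classes:
            self._in_title = True
        elif self._in_content:
            # Keep the original markup of tags inside the content block.
            self.content_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self._in_title = False
        elif self._in_content:
            self.content_parts.append("</%s>" % tag)

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters.
        marker = data.strip()
        if marker == "START CONTENT":
            self._in_content = True
        elif marker == "END CONTENT":
            self._in_content = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_content:
            self.content_parts.append(data)

def extract_document(html):
    """Return (title, content) as stripped strings."""
    parser = DocExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.title_parts).strip(), "".join(parser.content_parts).strip()

SAMPLE = """<h2 class="header"> This is the documents header</h2>
<!--START CONTENT-->
<p> Some content</p>
<h4> A sub head</h4>
<!--END CONTENT-->"""

title, content = extract_document(SAMPLE)
```

The returned pair could then be passed to whatever database layer is in use; that step is left out here because it depends on the MySQL driver and table layout.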

In a perfect world I would have gotten all this data in an XML format, 
which would make my life easier, but the files are already there in HTML 
and I've got to figure out how to extract some of the semantic content 
and stuff it into a MySQL DB...

Many, many thanks in advance for your help,

Chris
