[Tutor] Parsing HTML file
by Chris Heisel
Dec 11 2003 9:12PM
Hi,
I'm working on a Python script that will go through a series of
directories and parse some HTML files.
I'd like to be able to read the HTML and extract certain components and
put them into a MySQL database.
For instance, in these files there will be a document title like this:
<h2 class="header"> This is the documents header</h2>
There would be content marked like this:
<!--START CONTENT-->
<p> Some content</p>
<p> Some more content</p>
<h4> A sub head</h4>
<p> Again</p>
<!--END CONTENT-->
What is the best way to approach this problem?
I was reading up on htmllib and HTMLParser. Should I use them or do some
regexp searches of the files for something like '<h2 class="header"> *</h2>'?
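
To make the regexp option concrete, here is a minimal sketch, assuming each file has a single header and a single START/END CONTENT pair. The names HEADER_RE, CONTENT_RE and extract_with_regex are made up for illustration, and regexps like these are likely to break on markup that deviates from the samples above.

```python
import re

# Hypothetical patterns based only on the sample markup in this post.
# DOTALL lets ".*?" span line breaks; "?" keeps the match non-greedy.
HEADER_RE = re.compile(r'<h2\s+class="header"\s*>(.*?)</h2>',
                       re.IGNORECASE | re.DOTALL)
CONTENT_RE = re.compile(r'<!--START CONTENT-->(.*?)<!--END CONTENT-->',
                        re.DOTALL)


def extract_with_regex(html):
    """Return (title, content) from one HTML string; None where not found."""
    header = HEADER_RE.search(html)
    content = CONTENT_RE.search(html)
    return (
        header.group(1).strip() if header else None,
        content.group(1).strip() if content else None,
    )
```

Calling extract_with_regex(open(path).read()) on each file would give a (title, content) pair ready to be inserted into the database.
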
If I should use htmllib and HTMLParser, are there any suggestions on their use?
I gather that I can set event handlers for, say, an <h2> tag, but can I
set event handlers for classes, like <h2 class="header">, or for blocks
of comments like <!--START CONTENT--> and <!--END CONTENT-->?
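
For reference, a rough sketch of what such handlers might look like, assuming the Python 3 html.parser module (the modern home of HTMLParser). There is no per-class handler; the usual approach is to inspect the attributes inside handle_starttag, and comments arrive through handle_comment. The ContentExtractor class and its attribute names are hypothetical.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the <h2 class="header"> text and the markup between the
    START CONTENT and END CONTENT comments (illustrative sketch only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.content_parts = []
        self._in_header = False
        self._in_content = False
        self._title_chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; check the class here.
        if tag == "h2" and dict(attrs).get("class") == "header":
            self._in_header = True
            self._title_chunks = []
        elif self._in_content:
            # Keep the start tag exactly as it appeared in the source.
            self.content_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_header:
            self._in_header = False
            self.title = "".join(self._title_chunks).strip()
        elif self._in_content:
            self.content_parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Note: character references arrive already decoded here.
        if self._in_header:
            self._title_chunks.append(data)
        elif self._in_content:
            self.content_parts.append(data)

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters.
        marker = data.strip()
        if marker == "START CONTENT":
            self._in_content = True
        elif marker == "END CONTENT":
            self._in_content = False
```

After parser = ContentExtractor(), parser.feed(html) and parser.close(), the title is in parser.title and the content is "".join(parser.content_parts).
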
In a perfect world I would have gotten all this data in an XML format,
which would make my life easier, but the files are already there in HTML
and I've got to figure out how to extract some of the semantic content
and stuff it into a MySQL DB...
Many, many thanks in advance for your help,
Chris
_______________________________________________
Tutor maillist - Tutor@[...].org
http://mail.python.org/mailman/listinfo/tutor