python-tutor
Re: [Tutor] filtering a webpage for plucking to a Palm
by Kent Johnson
Jun 26 2005 6:32AM
Brian van den Broek wrote:
>  Hi all,
>  
>  I have a Palm handheld, and use the excellent (and written in Python) 
>  Plucker <http://www.plkr.org/> to spider webpages and format the 
>  results for viewing on the Palm.
>  
>  One site I 'pluck' is the Daily Python URL 
>  <http://www.pythonware.com/daily/>. From the point of view of a daily 
>  custom 'newspaper' everything but the last day or two of URLs is so 
>  much cruft. (The cruft would be the total history of the last 
>  seven-ish days, the navigation links for www.pythonware.com, etc.)
>  
>  Today, I wrote a script to parse the Daily URL, and create a minimal 
>  local HTML page including nothing but the last n items, n links, or 
>  last n days' worth of links. (Which one is used is a user option.) 
>  Then, I pluck that, rather than the actual Daily URL site. Works 
>  great. :-)  (If anyone on the list is a fellow plucker'er and would be 
>  interested in my script, I'm happy to share.)
>  
>  In anticipation of wanting to do the same thing to other sites, I've 
>  spent a bit of time abstracting it. I've made some real progress. But, 
>  before I finish up, I've a voice in the back of my head asking if 
>  maybe I'm re-inventing the wheel.
>  
>  To my shame, I've not spent very much time at all exploring available 
>  frameworks and modules for any domain, and almost none for web-related 
>  tasks. So, does anyone know of any modules or frameworks which would 
>  make the sort of task I am describing easier?
>  
>  The difficulty in making my routine general is that pretty much each 
>  site will need its own code for identifying what counts as a distinct 
>  item (such as a URL and its description in the Daily URL) and what 
>  counts as a distinct block of items (such as a day's worth of Daily URL 
>  items). I can't imagine there's a way around that, but if someone else 
>  has done much of the work in setting up the general structure to be 
>  tweaked for each site, that'd be good to know. (Doesn't feel like one 
>  that would be googleable.)

Beautiful Soup can help with parsing and accessing the web page. You could certainly write your plucker on top of it.
http://www.crummy.com/software/BeautifulSoup/
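
To make the idea concrete, here is a minimal sketch of the "keep only the first n links" step. It does not use Beautiful Soup itself (that is a third-party package with its own, friendlier API); instead it swaps in the standard library's html.parser so it runs as-is. The names LinkCollector, first_links and minimal_page are made up for illustration, and it assumes the page lists its newest items first.

```python
from html import escape
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs, in document order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:         # skip anchors without an href, e.g. <a name=...>
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def first_links(html, n):
    """Return the first n links in html (hypothetical helper)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links[:n]


def minimal_page(links):
    """Build a bare-bones HTML page that lists only the given links."""
    items = "\n".join(
        '<li><a href="%s">%s</a></li>' % (escape(href), escape(text))
        for href, text in links
    )
    return "<html><body><ul>\n%s\n</ul></body></html>" % items
```

Writing the result of minimal_page(first_links(page_source, n)) to a local file would give something small to pluck in place of the live site. What counts as an "item" is still site-specific; this sketch simply treats every link as one.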

Alternatively, ElementTidy might help. It can parse web pages and it has limited XPath support. XPath might be a good language for expressing your plucking rules.
http://effbot.org/zone/element-tidylib.htm
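
As a rough illustration of expressing plucking rules as paths, the sketch below keeps one path for "block" and one for "item" per site and applies them with the limited XPath subset in the standard library's xml.etree.ElementTree. It assumes the page has already been tidied into well-formed XHTML without an XML namespace; the sample markup, the RULES dictionary and the pluck function are all invented for the example and do not describe any real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied page: each day is a <div class="day">
# holding one <p class="item"> per link.
PAGE = """\
<html><body>
  <div class="nav"><a href="/">Home</a></div>
  <div class="day"><h2>2005-06-26</h2>
    <p class="item"><a href="http://example.org/a">A</a> first item</p>
    <p class="item"><a href="http://example.org/b">B</a> second item</p>
  </div>
  <div class="day"><h2>2005-06-25</h2>
    <p class="item"><a href="http://example.org/c">C</a> third item</p>
  </div>
</body></html>
"""

# Per-site rules: one path selects the blocks, the other selects the
# items inside a block. Only these two strings change from site to site.
RULES = {
    "block": ".//div[@class='day']",
    "item": "./p[@class='item']/a",
}


def pluck(xhtml, rules, n_blocks):
    """Return item hrefs from the first n_blocks blocks, grouped by block."""
    root = ET.fromstring(xhtml)
    blocks = root.findall(rules["block"])[:n_blocks]
    return [
        [a.get("href") for a in block.findall(rules["item"])]
        for block in blocks
    ]
```

With the sample page, pluck(PAGE, RULES, 1) returns [['http://example.org/a', 'http://example.org/b']], and the navigation link is ignored because it matches neither path.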

An ideal package would be one that parses real-world HTML and has full XPath support, but I don't know of such a thing... maybe amara or lxml?

Kent
