ruby-talk
Using Nokogiri
by Jzakiya
Nov 8 2009 10:35AM
I'm trying to scrape some data off websites using Nokogiri.

require 'rubygems'
require 'open-uri'
require 'nokogiri'   #using the latest 1.4.0


url = 'http://www.whateverwebsitenameis.org'

doc = Nokogiri::HTML(open(url))

This fetches and parses the page I want to scrape.

The segment of the page I want looks like this (from Firefox's 'view
source'):

-------------------------------------------------------------------------
<h2> Association Detail</h2>

		<div class="sectionHeaderText" style="padding-bottom: 6pt;"> DETAIL DIRECTORY RESULTS</div>

1)		<b> Some Institute name</b><Br><br>
2)		some address<Br>  city, st zip<br>
3)
4)		United States <Br> 
5)
6)			Phone:
7)
8)				(123) 456-7890<Br> 
9)
10)		<br>
11)		Web address: <a href="Http://www.xyz.org" target="_Blank"> www.xyz.org</a><Br>

		<br> <br>

		<A href="javascript:history.back();"> Back to Search Results</a> <br><br>


		<A href="AssociationSearch.cfm"> Search Again</a>

</td> 
---------------------------------------------------------------------------------

I want to scrape and collect the data on lines 1-11, i.e., name,
address, city, st, zip, United States, and phone number, and from
line 11 I want the website URL: 'http://www.xyz.org'

I can find the beginning of this section of code by doing this:

doc.css('h2').each { |elem| puts elem.content }

which displays 'Association Detail'.

I am having problems using this as the starting point for parsing the
data in lines 1-11, which hold the specific 'Association Detail'
details.  I've tried it with 'xpath' and 'search', following the
example here: http://rdoc.info/projects/tenderlove/nokogiri

but there's something I'm just not getting right when I try to get
the info from the other elements.

My system is Windows XP, Ruby 1.8.6, Nokogiri 1.4.0

Thanks in advance for any help.