RE: spidering/crawling/scraping a site..
by Bruce
Oct 27 2005 2:11PM
hi...
decided to try to use WWW::CheckSite::Spider to create/write a
quick spider for the http://jobboardsoftware.biz/demo/admin/login.aspx
site...
i blew it!!!
the following code gives me some sort of hash, but i'm pretty sure i haven't
filled in the login (user/passwd) form correctly...
any thoughts??
i'm not exactly sure what the BA_Mech is doing, or why it might be needed. i
tried to do the form submit directly from the spider, but the perl code
threw an error..
so, basically, i'm guessing!!!
package BA_Mech;
use base 'WWW::Mechanize';

$mech = WWW::Mechanize->new();
my $start1 = "http://jobboardsoftware.biz/demo/admin/login.aspx";
$mech->get($start1);
$mech->submit_form(
    form_name => 'Form1',
    button    => 'Button1',
    fields    => {
        username => 'demo',
        password => 'demo'
    }
);

package Main;
use WWW::CheckSite::Spider;

my $start = "http://jobboardsoftware.biz/demo/admin/login.aspx";
my $sp = WWW::CheckSite::Spider->new(
    ua_class => 'BA_Mech',
    uri      => $start,
);
while (my $page = $sp->get_page)
{
    print $page;
    print "\n";
}
die;
-bruce
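
One likely reason the login has no effect in the code above: the $mech created at the top of BA_Mech is a separate object from the agent the spider builds for itself from ua_class, so the session cookie never reaches the spider. Also, get_page hands back a hash reference, which is why print $page shows something like HASH(0x...). What follows is only a minimal sketch of one way around both points. It assumes the spider creates its agent by calling new on the ua_class, and that the page hash has status and title keys (as the module's synopsis suggests, as far as I recall). Login_Mech and the command-line handling are made-up names for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Login_Mech;    # hypothetical name, any subclass name would do
use base 'WWW::Mechanize';

our $LOGIN_URL = 'http://jobboardsoftware.biz/demo/admin/login.aspx';
our ($USER, $PASS);    # filled in by main before the spider starts

# Log in as part of construction, so whichever object the spider creates
# from this class (assumed: via ->new) already carries the session cookie.
sub new {
    my ($class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    $self->get($LOGIN_URL);
    $self->submit_form(
        form_name => 'Form1',
        button    => 'Button1',
        fields    => { username => $USER, password => $PASS },
    );
    return $self;
}

package main;

# usage: perl spider.pl demo demo
if (@ARGV == 2) {
    ($Login_Mech::USER, $Login_Mech::PASS) = @ARGV;
    require WWW::CheckSite::Spider;
    my $sp = WWW::CheckSite::Spider->new(
        ua_class => 'Login_Mech',
        uri      => $Login_Mech::LOGIN_URL,
    );
    while (my $page = $sp->get_page) {
        # $page is a hash ref; print fields from it, not the ref itself
        print "$page->{status}\t$page->{title}\n";
    }
}
```

Run without arguments the script does nothing, so the class can be loaded and inspected without touching the site.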
-----Original Message-----
From: Thomas, Mark - BLS CTR [mailto:Thomas.Mark@[...].gov]
Sent: Thursday, October 27, 2005 10:34 AM
To: 'bedouglas@earthlink.net'
Subject: RE: spidering/crawling/scraping a site..
OK, there's a difference between using standard HTTP authentication (the
browser dialog box) and form-based authentication (which every site
implements differently). If it is the former, most mirroring tools can do it
already. But if authentication is done through the web app like your
example, you'll have to use Mech.
- Mark.
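
To make the distinction concrete, here is a rough sketch of both styles with WWW::Mechanize. The two helper names are invented for this example; credentials() and submit_form() are the Mechanize methods doing the work. The form name, button and field names are the ones from the demo site used elsewhere in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Standard HTTP authentication: the server answers 401, and the agent
# resends the request with the credentials. No HTML form is involved.
sub fetch_with_http_auth {
    my ($url, $user, $pass) = @_;
    my $mech = WWW::Mechanize->new();
    $mech->credentials($user, $pass);
    $mech->get($url);
    return $mech;
}

# Form-based authentication: load the login page, fill in the site's own
# form and submit it. The session then lives in the agent's cookie jar.
sub fetch_with_form_login {
    my ($login_url, %form_args) = @_;
    my $mech = WWW::Mechanize->new();
    $mech->get($login_url);
    $mech->submit_form(%form_args);
    return $mech;
}

# usage: perl auth.pl form
if (@ARGV and $ARGV[0] eq 'form') {
    my $mech = fetch_with_form_login(
        'http://jobboardsoftware.biz/demo/admin/login.aspx',
        form_name => 'Form1',
        button    => 'Button1',
        fields    => { username => 'demo', password => 'demo' },
    );
    print $mech->uri, "\n";    # where the login left us
}
```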
> -----Original Message-----
> From: bruce [mailto:bedouglas@[...].net]
> Sent: Thursday, October 27, 2005 1:22 PM
> To: Thomas, Mark - BLS CTR
> Subject: RE: spidering/crawling/scraping a site..
>
> thanks!!!
>
> i would have thought that there would have been a bunch of little/big
> apps that were used to parse user/passwd protected login form sites...
>
> guess i was wrong..
>
>
> but this isn't for some evil/take over the world project. i've got a few
> sites that i'm looking at, that are passwd/login protected. rather than
> login to the sites, i thought i'd scrape them, and then compare them
> locally, when i wanted. which is why i was saying i didn't think this was
> going to be more than a few minutes....
>
> -bruce
>
>
> -----Original Message-----
> From: Thomas, Mark - BLS CTR [mailto:Thomas.Mark@[...].gov]
> Sent: Thursday, October 27, 2005 9:51 AM
> To: 'bedouglas@earthlink.net'
> Subject: RE: spidering/crawling/scraping a site..
>
>
> Oh, you want a MIRRORING app that will mirror stuff that you have to log
> in to get! Correct terminology is everything. What nefarious purposes do
> you want that for?
>
> What you need is something that can use a Mech object to spider with. You
> get the Mech object logged in, then pass it to the spider, which does its
> dirty deed.
>
> You're probably the only person in the world that wants to do that. So you
> probably won't find anything that does it out-of-the-box, but I found a
> module that uses a Mech object to mirror a site: WWW::CheckSite::Spider. It
> only takes a start URI, but it's probably a small modification to make it
> accept a Mech object that is already logged in. Then you'd be able to
> scrape all the pages.
>
> Of course writing a Mech spider yourself is an option, and I say it would
> be simple, as there is already an extract_links() function. Just do that on
> every page you visit, push the links onto a @links_to_check array, and keep
> fetching until the array is empty! Probably a 5 minute task.
>
> - Mark.
>
> > -----Original Message-----
> > From: bruce [mailto:bedouglas@[...].net]
> > Sent: Thursday, October 27, 2005 12:27 PM
> > To: Thomas, Mark - BLS CTR
> > Subject: RE: spidering/crawling/scraping a site..
> >
> > mark,
> >
> > i already knew the mech part.. and i know i can write/create a crawler.
> > but that wouldn't take the 5 mins i thought this task would take!!!
> >
> > i was looking for a solution that may have already been created, which
> > was the initial post.
> >
> > i had thought wget would have been suitable, but it has no provision for
> > the user/passwd form. i also thought about using a perl script with mech,
> > and then calling wget to allow the rest of the site to be crawled..
> > didn't work either... so i was looking for an actual crawling app.. if i
> > could find a quick/easy one, i can modify it for my needs, as opposed to
> > writing one...
> >
> > this is what i was looking for, but i really do appreciate your
> > assistance!!
> >
> > -bruce
> >
> >
> > -----Original Message-----
> > From: Thomas, Mark - BLS CTR [mailto:Thomas.Mark@[...].gov]
> > Sent: Thursday, October 27, 2005 9:21 AM
> > To: 'bedouglas@earthlink.net'
> > Subject: RE: spidering/crawling/scraping a site..
> >
> >
> >
> > > but thanks for the laugh! i was referring to a generalizable app that
> > > i could modify to crawl through the site to get the underlying
> > > information, as opposed to the one page.
> >
> > Bruce, I'm telling you, Mechanize can get what you want to get. Trust me.
> >
> > I've added ONE LINE and now it gets the statistics page. See below. Parse
> > what you want out of it.
> >
> > You should try Mechanize! You'll like it! Seriously, it's like a browser
> > with a remote control. You can do ANYTHING a browser can do.
> >
> > P.S.
> > I recommend XML::LibXML to parse the HTML, because it makes extracting
> > the information from HTML very easy. For example, grabbing the "Avg. Job
> > Post Duration" would be this line:
> >
> > print $page->findvalue('//span[@id="avg_duration"]'); # prints "57.5 days"
> >
> >
> > P.P.S.
> > It seems every few weeks you pop up on the list and ask a question for
> > which the answer is WWW::Mechanize. And that's what I tell you. In the
> > future, unless you post Mechanize code you need help with, I'M NOT GOING
> > TO HELP. This is getting tedious.
> >
> > P.P.P.S.
> >
> > Here's the code with the added line you need to get to the statistics
> > page. You can get to other pages too, so don't even think about asking
> > *that* question! X-/
> >
> > #!/usr/bin/perl -w
> >
> > use WWW::Mechanize;
> > my $start_url = 'http://jobboardsoftware.biz/demo/admin/login.aspx';
> >
> > my $mech = WWW::Mechanize->new();
> > $mech->get($start_url);
> > $mech->submit_form(
> >     form_name => 'Form1',
> >     button    => 'Button1',
> >     fields    => {
> >         username => 'demo',
> >         password => 'demo',
> >     },
> > );
> > $mech->follow_link( url_regex => qr/statistics/ );
> > print $mech->content;
> >
> >
>
>
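
Mark's suggestion further up the quoted thread (collect the links on every page you visit, push them onto a @links_to_check array, keep fetching until it is empty) could be sketched roughly like this. It is an illustration under assumptions, not a finished mirroring tool: crawl, $links_of and the page limit of 50 are invented for the example, the links are collected with Mechanize's links() method, and the crawl is kept on the starting host. For a login-protected site the same $mech would first have to submit the login form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Breadth-first walk. $links_of is a code ref that takes a URL and returns
# the URLs linked from it. Returns the visited URLs in visit order.
sub crawl {
    my ($start, $links_of, $max_pages) = @_;
    my @links_to_check = ($start);
    my %seen = ($start => 1);
    my @visited;
    while (@links_to_check and @visited < $max_pages) {
        my $url = shift @links_to_check;
        push @visited, $url;
        for my $link ($links_of->($url)) {
            next if $seen{$link}++;    # queue each URL only once
            push @links_to_check, $link;
        }
    }
    return @visited;
}

# Only touch the network when a start URL is given on the command line.
# usage: perl crawl.pl http://jobboardsoftware.biz/demo/
if (@ARGV) {
    require WWW::Mechanize;
    require URI;
    my $mech = WWW::Mechanize->new(autocheck => 0);
    my $host = lc URI->new($ARGV[0])->host;
    my $links_of = sub {
        my ($url) = @_;
        $mech->get($url);
        return () unless $mech->success and $mech->is_html;
        # keep http(s) links on the starting host, drop #fragments
        return map  { my $u = $_->url_abs->clone; $u->fragment(undef); $u->as_string }
               grep { ($_->url_abs->scheme || '') =~ /^https?$/
                      and lc $_->url_abs->host eq $host }
               $mech->links;
    };
    print "$_\n" for crawl($ARGV[0], $links_of, 50);
}
```

Passing the link fetcher in as a code ref keeps the queue logic separate from Mechanize, so the loop can be tried against a fake link table before pointing it at a real site.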
_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs