perl-xml
Problem timing out XML::LibXML parse_html_string call
by Sam Tregar
Feb 3 2009 11:44AM
Hello all. I'm using XML::LibXML to parse some HTML. Mostly it's working
great - fast and very useful XPath support. My problem is that it's choking
on some very bad HTML in a very bad way - it's sitting on the CPU until
killed manually. I expected some HTML wouldn't parse, so this isn't such a
tragedy. What is a big problem is that my attempts to work around this with
alarm() aren't working!
Here's my code:
    use strict;
    use warnings;
    use XML::LibXML;

    my $html = do { local $/; <> };
    my $libxml = XML::LibXML->new();
    #$libxml->recover(2);
    eval {
        local $SIG{ALRM} = sub { die "TIMEOUT\n" };
        alarm(10);
        $libxml->parse_html_string($html);
        alarm(0);
    };
    if ($@ and $@ eq "TIMEOUT\n") {
        warn "Timed out ok.\n";
    } elsif ($@) {
        die $@;
    }
If I replace the parse call with sleep(20) then it works as expected - the
alarm triggers and the timeout is caught. If I run it as-is with my sample
HTML then it never stops until killed. If you want to play along at home
here's the test file:
http://sam.tregar.com/libxml-fail.html
BEWARE: that's some really bad HTML and it not only breaks XML::LibXML but
it also crashed Firefox on me. You probably don't want to load it in your
browser.
I've never had alarm() fail like this. Is there an alternative I can try?
Any other ideas about how to handle this?
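
One thing that may be worth trying: since Perl 5.8, handlers installed through %SIG are deferred ("safe signals") and only run between Perl opcodes. That would explain why sleep(20) is interrupted while a parse stuck inside libxml2's C code never sees the alarm. Handlers installed with POSIX::sigaction are not deferred unless asked to be. The sketch below assumes that behaviour. The helper name run_with_timeout is hypothetical, and sleep stands in for the parse call. Dying out of the middle of C code can leak memory or leave the parser in an odd state, so treat the parser object as unusable after a timeout.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper: run $code with a timeout of $seconds.
# Returns 1 if $code finished, 0 if the alarm fired; rethrows other errors.
sub run_with_timeout {
    my ($seconds, $code) = @_;

    # A handler installed via POSIX::sigaction is delivered immediately
    # rather than being deferred until the current opcode returns.
    my $action = POSIX::SigAction->new(sub { die "TIMEOUT\n" });
    my $old    = POSIX::SigAction->new();
    POSIX::sigaction(SIGALRM, $action, $old)
        or die "sigaction failed: $!";

    my $ok = eval {
        alarm($seconds);
        $code->();
        alarm(0);
        1;
    };
    my $err = $@;

    alarm(0);                           # make sure no alarm is left pending
    POSIX::sigaction(SIGALRM, $old);    # restore the previous handler

    return 1 if $ok;
    return 0 if $err eq "TIMEOUT\n";
    die $err;
}

# Demo: sleep stands in for the long-running call.
my $finished = run_with_timeout(1, sub { sleep 3 });
print $finished ? "finished\n" : "timed out\n";
```

In the script above, the coderef would wrap the call as sub { $libxml->parse_html_string($html) }.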
Thanks!
-sam
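
A way to bound the parse that does not depend on signal delivery inside the parser is to do the work in a forked child and have the parent kill it once a deadline passes. The sketch below assumes a Unix-like system with fork. The helper name run_in_child and its return values are hypothetical, and a busy loop stands in for the stuck parse. The child cannot hand a parsed document back directly, so any XPath work would have to happen in the child too, with results written to a pipe or a file.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper: run $code in a child process for at most $seconds.
# Returns 'ok', 'failed' (child died or exited non-zero) or 'timeout'.
sub run_in_child {
    my ($seconds, $code) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: do the work, report success or failure via the exit status.
        my $ok = eval { $code->(); 1 };
        POSIX::_exit($ok ? 0 : 1);    # skip END blocks and destructors
    }

    # Parent: poll for the child until the deadline passes.
    my $deadline = time + $seconds;
    while (1) {
        my $reaped = waitpid($pid, WNOHANG);
        if ($reaped == $pid) {
            return $? == 0 ? 'ok' : 'failed';
        }
        if (time >= $deadline) {
            kill 'KILL', $pid;        # SIGKILL cannot be blocked or deferred
            waitpid($pid, 0);         # reap the child so no zombie is left
            return 'timeout';
        }
        select(undef, undef, undef, 0.05);    # sleep 50 ms between polls
    }
}

# Demo: an endless loop stands in for a parse that never returns.
print run_in_child(1, sub { 1 while 1 }), "\n";
```

The cost is one fork per document, which may matter when parsing many small files.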