perl-xml
Problem timing out XML::LibXML parse_html_string call
by Sam Tregar
Feb 3 2009 11:44AM

Hello all.  I'm using XML::LibXML to parse some HTML.  Mostly it's working
great - fast and very useful XPath support.  My problem is that it's choking
on some very bad HTML in a very bad way - it's sitting on the CPU until
killed manually.  I expected some HTML wouldn't parse, so this isn't such a
tragedy.  What is a big problem is that my attempts to work around this with
alarm() aren't working!

Here's my code:

use strict;
use warnings;
use XML::LibXML;

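# Slurp the whole HTML document from STDIN or the files named on the command line.
my $html = do { local $/; <> };

my $libxml = XML::LibXML->new();
#$libxml->recover(2);

eval {
    # Abort the parse if it runs for more than 10 seconds.
    local $SIG{ALRM} = sub { die "TIMEOUT\n" };
    alarm(10);
    $libxml->parse_html_string($html);
    alarm(0);
};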
if ($@ and $@ eq "TIMEOUT\n") {
    warn "Timed out ok.\n";
} elsif ($@) {
    die $@;
}

If I replace the parse call with sleep(20) then it works as expected - the
alarm triggers and the timeout is caught.  If I run it as-is with my sample
HTML then it never stops until killed.  If you want to play along at home
here's the test file:

http://sam.tregar.com/libxml-fail.html

BEWARE: that's some really bad HTML and it not only breaks XML::LibXML but
it also crashed Firefox on me.  You probably don't want to load it in your
browser.

I've never had alarm() fail like this.  Is there an alternative I can try?
Any other ideas about how to handle this?

Thanks!
-sam
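
One possible explanation, offered as an assumption and not confirmed in this post: since Perl 5.8, handlers set through %SIG are "deferred" and only run between Perl opcodes, so a single long call into C code (such as libxml2's parser) can hold off the ALRM handler, while sleep() returns as soon as the signal arrives. A minimal sketch of one alternative is to install the handler with POSIX::sigaction, which delivers the signal immediately unless safe(1) is set. The helper name with_timeout is hypothetical, and dying out of the middle of C code may leak memory or leave the library in an inconsistent state, so treat this as something to experiment with and not as a guaranteed fix.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not from the original post): run $code under a time
# limit enforced by an immediate ("unsafe") SIGALRM handler.
# Returns (1, $result) on success and (0, undef) on timeout.
sub with_timeout {
    my ($seconds, $code) = @_;

    # Handlers installed through POSIX::sigaction are not deferred unless
    # ->safe(1) is set, so this one can fire while C code is still running.
    my $action = POSIX::SigAction->new(sub { die "TIMEOUT\n" });
    my $old    = POSIX::SigAction->new;
    POSIX::sigaction(POSIX::SIGALRM(), $action, $old)
        or die "sigaction failed: $!";

    my $result;
    my $ok = eval {
        alarm($seconds);
        $result = $code->();
        alarm(0);
        1;
    };
    my $err = $@;

    alarm(0);                                  # cancel any pending alarm
    POSIX::sigaction(POSIX::SIGALRM(), $old);  # restore the previous handler

    return (1, $result) if $ok;
    return (0, undef)   if $err eq "TIMEOUT\n";
    die $err;                                  # re-throw unrelated errors
}
```

With the script above, the call would look something like with_timeout(10, sub { $libxml->parse_html_string($html) }). A more conservative route is to run the parse in a forked child process and kill the child from the parent when the alarm fires, at the cost of having to pass the results back.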