perl-xml
Possible Bug in XML::SAX::PurePerl Reference implementation - and proposed fix
by John Wade
Mar 9 2006 2:05PM

Hi All,

I think I have found a bug in the way that the XML::SAX::PurePerl parser
handles references. I am using PurePerl.pm,v 1.19 2005/10/24 19:22:12
with Perl 5.8.0 on RedHat Linux AS3. The bug manifests itself as the
module throwing the parser error "Invalid name in entity" while trying to
parse a reference (in my case, a numeric character reference for an
apostrophe, such as &#39;). I found the problem while parsing a very
large XML file (over 5 million lines). It occurred only on some
references and was reproducible every time. When I extracted the
problematic section of the XML into another file and parsed that, I got
no errors.

I added a debug print statement to the PurePerl module immediately
before line 743:

print "data = \"$data\"\n";
# EntityRef
my $name = $self->Name($reader)
    || $self->parser_error("Invalid name in entity", $reader);

The output of this debug print showed that the parser was not grabbing
the entire reference, only its first character or characters (for
example "#"), so the pattern matches on lines 378 and 387 could not
succeed.

I assume the problem is that line 376 does not ask the reader for more
than one character:

my $data = $reader->data;

If the reader is near the end of the buffer, it is quite possible that
the entire reference will not be in the buffer, so unless we ask it to
retrieve more characters, we may get only a partial reference.
Obviously, the odds of this occurring depend on the size of the buffer
and the positions of the references in the XML. In the 5 million line
file I was parsing, there were 1407 references, but only three triggered
the problem.
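
To illustrate the buffer-boundary effect described above, here is a
minimal sketch in plain Perl. It does not use XML::SAX::PurePerl at all;
the sample string, the 8-character chunk size, and the variable names are
assumptions chosen for illustration. It splits the input into fixed-size
chunks, as a buffered reader might, and reports any chunk that ends
partway through a reference.

```perl
use strict;
use warnings;

# Hypothetical input and chunk size, picked so that the reference
# straddles a chunk boundary. Neither comes from the real parser.
my $xml        = "<a>it&#39;s</a>";
my $chunk_size = 8;

# Split the input into chunks of at most $chunk_size characters.
my @chunks = $xml =~ /(.{1,$chunk_size})/gs;

# A chunk that ends with '&' followed only by non-';' characters
# has cut a reference in two.
my @split_refs = grep { /&[^;]*\z/ } @chunks;

print "chunk: \"$_\"\n" for @chunks;
print "partial reference at end of chunk: \"$_\"\n" for @split_refs;
```

With these assumed values the first chunk ends in "&#3" and the second
begins with "9;", so code that looks only at the first chunk never sees a
complete reference.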

Proposed fix:

I modified that line of PurePerl.pm (the call to $reader->data quoted
above) to read:

my $data = $reader->data(5);

This forces the reader to call $self->read_more and fetch a new bufferful
if there are fewer than 5 characters in the buffer. After making this
change, I was able to parse the file successfully, without any errors or
obvious problems. Since my understanding of this module is very
limited, I will not presume that this is appropriate in all cases, but
it works for me, and I wanted to pass it along to the community as a
whole in case it has more general relevance.
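
The effect of asking for a minimum number of characters can be sketched
with a toy reader. ToyReader below is a hypothetical stand-in, not the
real XML::SAX::PurePerl reader: its data, read_more, and move_along
methods only mimic the behaviour described in this message (top up the
buffer when it holds fewer characters than requested), and the input
string and 5-character chunk size are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser's reader object.
package ToyReader;

sub new {
    my ($class, $input, $chunk_size) = @_;
    my $self = bless { input => $input, chunk => $chunk_size, buffer => '' }, $class;
    $self->read_more;    # prime the buffer with the first chunk
    return $self;
}

# Move the next chunk of input into the buffer; false once input is exhausted.
sub read_more {
    my ($self) = @_;
    return 0 unless length $self->{input};
    $self->{buffer} .= substr($self->{input}, 0, $self->{chunk}, '');
    return 1;
}

# Return the buffer, first topping it up until it holds at least $min
# characters or the input runs out. With no argument, $min defaults to 1.
sub data {
    my ($self, $min) = @_;
    $min ||= 1;
    while (length($self->{buffer}) < $min) {
        last unless $self->read_more;
    }
    return $self->{buffer};
}

# Discard $n characters from the front of the buffer.
sub move_along {
    my ($self, $n) = @_;
    substr($self->{buffer}, 0, $n, '');
    return;
}

package main;

# Position the reader just after the '&' of "&#39;" with only "#3" buffered.
my $reader = ToyReader->new("it&#39;s", 5);    # first chunk is "it&#3"
$reader->move_along(3);                         # skip "it&"

my $without_min = $reader->data;       # partial reference only
my $with_min    = $reader->data(5);    # buffer topped up first

print "data()  = \"$without_min\"\n";
print "data(5) = \"$with_min\"\n";
print "decimal reference matched: $1\n" if $with_min =~ /^#([0-9]+);/;
```

In this sketch data() returns only "#3", which cannot match a pattern
that expects the closing ";", while data(5) returns "#39;s", which can.
Note that a fixed minimum of 5 covers only short references; a longer
numeric reference such as #x1F600; would still be cut off, so a more
general fix might need to keep reading until it sees the ";".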

P.S. I know I should be using a C-based parser.

Thanks,
John Wade



