Possible Bug in XML::SAX::PurePerl Reference implementation - and proposed fix
by John Wade
Mar 9 2006 2:05PM

Hi All,
I think I have found a bug in the way that the XML::SAX::PurePerl parser
handles references. I am using PurePerl.pm,v 1.19 2005/10/24 19:22:12
with Perl 5.8.0 on RedHat Linux AS3.

The bug manifests itself as the module throwing the parser error
"Invalid name in entity" while trying to parse a reference (in my case,
a reference to an apostrophe). I hit the problem while parsing a very
large XML file (over 5 million lines). It occurred on only some
references and was reproducible every time. When I extracted the
problematic section of the XML into another file and parsed that, I got
no errors.

I added a debug print statement to the PurePerl module immediately
before line 743:

    print "data = \"$data\"\n";
    # EntityRef
    my $name = $self->Name($reader)
        || $self->parser_error("Invalid name in entity", $reader);

The output of this debug print showed that the parser was not grabbing
the entire reference, only its first character or two (such as "#"), so
the pattern matches on lines 378 and 387 could not succeed. I assume the
problem is that line 376 does not require the reader to supply more than
one character:

    my $data = $reader->data;

If the reader is near the end of the buffer, it is quite possible that
the entire reference will not be in the buffer, so unless we ask it to
retrieve more characters, we may get only a partial reference.
Obviously the odds of this occurring depend on the size of the buffer
and the position of the references in the XML. In the 5 million line
file I was parsing, there were 1407 references, but only three triggered
the problem.

Proposed fix:

I modified the "my $data = $reader->data;" line of PurePerl.pm to read:

    my $data = $reader->data(5);

This forces the reader to call read_more and fetch another bufferful if
there are fewer than 5 characters in the buffer. After making this
change, I was able to parse the file successfully without any errors or
obvious problems. Since my understanding of this module is very
limited, I will not presume that this is an appropriate thing to do in
all cases, but it seems to work for me, and I wanted to pass it along to
the community as a whole in case it has more general relevance.

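
To illustrate the buffer-boundary effect I am describing, here is a
minimal, self-contained sketch. ToyReader is a hypothetical stand-in
that I made up for this example, not the real reader class in
XML::SAX::PurePerl; it only assumes that data() takes an optional
minimum length and tops the buffer up via read_more, which is how I
understand the real reader to behave. The chunk size of 8 and the input
string are likewise invented so that the reference straddles a chunk
boundary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ToyReader: hypothetical stand-in for the parser's reader object.
# It hands out input in fixed-size chunks, like a buffered stream.
package ToyReader;

sub new {
    my ($class, $text, $chunk_size) = @_;
    return bless { pending => $text, chunk => $chunk_size, buffer => '' }, $class;
}

# Append the next chunk of pending input to the buffer.
# Returns false once the input is exhausted.
sub read_more {
    my ($self) = @_;
    return 0 unless length $self->{pending};
    $self->{buffer} .= substr($self->{pending}, 0, $self->{chunk}, '');
    return 1;
}

# Return the buffer, first topping it up until it holds at least
# $min_length characters (default 1) or the input runs out.
sub data {
    my ($self, $min_length) = @_;
    $min_length = 1 unless defined $min_length;
    while (length($self->{buffer}) < $min_length) {
        last unless $self->read_more;
    }
    return $self->{buffer};
}

# Discard $n characters from the front of the buffer.
sub move_along {
    my ($self, $n) = @_;
    substr($self->{buffer}, 0, $n, '');
    return;
}

package main;

# Position a fresh reader at the start of the reference, then look at
# what data() returns for the requested minimum length.
sub peek_at_reference {
    my ($min_length) = @_;
    my $reader = ToyReader->new('abcdef&#39;s', 8);
    $reader->data;           # loads the first chunk: 'abcdef&#'
    $reader->move_along(6);  # pretend the parser consumed 'abcdef'
    return $reader->data($min_length);
}

print "min length 1: '", peek_at_reference(1), "'\n";
print "min length 5: '", peek_at_reference(5), "'\n";
```

With a minimum length of 1 the reader returns only '&#', because the
rest of the reference is still in the next chunk; with a minimum length
of 5 it reads another chunk first and returns '&#39;s'. The real module
may well differ in its details, so please treat this as a picture of the
failure mode rather than a description of the actual code.
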
p.s. I know I should be using a C based parser.
Thanks,
John Wade
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com