RE: XML::LibXML error handling for non-UTF-8 data
by Andrew Strader
Jun 6 2006 6:54AM
If their application outputs non-UTF-8 characters, then its output
encoding can't be UTF-8. Since your application expects UTF-8 in the
input, it's not compatible with theirs. The true solution is to make
sure that the same character encoding is being used on both ends.
Agreeing on the character encoding is every bit as important as agreeing
on the XML schema, but it's something that is too often overlooked.
Sorry for ranting, but in this line of work I see or hear about so many
band-aids for character encoding issues. The coup de grâce was a database
table mapping specific hex values to be substituted in text strings,
which was essentially a poorly re-invented wheel of UTF-to-ISO
conversion. The developer had just been adding rows to the table every
time he discovered another non-ISO character in an input string. He
apparently never realized that the input and output must be in the same
character encoding.
Andrew Strader
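
To make the point about agreeing on an encoding concrete, here is a minimal sketch using Perl's core Encode module. It checks whether a byte string is well-formed UTF-8 and, if it is not, re-encodes it from an assumed source encoding. The helper names is_valid_utf8 and to_utf8 are hypothetical, and cp1252 is only a guess prompted by the 0x91 byte in the parser error quoted below; the real encoding has to be confirmed with the sender.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes intact.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: return $bytes as UTF-8, treating them as
# $assumed_encoding whenever they are not already valid UTF-8.
sub to_utf8 {
    my ($bytes, $assumed_encoding) = @_;
    return $bytes if is_valid_utf8($bytes);
    return encode('UTF-8', decode($assumed_encoding, $bytes));
}

# The four bytes reported by the parser error: 0x91 0x65 0x61 0x73.
my $sample = "\x91eas";
my $fixed  = to_utf8($sample, 'cp1252');   # cp1252 is an assumption
```

Guessing the source encoding this way is still a band-aid. It is only reliable once both ends have agreed on what the sender actually emits.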
-----Original Message-----
From: perl-xml-bounces@[...].com
[mailto:perl-xml-bounces@[...].com] On Behalf Of Ibrahim Dawud
Sent: Tuesday, June 06, 2006 6:23 AM
To: perl-xml@[...].com
Subject: XML::LibXML error handling for non-UTF-8 data
Dear Colleagues,
We communicate with our suppliers via XML messages over the web.
We currently use XML::LibXML (version 1.58) to parse our incoming
messages:
Example XML:
<body>
  <product>
    <productID>3661</productID>
    <price>100</price>
    <name>Name</name>
    <categoryName>Name</categoryName>
    <categoryID>28</categoryID>
    .....
  </product>
  .....
</body>
Occasionally, a supplier will send an XML message that contains
non-UTF-8 characters in a product detail. The wrong encoding of that
particular data element causes the XML itself to be not well formed.
This results in a parser error as follows:
":1: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x91 0x65 0x61 0x73 "
Then it breaks.
We tried to use $parser->recover(1); so that the parser can skip
over the error.
Unfortunately, this is not enough since we have no way of knowing that
an error occurred and that the parser returned bad data for that
product. We cannot validate the products for wrong data (no specific
data format expected).
What we need is some sort of error handling within the parsing method
that will detect non-UTF-8 data, raise an error, and then SKIP over
the whole product in the XML block that contains that error, and
continue to parse the rest of the XML document normally.
So we need your advice on how to solve this problem, or a workaround.
Example of the perl code used to parse the message:
###################################################################
use XML::LibXML;

# $xml_msg holds the incoming supplier message
my $parser = XML::LibXML->new();             # create a new XML::LibXML parser
# $parser->recover(1);
my $tree = $parser->parse_string($xml_msg);  # parse the XML message
my $root = $tree->getDocumentElement;        # get the root element <body>
my $count = 0;
my @ResultSet = ();
foreach my $product ($root->findnodes('product')) {
    $ResultSet[$count][0] = $product->findvalue('productID');
    $ResultSet[$count][1] = $product->findvalue('name');
    $ResultSet[$count][3] = $product->findvalue('categoryName');
    $ResultSet[$count][4] = $product->findvalue('categoryID');
    $ResultSet[$count][5] = $product->findvalue('price');
    $count++;
}
###################################################################
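
One possible workaround for the behaviour asked for above is to filter the raw message before it reaches the parser. The sketch below uses only the core Encode module. It assumes that products appear as plain <product>...</product> blocks with no attributes or nesting, as in the example XML. The function name drop_invalid_products is hypothetical. Blocks that are not valid UTF-8 are collected for reporting and left out of the cleaned message.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: remove every <product>...</product> block whose bytes
# are not valid UTF-8. Returns the cleaned message and a reference to the
# list of rejected blocks.
sub drop_invalid_products {
    my ($xml_bytes) = @_;
    my (@kept, @rejected);
    # The capturing group makes split() return the product blocks as well
    # as the text between them.
    for my $part (split m{(<product>.*?</product>)}s, $xml_bytes) {
        my $is_product = $part =~ m{\A<product>};
        my $is_utf8 = eval { decode('UTF-8', $part, FB_CROAK | LEAVE_SRC); 1 };
        if ($is_product && !$is_utf8) {
            push @rejected, $part;   # record the bad block, then leave it out
            next;
        }
        # Text outside product blocks is kept unchanged, even if it is invalid.
        push @kept, $part;
    }
    return (join('', @kept), \@rejected);
}

# Made-up sample message: the second product contains the byte 0x91.
my $msg = "<body><product><productID>1</productID><name>ok</name></product>"
        . "<product><productID>2</productID><name>\x91eas</name></product></body>";
my ($clean, $rejected) = drop_invalid_products($msg);
warn scalar(@$rejected), " product(s) skipped: not valid UTF-8\n" if @$rejected;
```

The cleaned string can then be passed to parse_string as in the code above. Invalid bytes outside a product block would still make the parser fail, so the parse call may still need to be wrapped in eval.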
Thank you and best regards.
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs