perl-xml
RE: XML::LibXML error handling for non-UTF-8 data
by Andrew Strader
Jun 6 2006 6:54AM
If their application outputs non-UTF-8 characters, then its output
encoding can't be UTF-8. Since your application expects UTF-8 in the
input, it's not compatible with theirs. The true solution is to make
sure that the same character encoding is being used on both ends.
Agreeing on the character encoding is every bit as important as agreeing
on the XML schema, but it's something that is too often overlooked.
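
As a rough illustration of agreeing on the encoding at the receiving end, the sketch below re-encodes the incoming bytes to UTF-8 before they reach the parser. It assumes the supplier is really sending Windows-1252 (the 0x91 byte in the quoted parser error would be a left single quote in that code page, but that is a guess and has to be confirmed with the supplier). The name cp1252_to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper. Assumption: the sender's real encoding is
# Windows-1252 (cp1252); confirm this with the supplier first.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);   # raw bytes -> Perl characters
    return encode('UTF-8', $chars);         # characters -> UTF-8 bytes
}

# 0x91 is U+2018 in cp1252, which is E2 80 98 in UTF-8.
my $fixed = cp1252_to_utf8("\x91eas");
```

The cleaner fix is still on the sending side: either emit real UTF-8, or state the actual encoding in the XML declaration so that the parser can convert it itself.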

Sorry for ranting, but in this line of work I see or hear about so many
band-aids for character encoding issues. The coup de grâce was a database
table mapping specific hex values to be substituted in text strings,
which was essentially a poorly re-invented wheel of UTF-to-ISO
conversion. The developer had just been adding rows to the table every
time he discovered another non-ISO character in an input string. He
apparently never realized that the input and output must be in the same
character encoding.

Andrew Strader
 
-----Original Message-----
From: perl-xml-bounces@[...].com
[mailto:perl-xml-bounces@[...].com] On Behalf Of Ibrahim Dawud
Sent: Tuesday, June 06, 2006 6:23 AM
To: perl-xml@[...].com
Subject: XML::LibXML error handling for non-UTF-8 data

Dear Colleagues,

We communicate with our suppliers via XML messages over the web.
We currently use XML::LibXML (version 1.58) to parse our incoming
messages:

Example XML:
<body>
<product>
<productID>3661</productID>
<price>100</price>
<name>Name</name>
<categoryName>Name</categoryName>
<categoryID>28</categoryID>
.....
</product>
.....
</body>

Occasionally, a supplier will send an XML message that contains
non-UTF-8 characters in a product detail. The wrong encoding of that
particular data element causes the XML itself to be not well formed.

This results in a parser error as follows:

":1: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x91 0x65 0x61 0x73 "

Then it breaks.

We tried using $parser->recover(1); so that the parser can skip
over the error.
Unfortunately, this is not enough since we have no way of knowing that
an error occurred and that the parser returned bad data for that
product. We cannot validate the products for wrong data (no specific
data format expected).

What we need is some sort of error handling within the parsing method
that will detect non-UTF-8 data, raise an error, and then SKIP over
the whole product in the XML block that contains that error, and
continue to parse the rest of the XML document normally.

So we need your advice to solve this problem or a work around.
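
One possible workaround, sketched here under several assumptions, is to check each product block for valid UTF-8 before parsing, set the bad ones aside for reporting, and pass only the remainder to the parser. The helper names is_valid_utf8 and drop_invalid_products are hypothetical. The sketch assumes the message arrives as raw bytes, that <product> elements carry no attributes and are never nested, and that the literal tags do not occur inside comments or CDATA sections. It uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if the byte string decodes as strict UTF-8.
# LEAVE_SRC stops decode() from modifying its argument.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Returns the message without the badly encoded <product> blocks,
# plus a reference to the list of rejected blocks.
sub drop_invalid_products {
    my ($xml_bytes) = @_;
    my ($clean, @rejected) = ('');
    # The capturing group makes split() keep the <product> blocks.
    for my $part (split m{(<product>.*?</product>)}s, $xml_bytes) {
        if ($part =~ m{\A<product>} && !is_valid_utf8($part)) {
            push @rejected, $part;   # keep it for logging or a supplier report
            next;
        }
        $clean .= $part;
    }
    return ($clean, \@rejected);
}
```

The cleaned string can then go to $parser->parse_string as before, and the rejected list says exactly which products were skipped. This treats the symptom only; the encoding mismatch itself still has to be resolved with the supplier.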

Example of the perl code used to parse the message:
###################################################################
    use XML::LibXML;

    # $xml_msg holds the incoming XML message
    my $parser = XML::LibXML->new();             # create a new XML::LibXML parser
    # $parser->recover(1);
    my $tree = $parser->parse_string($xml_msg);  # parse the XML message
    my $root = $tree->getDocumentElement;        # get the root element <body>

    my $count = 0;
    my @ResultSet = ();

    # collect the fields of each <product> into one row of @ResultSet
    foreach my $product ($root->findnodes('product')) {
        $ResultSet[$count][0] = $product->findvalue('productID');
        $ResultSet[$count][1] = $product->findvalue('name');
        $ResultSet[$count][3] = $product->findvalue('categoryName');
        $ResultSet[$count][4] = $product->findvalue('categoryID');
        $ResultSet[$count][5] = $product->findvalue('price');
        $count++;
    }
###################################################################

Thank you and best regards.
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs