Re: Handling £ (pound sterling) symbols in content
by Neil Hughes
Jun 28 2009 2:22AM
Thanks Grant... that fixed it.
The database originally started with a DOS application and still encodes
'£' symbols weirdly, so my other (non-Perl) Windows application was
extracting the data into XML and adding the correct '£' symbol, but I
was not specifying the encoding as you pointed out.
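
A minimal sketch of why the missing declaration mattered, using only the core Encode module. It assumes the export writes '£' as the single CP1252 byte 0xA3; the variable names are illustrative and not taken from the original script. With no encoding declared, the parser treats the input as UTF-8, and a lone 0xA3 byte is not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: one raw byte, as a CP1252 export would write the pound sign.
my $raw = "one \xA3";

# Read as CP1252 (or ISO-8859-1), byte 0xA3 is U+00A3, the pound sign.
my $text = decode('cp1252', $raw);
printf "last character: U+%04X\n", ord(substr($text, -1));

# Read as UTF-8, a lone 0xA3 is a stray continuation byte, so a strict
# decode dies. LEAVE_SRC stops decode() from modifying $raw.
my $valid_utf8 = eval {
    decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} || 0;
print "valid UTF-8: ", ($valid_utf8 ? "yes" : "no"), "\n";

# The same character written as UTF-8 takes two bytes: 0xC2 0xA3.
my $utf8_bytes = encode('UTF-8', "\x{A3}");
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;
print "UTF-8 bytes: $hex\n";
```
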
If I specify "WINDOWS-1252" or "ISO-8859-1" my XML::Twig code can parse
it without a problem. I'll probably use the former because I suspect
there are Euro symbols in the data somewhere.
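
A minimal sketch of the fix described above, assuming XML::Twig is installed and the export really is CP1252. The sample document is hypothetical, and "\xA3" stands in for the raw pound byte. The last few lines use the core Encode module to show why the two encodings differ for the Euro: byte 0x80 is '€' in CP1252 but a control character in ISO-8859-1.

```perl
use strict;
use warnings;
use XML::Twig;
use Encode qw(decode);

# Hypothetical sample input: the declaration names the encoding, and
# "\xA3" is the raw CP1252 byte for the pound sign.
my $input = qq{<?xml version="1.0" encoding="WINDOWS-1252"?>}
          . qq{<root><item>one \xA3</item><item>two</item></root>};

my $t = XML::Twig->new();
$t->parse($input);    # no "not well-formed (invalid token)" error now

# The parser hands back decoded characters, so the pound sign is U+00A3.
my $first = $t->root->first_child('item')->text;
printf "%d characters, last is U+%04X\n",
    length($first), ord(substr($first, -1));

# Euro check: the same byte means different things in the two encodings.
my $euro_cp1252 = decode('cp1252',     "\x80");    # U+20AC, the Euro sign
my $euro_latin1 = decode('iso-8859-1', "\x80");    # U+0080, a C1 control
printf "0x80 as CP1252: U+%04X, as ISO-8859-1: U+%04X\n",
    ord($euro_cp1252), ord($euro_latin1);
```

If the data does turn out to contain Euro symbols, declaring ISO-8859-1 would still parse, but those bytes would come through as control characters rather than '€'.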
Much appreciated
--
Neil Hughes
On 27/6/09 22:39, Grant McLean wrote:
> Hi Neil
>
> You need to determine what encoding has been used by the database export
> process.
>
> If you're working with a Windows system then the most likely guess is
> that the data is encoded with CP1252 or 'Win-Latin-1'. In which case
> the first line of the XML file should specify an encoding like this:
>
> <?xml version="1.0" encoding="WINDOWS-1252" ?>
>
> If the data was encoded with UTF-8 then the XML parser module would have
> recognised it automatically, so you can eliminate that option.
>
> You can also safely assume that the data is not in ISO-8859-1 (Latin-1),
> because that encoding pre-dates the definition of the Euro symbol.
>
> Encodings in XML (and Perl) are a largish subject with many subtleties;
> you can read more here:
>
> http://perl-xml.sourceforge.net/faq/#encodings
>
> Cheers
> Grant
>
> On Sat, 2009-06-27 at 22:24 +0100, Neil Hughes wrote:
> > I've hit a problem in XML::Twig trying to handle data exported from a
> > legacy database, but I suspect this is an issue I need to get some
> > advice on regardless of the parser...
> >
> > The data contains '£' symbols which I'm struggling to format in XML for
> > processing later on. The following code might help explain:
> >
> > ------------ BEGIN --------------
> >
> > use strict;
> > use warnings;
> >
> > use XML::Twig;
> >
> > my $t= XML::Twig->new();
> >
> > # this is OK
> > #my $input = '<?xml version="1.0"?><root><item>one</item><item>two</item><item>three</item></root>';
> >
> >
> > # this is invalid
> > #my $input = '<?xml version="1.0"?><root><item>one £</item><item>two</item><item>three</item></root>';
> >
> > # this is OK
> > #my $input = '<?xml version="1.0"?><root><item><![CDATA[one]]></item><item><![CDATA[two]]></item><item><![CDATA[three]]></item></root>';
> >
> >
> > # this is invalid
> > my $input = '<?xml version="1.0"?><root><item><![CDATA[one £]]></item><item><![CDATA[two]]></item><item><![CDATA[three]]></item></root>';
> >
> >
> > $t->parse($input);
> > $t->print;
> >
> > ------------ END --------------
> >
> > Whether I wrap my text data in CDATA or not, as soon as I include a
> > pound sterling symbol I get the following error:
> >
> > not well-formed (invalid token) at line 1, column 46, byte 46 at
> > /usr/local/ActivePerl-5.8/lib/XML/Parser.pm line 187
> > at /Users/nkh/Documents/Dev/Perl/xml_twig/pound_test1.pl line 14
> >
> > Byte 46 seems to align with the '£', so I'm wondering what I need to do
> > to get this character not to break the parser.
> >
>
> _______________________________________________
> Perl-XML mailing list
> Perl-XML@[...].com
> To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs