perl5-porters
Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
by Demerphq
May 23 2007 10:20AM
On 5/23/07, Tels <nospam-abuse@[...].com>  wrote:
>  -----BEGIN PGP SIGNED MESSAGE-----
>  Hash: SHA1
> 
>  Moin,
> 
>  On Wednesday 23 May 2007 15:53:14 demerphq wrote:
>  > Hi Dan,
>  >
>  > I was wondering if there is some way to get Encode to emit the little
>  > endian version of UTF-16 (with BOM) as a typical Win32 on Intel app
>  > would do. It seems to me that currently
>  >
>  > my $octets= encode('UTF-16',$string);
>  >
>  > will only emit the big-endian form of it.
> 
>  As far as I gleaned from working with UTF, this is right. (or in other
>  words, UTF-16BE is just an alias for UTF-16), but I could be wrong.

No, that's not correct. UTF-16 files can be either big endian or little
endian and must start with a Byte Order Mark, codepoint U+FEFF, which
is used to determine what their endianness is. UTF-16LE and UTF-16BE
are encodings with a specific endianness and do not start with a BOM.
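
To make the distinction concrete, here is a minimal sketch that prints the bytes each encoding name produces. It assumes Encode's documented behaviour for these names; the sample string and the hex_bytes helper are illustrative only and not from the original message.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample: U+0041 makes the byte order easy to see.
my $string = "A";

# 'UTF-16' emits a BOM followed by big-endian code units.
my $utf16   = encode('UTF-16', $string);

# 'UTF-16LE' and 'UTF-16BE' fix the byte order and emit no BOM.
my $utf16le = encode('UTF-16LE', $string);
my $utf16be = encode('UTF-16BE', $string);

# Little-endian with a BOM: prepend U+FEFF by hand before encoding.
my $le_bom  = encode('UTF-16LE', "\x{FEFF}" . $string);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return uc join ' ', unpack('(H2)*', $bytes);
}

printf "%-18s %s\n", 'UTF-16',         hex_bytes($utf16);    # FE FF 00 41
printf "%-18s %s\n", 'UTF-16LE',       hex_bytes($utf16le);  # 41 00
printf "%-18s %s\n", 'UTF-16BE',       hex_bytes($utf16be);  # 00 41
printf "%-18s %s\n", 'UTF-16LE + BOM', hex_bytes($le_bom);   # FF FE 41 00
```

The last line is the form a typical Win32 tool writes: the BOM itself is encoded little-endian, so a reader can detect the byte order from the first two octets.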

>  > Of course well-behaved apps shouldn't care, but some do. I also know I
>  > can hand-emit the BOM myself like so:
>  >
>  > my $octets= encode('UTF-16LE',chr(0xFEFF).$string);
>  >
>  > but this struck me as a bit convoluted and makes it a bit tricky to do
>  > with IO layers. If there isn't a way to do it currently maybe the name
>  > 'UTF-16:le' or something similar could be used for this?
> 
>  I am not sure I understand your question, since you showed it is possible to
>  get UTF-16LE, so what exactly do you want more? :)
> 
>  Shouldn't then:
> 
>          binmode ($FILE, ':encoding(UTF-16LE)') or die("$!");
> 
>  just work?

Yes, it works, but it doesn't ensure the file starts with a BOM. That
is easily enough done by hand, but as I said above it is a touch
annoying. I can imagine scenarios where it's not clear whose
responsibility it is to add the BOM. I actually was trying to write a
UTF-8 to UTF-16 converter (long story) but the files it produces differ
from those produced by most Win32 tools I used for comparison, as they
emit the little-endian variant instead.
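
For reference, the by-hand approach with an IO layer can be sketched as follows. This is only an illustration under stated assumptions: an in-memory filehandle stands in for a real file so the resulting bytes can be inspected, and the names $string, $buffer and $fh are hypothetical.

```perl
use strict;
use warnings;

my $string = "A";   # illustrative payload
my $buffer = '';    # in-memory target standing in for a real file

# The UTF-16LE layer fixes the byte order but writes no BOM on its own.
open(my $fh, '>:encoding(UTF-16LE)', \$buffer)
    or die "open failed: $!";

# Whoever opens the handle has to remember to write the BOM exactly once,
# before any other output.
print {$fh} "\x{FEFF}";
print {$fh} $string;

close($fh) or die "close failed: $!";

# $buffer now holds the BOM (FF FE) followed by "A" as 41 00.
print join(' ', map { sprintf '%02X', ord($_) } split(//, $buffer)), "\n";
```

The awkward part is visible here: the BOM is ordinary output as far as the layer is concerned, so any code that later reopens the file for append, or that hands the handle to another module, has to know whether it has already been written.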

Also, it struck me as weird that UTF-16 in Perl is always big endian
even on a little-endian architecture. Obviously it's easier to test
this way.

IMO it would be cool to have a way to control it in code without
hand-adding the BOM.

I'll do a patch if there isn't already a way to do it. I just wanted to
be sure before I look into it, and since Dan knows the code, what could
take me a while to do would probably be the work of a few minutes for
him, so I figured I'd see what he had to say first.

cheers,
yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"