perl5-porters
Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
by Demerphq
May 23 2007 11:16AM
On 5/23/07, Tels <nospam-abuse@[...].com>  wrote:
>  On Wednesday 23 May 2007 17:20:15 demerphq wrote:
>  > On 5/23/07, Tels <nospam-abuse@[...].com> wrote:
>  > > On Wednesday 23 May 2007 15:53:14 demerphq wrote:
>  > > > Hi Dan,
>  > > >
>  > > > I was wondering if there is some way to get Encode to emit the little
>  > > > endian version of UTF-16 (with BOM) as a typical Win32 on Intel app
>  > > > would do. It seems to me that currently
>  > > >
>  > > > my $octets= encode('UTF-16',$string);
>  > > >
>  > > > will only emit the big-endian form of it.
>  > >
>  > > As far as I gleaned from working with UTF, this is right. (or in other
>  > > words, UTF-16BE is just an alias for UTF-16), but I could be wrong.
>  >
>  > No, that's not correct. UTF-16 files can be either big endian or little
>  > endian and must start with a Byte Order Mark, codepoint U+FEFF, which
>  > is used to determine what their endianness is.
> 
>  As far as I read the wiki entry, they "should" but not "must". Of course,
>  the BOM makes things much easier.
> 
>  Quote:
> 
>          "If the BOM is missing, barring any indication of byte order from
>           higher-level protocols, big endian is to be used or assumed."

Quote (emphasis added):

The UTF-16 (and UCS-2) encoding scheme allows either endian
representation to be used, but *mandates* that the byte order *should*
be *explicitly* indicated by prepending a Byte Order Mark before the
first serialized character.

Quote (emphasis added):

*Technically*, with the UTF-16 scheme the BOM prefix is optional, but
omitting it is *not recommended* as UTF-16LE or UTF-16BE should be
used instead. If the BOM is missing, barring any indication of byte
order from higher-level protocols, big endian is to be used or
assumed. The BOM is *not optional* in the UCS-2 scheme.

Which makes that about the closest you can come to a MUST while still
being a SHOULD that I can imagine.

Furthermore, Windows historically did UCS-2, and therefore I think it's
generally accepted in the Win32 world that the BOM is not optional.

>  > UTF-16LE and UTF-16BE
>  > are encodings with a specific endianness and do not start with a BOM.
> 
>  Erm, see above.
> 
>  And that still doesn't answer how you know which endianness to emit when the
>  conversion only specifies "UTF-16".

I don't really care which; I just want to be able to control things
more easily, without fighting over the BOM. For instance, how do I use
the open pragma to specify that I want all files emitted in the little
endian form of UTF-16 (with BOM)?

Thus I'd like to be able to say something like this (the ':be' and
':le' names are only a suggestion):

use open ':encoding(UTF-16)'; # I don't care about endianness, the BOM will tell
use open ':encoding(UTF-16:be)'; # big endian with BOM
use open ':encoding(UTF-16:le)'; # little endian with BOM
use open ':encoding(UTF-16BE)'; # big endian without BOM
use open ':encoding(UTF-16LE)'; # little endian without BOM
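
Until something like that exists, the by-hand approach can be wrapped up
for IO layers. The following is only a sketch of that workaround, not a
proposed API: open_utf16le_with_bom is a hypothetical helper name, and
the temp file is there just to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a file for writing as little-endian UTF-16
# and emit the BOM by hand, since the UTF-16LE layer never adds one.
sub open_utf16le_with_bom {
    my ($path) = @_;
    open(my $fh, '>:raw:encoding(UTF-16LE)', $path)
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";    # U+FEFF, encoded as the octets FF FE
    return $fh;
}

my (undef, $path) = tempfile(UNLINK => 1);
my $out = open_utf16le_with_bom($path);
print {$out} "hi";
close($out) or die "Cannot close $path: $!";

# Read the raw octets back to see what actually landed on disk.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $octets = do { local $/; <$in> };
close($in);

print unpack('H*', $octets), "\n";    # fffe68006900
```

This illustrates the annoyance: whoever opens the handle has to remember
the BOM, and nothing stops a second piece of code from adding another.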

>  When you say "UTF-16", Encode can either:
> 
>  * always omit the BOM and emit BE
>  * send a BOM and let the BE or LE be determined by random chance, the
>    architecture, or always be BE

Yes, if you want to be this pedantic about the definition you are
right. However I see omitting the BOM like smoking in bed: it's
something you should not do, and the fact that it isn't strictly
illegal doesn't make it in the slightest bit smart.
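
For reference, here is a small sketch of what Encode emits for each
name, consistent with the behaviour described in this thread (plain
UTF-16 comes out big endian with a BOM, the LE/BE names never add one).
The sample string and the hash labels are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "A";    # U+0041

# Map a label to the octets produced for that encoding name.
my %octets = (
    'UTF-16'         => encode('UTF-16',   $string),
    'UTF-16BE'       => encode('UTF-16BE', $string),
    'UTF-16LE'       => encode('UTF-16LE', $string),
    'UTF-16LE + BOM' => encode('UTF-16LE', "\x{FEFF}" . $string),
);

# Show each result as hex so the BOM and byte order are visible.
printf "%-15s %s\n", $_, unpack('H*', $octets{$_}) for sort keys %octets;
```

This should show feff0041 for plain UTF-16, 0041 and 4100 for the BE
and LE names, and fffe4100 for the hand-built little endian form.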

>  > > > Of course well behaved apps shouldn't care, but some do, also I know I
>  > > > can hand emit the BOM myself like so:
>  > > >
>  > > > my $octets= encode('UTF-16LE',chr(0xFEFF).$string);
>  > > >
>  > > > but this struck me as a bit convoluted and makes it a bit tricky to do
>  > > > with IO layers. If there isn't a way to do it currently maybe the name
>  > > > 'UTF-16:le' or something similar could be used for this?
>  > >
>  > > I am not sure I understand your question, since you showed it is
>  > > possible to get UTF-16LE, so what exactly do you want more? :)
>  > >
>  > > Shouldn't then:
>  > >
>  > >         binmode ($FILE, 'UTF-16LE') or die("$!");
>  > >
>  > > just work?
>  >
>  > Yes it works, but it doesn't ensure the file starts with a BOM. Which
>  > is easily enough done by hand, but as I said above is a touch
>  > annoying. I can imagine scenarios where it's not clear whose
>  > responsibility it is to add the BOM. I actually was trying to write a
>  > utf-8 to utf-16 converter (long story) but the files are different
>  > from that provided by most win32 tools I used for comparison as they
>  > emit the little-endian variant instead.
>  >
>  > Also it struck me as weird that UTF-16 in perl is always big endian
>  > even on a little endian architecture. Obviously it's easier to test
>  > this way.
> 
>  > Imo it would be cool to have a way to control it in code without hand
>  > adding the BOM.
> 
>  I guess that adding the BOM when you request UTF-16BE and UTF-16LE would be
>  a first start, but the wiki contradicts itself there:
> 
>          "However rather than using a BOM prepended to the data, the byte order used
>           is implicit in the name of the encoding scheme (LE for little-endian, BE
>           for big-endian). Since a BOM is specifically not to be prepended in these
>           schemes, if an encoded ZWNBSP character is found at the beginning of any
>           data encoded by these schemes is not to be considered to be a BOM, but
>           instead is considered part of the text itself. In practice most software
>           will ignore these "accidental" BOMs."

This seems clear to me: the only way you know you are dealing with
UTF-16LE or UTF-16BE is because you have specified it to be so, and
therefore if the file starts with codepoint U+FEFF it is assumed to
have its non-BOM meaning, which is that of an invisible character.
Whereas in UTF-16 the BOM is not considered to be part of the document.
So a UTF-16 file that consisted of:

  0xFEFF, 0xFEFF

would be considered to contain only one character, ZWNBSP, whereas with
UTF-16BE and UTF-16LE it would be considered to contain TWO
characters.
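
A quick way to check that reading against Encode is to decode the same
four octets under both names. This is just a sketch; the variable names
are made up for the example, and it uses the big endian byte order so
that the UTF-16BE decode sees the same two code points.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xFE\xFF\xFE\xFF";    # U+FEFF twice, big endian

# Plain UTF-16 consumes the first U+FEFF as the BOM;
# UTF-16BE treats both as ordinary characters.
my $as_utf16   = decode('UTF-16',   $octets);
my $as_utf16be = decode('UTF-16BE', $octets);

printf "UTF-16:   %d character(s)\n", length $as_utf16;
printf "UTF-16BE: %d character(s)\n", length $as_utf16be;
```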

yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"