Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
by Demerphq
May 23 2007 11:16AM
On 5/23/07, Tels <nospam-abuse@[...].com> wrote:
> On Wednesday 23 May 2007 17:20:15 demerphq wrote:
> > On 5/23/07, Tels <nospam-abuse@[...].com> wrote:
> > > On Wednesday 23 May 2007 15:53:14 demerphq wrote:
> > > > Hi Dan,
> > > >
> > > > I was wondering if there is some way to get Encode to emit the little
> > > > endian version of UTF-16 (with BOM) as a typical Win32 on Intel app
> > > > would do. It seems to me that currently
> > > >
> > > > my $octets= encode('UTF-16',$string);
> > > >
> > > > will only emit the big-endian form of it.
> > >
> > > As far as I gleaned from working with UTF, this is right. (or in other
> > > words, UTF-16BE is just an alias for UTF-16), but I could be wrong.
> >
> > No, that's not correct. UTF-16 files can be either big endian or little
> > endian and must start with a Byte Order Mark, codepoint U+FEFF, which
> > is used to determine what their endianness is.
>
> As far as I read the wiki entry, they "should" but not "must". Of course,
> the BOM makes things much easier.
>
> Quote:
>
> "If the BOM is missing, barring any indication of byte order from
> higher-level protocols, big endian is to be used or assumed."

Quote (emphasis added):

  The UTF-16 (and UCS-2) encoding scheme allows either endian
  representation to be used, but *mandates* that the byte order *should*
  be *explicitly* indicated by prepending a Byte Order Mark before the
  first serialized character.

Quote (emphasis added):

  *Technically*, with the UTF-16 scheme the BOM prefix is optional, but
  omitting it is *not recommended* as UTF-16LE or UTF-16BE should be
  used instead. If the BOM is missing, barring any indication of byte
  order from higher-level protocols, big endian is to be used or
  assumed. The BOM is *not optional* in the UCS-2 scheme.

Which makes that about the closest thing to a MUST that I can imagine
while still being a SHOULD.

Furthermore, Windows historically used UCS-2, and therefore I think it's
generally accepted in the Win32 world that the BOM is not optional.

> > UTF-16LE and UTF-16BE
> > are encodings with a specific endianess and do not start with a BOM.
>
> Erm, see above.
>
> And that still doesn't answer how you know which endianness to emit when the
> conversion only specifies "UTF-16".

I don't really care which; I just want to be able to control things
more easily without fighting over the BOM. For instance, how do I use the
open pragma to specify that I want all files emitted in the little
endian form of UTF-16 (with BOM)?

Thus I'd like to be able to say:

  use open ':encoding(UTF-16)';    # I don't care about endianness, the BOM will tell
  use open ':encoding(UTF-16:be)'; # big endian with BOM
  use open ':encoding(UTF-16:le)'; # little endian with BOM
  use open ':encoding(UTF-16BE)';  # big endian without BOM
  use open ':encoding(UTF-16LE)';  # little endian without BOM
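
For comparison, here is a minimal sketch of the by-hand workaround that works with an IO layer today: open the handle with the existing UTF-16LE encoding and print U+FEFF as the very first character. The helper name open_utf16le_with_bom is made up for illustration and is not part of Encode or the open pragma; the temporary file is only there so the octets can be inspected.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of Encode): open $path for writing as
# little-endian UTF-16 and emit the BOM (U+FEFF) before returning the handle.
sub open_utf16le_with_bom {
    my ($path) = @_;
    open my $fh, '>:raw:encoding(UTF-16LE)', $path
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";    # becomes the octets FF FE on disk
    return $fh;
}

# Write a short string through the helper.
my (undef, $path) = tempfile(UNLINK => 1);
my $fh = open_utf16le_with_bom($path);
print {$fh} "hi\n";
close $fh or die "Cannot close $path: $!";

# Read the raw octets back and show them in hex.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $octets = do { local $/; <$in> };
close $in;
print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
# prints: FF FE 68 00 69 00 0A 00
```

The catch is the one described above: whoever opens the handle has to remember the BOM, and nothing stops a second piece of code from writing another U+FEFF.
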
> When you say "UTF-16", Encode can either:
>
> * always omit the BOM and emit BE
> * send a BOM and let the BE or LE be determined by random chance, the
> architecture, or always be BE

Yes, if you want to be this pedantic about the definition, you are
right. However, I see it like smoking in bed: it isn't strictly illegal,
but that doesn't make it the slightest bit smart. It's something you
should not do.

> > > > Of course well-behaved apps shouldn't care, but some do. Also, I know I
> > > > can hand-emit the BOM myself like so:
> > > >
> > > > my $octets= encode('UTF-16LE',chr(0xFEFF).$string);
> > > >
> > > > but this struck me as a bit convoluted and makes it a bit tricky to do
> > > > with IO layers. If there isn't a way to do it currently, maybe the name
> > > > 'UTF-16:le' or something similar could be used for this?
> > >
> > > I am not sure I understand your question, since you showed it is
> > > possible to get UTF-16LE, so what exactly do you want more? :)
> > >
> > > Shouldn't then:
> > >
> > > binmode ($FILE, 'UTF-16LE') or die("$!");
> > >
> > > just work?
> >
> > Yes, it works, but it doesn't ensure the file starts with a BOM. Which
> > is easily enough done by hand, but as I said above is a touch
> > annoying. I can imagine scenarios where it's not clear whose
> > responsibility it is to add the BOM. I actually was trying to write a
> > UTF-8 to UTF-16 converter (long story), but the files differ
> > from those produced by most Win32 tools I used for comparison, as they
> > emit the little-endian variant instead.
> >
> > Also it struck me as weird that UTF-16 in Perl is always big endian,
> > even on a little endian architecture. Obviously it's easier to test
> > this way.
>
> > Imo it would be cool to have a way to control it in code without hand
> > adding the BOM.
>
> I guess that adding the BOM when you request UTF-16BE and UTF-16LE would be
> a first start, but the wiki contradicts itself there:
>
> "However rather than using a BOM prepended to the data, the byte order used
> is implicit in the name of the encoding scheme (LE for little-endian, BE
> for big-endian). Since a BOM is specifically not to be prepended in these
> schemes, if an encoded ZWNBSP character is found at the beginning of any
> data encoded by these schemes is not to be considered to be a BOM, but
> instead is considered part of the text itself. In practice most software
> will ignore these "accidental" BOMs.

This seems clear to me: the only way you know you are dealing with
UTF-16LE or UTF-16BE is because you have specified it to be so, and
therefore if the file starts with codepoint U+FEFF it is assumed to
have its non-BOM meaning, which is that of an invisible character
(ZWNBSP). In UTF-16, by contrast, the BOM is not considered to be part
of the document.

So a UTF-16 file that consisted of:

  0xFEFF, 0xFEFF

would be considered to contain only one character, ZWNBSP, whereas with
UTF-16BE and UTF-16LE it would be considered to contain TWO
characters.
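
The difference can be checked with Encode itself. This is a small sketch, assuming the two code units are serialized big-endian (octets FE FF FE FF); the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+FEFF twice, serialized big-endian.
my $octets = "\xFE\xFF\xFE\xFF";

# UTF-16: the leading U+FEFF is consumed as the BOM.
my $with_bom = decode('UTF-16', $octets);

# UTF-16BE: no BOM handling, so both code points are part of the text.
my $no_bom = decode('UTF-16BE', $octets);

printf "UTF-16:   %d character(s)\n", length $with_bom;   # 1
printf "UTF-16BE: %d character(s)\n", length $no_bom;     # 2
```
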
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"