perl5-porters
Re: encoding neutral unpack
by Rafael Garcia-Suarez
Jan 31 2005 8:10AM
Ton Hospel wrote:
>  I can trivially reverse the meaning for U0 and C0 in my patch of course,
>  and make the default starting mode U0. But it still wouldn't give you what you
>  want since C0 mode would still work on the (at least conceptually) upgraded
>  string, and it still wouldn't see "through" the encoding.
>  
>  I could of course make the (new) C0 be "see through" (dropping the to my
>  mind also useful "process the utf8 bytes"), but then you'd STILL not
>  get unpack("C*", $str) to be the underlying bytes (since we now by default
>  start in (new) U0 mode), it would have to be unpack("C0C*", $str). So we
>  can add yet another rule: if the pack format starts with C, we have an
>  implicit C0, and then unpack("C*", $str) would indeed do what you want.
>  
>  But we'd have thrown out the baby with the bathwater. Because there is this
>  basic problem:
>  
>  - user has some string like "àbc", and he expects unpack("C*", $_) to return
>    (224, 98, 99)
>  1) We want to be encoding neutral, so if the string
>     (accidentally) gets upgraded, utf8::upgrade($_); unpack("C*", $_) should
>     STILL return (224, 98, 99)
>  2) We want to be backward compatible, so the upgraded string should return
>     the underlying bytes.  utf8::upgrade($_); unpack("C*", $_) should
>     return (195, 160, 98, 99)
>  
>  Notice there was no mention of C0 or U0 modes here. Even so, 1) and 2)
>  are clearly incompatible.
>  So we'd have to document that he has to undo the implicit C0 in C* by doing
>  unpack("U0C*", $_) to get an encoding neutral C*

Right.
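
For readers of the archive, the two incompatible expectations quoted above can be sketched without calling unpack at all, since unpack's behaviour is exactly what is under discussion. This is only an illustration: the helper names char_values and utf8_byte_values are hypothetical and do not come from the patch.

```perl
use strict;
use warnings;

# Expectation 1) "encoding neutral": one value per character,
# regardless of how perl stores the string internally.
sub char_values {
    my ($str) = @_;
    return map { ord($_) } split //, $str;
}

# Expectation 2) "see through": one value per byte of the UTF-8
# form of the string. Works on a copy so the caller's string is untouched.
sub utf8_byte_values {
    my ($str) = @_;
    my $copy = $str;
    utf8::encode($copy);    # replace the characters by their UTF-8 bytes
    return map { ord($_) } split //, $copy;
}

my $str = "\xe0bc";         # "àbc", written with an escape
utf8::upgrade($str);        # changes only the internal representation

print join(", ", char_values($str)), "\n";       # 224, 98, 99
print join(", ", utf8_byte_values($str)), "\n";  # 195, 160, 98, 99
```

A single unpack("C*", $_) can only return one of these two lists for the upgraded string, which is the conflict being described.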

>  To me that makes things more icky than breaking backward compatibility does.
>  I don't want the user to have to do U0C*, he should just get 1) by default.
>  Wanting to "see through" the encoding is the non-standard behaviour that
>  should carry the burden of adding special code.

However, you're appealing to the Rules of Huffman Coding here.
I'm about to be convinced :) if someone else dares to comment...

>  And deciding that 1) is the right behaviour is enough to need *some*
>  patches, for example it implies ext/Encode/lib/Encode/MIME/Header.pm needs
>  a change. Also notice that by writing "U0C*" in these places you get code
>  that works under both the old perl behaviour and under the behaviour my
>  patch provides.

OK.

>  So I basically argue:
>  
>   1) being "encoding neutral" and "backward compatible (see through)" is
>      fundamentally incompatible. And "encoding neutral" is the more
>      important one.
>   2) We can get "see through encoding" already (and portable to older perls)
>      with "use bytes". And in all places it's used to get the utf-8 expansion
>      of bytes you can portably use "U0C*" even without "use bytes"
>   3) Since "see through" was the main motivation for the current C0 and U0
>      meanings anyways, we can just as well change them to the more consistent
>      meaning
>  
>  > (Waiting for the separately submitted patches...)
>  
>  Mm, they were sent before this mail. They should be on the mailing list
>  already.

They are in the archive, but not in my mailbox :( Apparently we lost a
few mails that day on this side of the internet. I'll look at them soon.
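
As a rough illustration of the two "see through" routes named in point 2) of the quoted summary, here is a minimal sketch. It assumes a perl in which unpack works character by character by default (5.10 and later, as far as I know); the helper names utf8_bytes_via_U0 and internal_bytes are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the UTF-8 expansion of the string, obtained with
# the "U0C*" template and without "use bytes".
sub utf8_bytes_via_U0 {
    my ($str) = @_;
    return unpack("U0C*", $str);
}

# Hypothetical helper: the bytes of the internal buffer, obtained with
# "use bytes" (lexically scoped to this sub body).
sub internal_bytes {
    my ($str) = @_;
    use bytes;
    return unpack("C*", $str);
}

my $str = "\xe0bc";
utf8::upgrade($str);

# Assumption: perl 5.10 or later.
print join(", ", utf8_bytes_via_U0($str)), "\n";  # 195, 160, 98, 99
print join(", ", internal_bytes($str)), "\n";     # 195, 160, 98, 99
```

Note that internal_bytes depends on the internal representation: for a string that was never upgraded it returns (224, 98, 99), whereas the "U0C*" form is expected to give the UTF-8 expansion either way.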

-- 
God wants blood victim. Birth, hymen, martyr, war, foundation of a building,
sacrifice, kidney burntoffering, druids' altars.
    -- Ulysses