Re: [I18n-sig] First draft of Unicode howto
by "Martin v. Löwis"
Aug 7 2005 8:35AM
A.M. Kuchling wrote:
>  The 'Tips for Writing Unicode-aware Programs' is also very sparse,
>  because I couldn't come up with much of anything very helpful.
>  Suggestions for this section would also be appreciated.  

Some remarks as I go through:
- UTF-8 uses 4 bytes for characters at or above U+10000 (i.e. non-BMP
  characters), and 3 bytes in the range U+0800...U+FFFF

- if you want to, you can further restrict the value ranges for
  the UTF-8 bytes: the second, third, and fourth bytes are always
  between 128 and 191; the first byte is 192..223 for two-byte,
  224..239 for three-byte, and 240..247 for four-byte sequences.

  Because of this property, you can resynchronize, i.e. find the
  start of the next character from an arbitrary position in the
  byte stream (not that I'm aware of any application that commonly
  uses resynchronization). But, for the same reason, it is unlikely
  that you will encounter byte sequences that look like UTF-8 but
  aren't.

- The example for Unicode literals with encoding errors renders
  incorrectly (I see a question mark)

- If you mention Unicode character categories, you should elaborate
  a bit. Unicode categories are things like "Letter", "Symbol",
  "Punctuation", with subcategories like "Uppercase" or "Dash".
  The list of all categories is at

http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values

- reading data: you could point out that I/O libraries sometimes
  already input and output Unicode directly, with the most
  prominent examples being GUI toolkits, XML parsers, and database
  interfaces; developers should check whether their library
  supports Unicode.
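
To illustrate the UTF-8 remarks above, here is a minimal Python 3 sketch.
The helper names (utf8_length, classify_byte, resynchronize) are made up
for this example and are not part of any library; the byte ranges are
simply the ones listed above, so this is not a full UTF-8 validator.

```python
# Sketch only: helper names are hypothetical, ranges are those from the remarks above.

def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

def classify_byte(b):
    """Classify a byte value (0..255) by its role in a UTF-8 sequence."""
    if b < 128:
        return "ascii"
    if b <= 191:
        return "continuation"   # second, third, or fourth byte
    if b <= 223:
        return "lead-2"         # first byte of a two-byte sequence
    if b <= 239:
        return "lead-3"         # first byte of a three-byte sequence
    if b <= 247:
        return "lead-4"         # first byte of a four-byte sequence
    return "invalid"

def resynchronize(data, pos):
    """Skip continuation bytes so that pos lands on the start of a sequence."""
    while pos < len(data) and 128 <= data[pos] <= 191:
        pos += 1
    return pos

# One character each of 1, 2, 3, and 4 bytes.
data = "a\u00e9\u20ac\U00010000".encode("utf-8")
print([classify_byte(b) for b in data])

# Start in the middle of the three-byte sequence and resynchronize.
start = resynchronize(data, 4)
print(data[start:].decode("utf-8") == "\U00010000")
```

Starting at offset 4 (inside the three-byte sequence), the loop skips the
remaining continuation bytes and stops at the lead byte of the four-byte
sequence, so decoding from there succeeds.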
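
For the remark on character categories, a short Python 3 sketch using the
standard unicodedata module may help. unicodedata.category() returns a
two-letter code whose first letter is the major class and whose second
letter is the subcategory. The MAJOR_CLASSES table and the describe()
helper are assumptions made for this example, not part of the module.

```python
import unicodedata

# Hypothetical lookup table: first letter of the category code -> major class.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def describe(ch):
    """Return (category code, major class name) for a single character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]

for ch in ["A", "a", "-", "+", "1", " "]:
    print(repr(ch), describe(ch))
```

Here "A" is reported as Lu (Letter, uppercase), "-" as Pd (Punctuation,
dash), and "+" as Sm (Symbol, math).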

Regards,
Martin
_______________________________________________
I18n-sig mailing list
I18n-sig@[...].org
http://mail.python.org/mailman/listinfo/i18n-sig