Re: [boost] Call for interest for native unicode character and string support in boost
by Graham
Jul 28 2005 11:35AM
> From: Rogier van Dalen <rogiervd@[...].com>
> Subject: Re: [boost] Call for interest for native unicode character
> 	and	string support in boost
> 
> Great, this seems a good first step. Glad to see things moving. I'll
> give my comments, but I hope Erik will step in so we can see what he's
> got.
> 
> > I welcome comments.
>
> I agree with the general idea.
> First, http://www.boost.org/more/lib_guide.htm#Guidelines has coding
> guidelines. In general, your code looks slightly C-ish. The Boost
> habit is to use the ".hpp" extension for C++ headers. You attached a
> file "unicode.hpp" but talk about "Unicode.hpp": note that these are
> different names.
> I suggest we make a namespace "unicode" rather than prepending
> everything with "uni". The enums had probably better be put in
> structs.
> 
> namespace unicode {
>     struct range {
>         enum type {
>             latin1_supplement,
>             latin_extended_a,
>             latin_extended_b,
>             ipa_extensions,
>             // ...
>         };
>     };
> }
Yes - it should be namespaced - I had omitted it for clarity.
I still think that the uni prefix might be useful to remind those
programmers who write 'using namespace unicode' that these are Unicode
functions - but I am happy to lose that argument.

> The fact that I find "Hungarian notation" ugly and meaningless is
> probably irrelevant, but it's not the way it's generally done in
> Boost.
> char32_t is not yet a part of the C++ standard, I believe. I'm not
> sure, maybe we'd better call it "codepoint" anyway, and use #ifdef'ed
> typedef's.
> BOOL is not C++; it is spelled "bool". DWORD doesn't exist either; I
> believe you mean uint16_t (sic) for the collation data, if I
> understand correctly what the methods are doing. But I think collation
> should not be in this header yet, but rather be inserted later, when
> the string classes are defined.

Oops - caught - I was attempting to write it in such a way that it could
be used from C as well as C++ - hence BOOL not bool.
DWORD is actually uint32_t.
I believe collation must be here as there will probably be several
containers with Unicode characteristics and this is a good level for
them to work on.

> Case conversion should probably take output iterators. That'll get rid
> of the complex/simple division. The methods should probably be
> templated as well, and take ranges rather than counts.
> template <class InputIterator, class OutputIterator>
>     lowercase (InputIterator first, InputIterator last,
>                OutputIterator result);
> template <class InputIterator, class OutputIterator>
>     uppercase (InputIterator first, InputIterator last,
>                OutputIterator result);
I like this, but we will still need to have a complex/simple division.
However, using iterators, the complex version can do both, and the simple
version then becomes GetSimpleLowercase, for case conversion without
changing length; it too can take an output iterator.
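
To make the complex/simple split concrete, here is a minimal sketch, not a proposed interface. The namespace unicode_sketch, the codepoint typedef, the helper to_simple_lower and the tiny hard-coded mapping are all assumptions made for illustration; a real implementation would be driven by the Unicode data files. U+0130 is used because its full lowercase mapping is two code points (U+0069 U+0307) while its simple mapping is one (U+0069).

```cpp
#include <cstdint>
#include <iterator>
#include <vector>

namespace unicode_sketch {  // hypothetical namespace, for illustration only

typedef std::uint32_t codepoint;  // assumed 32-bit code point type

// Simple mapping for a single code point: always one in, one out.
// Toy table: ASCII A-Z plus U+0130 only.
inline codepoint to_simple_lower(codepoint c) {
    if (c >= 0x41 && c <= 0x5A) return c + 0x20;
    if (c == 0x0130) return 0x0069;
    return c;
}

// Complex (full) mapping: may write more than one code point per input,
// which is why it needs an output iterator rather than a fixed buffer.
template <class InputIterator, class OutputIterator>
OutputIterator lowercase(InputIterator first, InputIterator last,
                         OutputIterator result) {
    for (; first != last; ++first) {
        const codepoint c = *first;
        if (c == 0x0130) {  // full mapping expands to U+0069 U+0307
            *result++ = codepoint(0x0069);
            *result++ = codepoint(0x0307);
        } else {
            *result++ = to_simple_lower(c);
        }
    }
    return result;
}

// Simple mapping over a range: output length always equals input length.
template <class InputIterator, class OutputIterator>
OutputIterator simple_lowercase(InputIterator first, InputIterator last,
                                OutputIterator result) {
    for (; first != last; ++first) {
        *result++ = to_simple_lower(*first);
    }
    return result;
}

}  // namespace unicode_sketch
```

Called with std::back_inserter on a std::vector of code points, the complex version can grow the output as needed, while the simple version could equally write in place over the input range.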

> The break functions:
> Couldn't these take iterators as well? For all use cases I can think
> of, this would be a much easier version to use:
> template <class InputIterator>
>     InputIterator advance_grapheme (InputIterator position,
>                                     InputIterator last);
> (etc.)
When I did my original coding I coded each of the following:
GetStartOfGrapheme
GetPreviousGrapheme
GetNextGrapheme
I found that just by having IsStartOfGrapheme all of these became really
simple routines.
I therefore believe extremely strongly that it is necessary to have
IsStartOfGrapheme, and that the others like GetNextGrapheme or
advance_grapheme will then be simple/inline wrappers that use
IsStartOfGrapheme.
I also found that there was a coding hit if you have to test start and
end iterator positions when processing the grapheme, hence I was passing
in three DWORDs.
Having said that, allowing inline versions that take iterators to call
the core uint32_t [DWORD] functions would be a good thing, and I would
expect this to happen.
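
As a rough sketch of that layering (an illustration under stated assumptions, not the actual code): one core predicate on plain uint32_t values, with the iterator version as a thin inline wrapper around it. The names unicode_sketch and is_start_of_grapheme are hypothetical, the predicate here takes only the previous and current code points rather than three DWORDs, and the break rule is a toy one (no break between CR and LF, no break before U+0300..U+036F), far short of the real grapheme rules.

```cpp
#include <cstdint>
#include <vector>

namespace unicode_sketch {  // hypothetical namespace, for illustration only

// Core predicate on raw code points. Toy rule set, assumed for this sketch:
// a code point starts a grapheme unless it is LF directly after CR, or a
// combining diacritical mark in U+0300..U+036F.
inline bool is_start_of_grapheme(std::uint32_t prev, std::uint32_t cur) {
    if (prev == 0x000D && cur == 0x000A) return false;
    if (cur >= 0x0300 && cur <= 0x036F) return false;
    return true;
}

// Thin wrapper: step from the start of one grapheme to the start of the
// next (or to last), delegating every decision to the core predicate.
template <class ForwardIterator>
ForwardIterator advance_grapheme(ForwardIterator position,
                                 ForwardIterator last) {
    if (position == last) return last;
    std::uint32_t prev = *position;
    ++position;
    while (position != last && !is_start_of_grapheme(prev, *position)) {
        prev = *position;
        ++position;
    }
    return position;
}

}  // namespace unicode_sketch
```

For the sequence U+0065 U+0301 U+0061, advance_grapheme from the beginning skips the combining mark and stops at U+0061. A GetPreviousGrapheme-style wrapper could be written the same way by walking backwards until the predicate returns true.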

> Finally, just thinking out loud: both the case mappings and collation
> have default (non-locale-specific) and tailored modes. Shouldn't those
> best be represented by classes rather than free functions, and
> shouldn't there thus be a global variable "default" that provides
> default operations, and other objects for locale-specific operations?

Unicode case mappings are not locale-specific.

I do not intend to handle any code page conversions at this stage - that
can be added later and should be handled in a separate discussion. Those
conversions would not be Unicode conversions, and I believe that
discussion should be postponed to a later date.

Yours,

Graham
