Re: [boost] Call for interest for native unicode character and string support in boost
by Graham
Jul 28 2005 11:35AM
> From: Rogier van Dalen <rogiervd@[...].com>
> Subject: Re: [boost] Call for interest for native unicode character
> and string support in boost
>
> Great, this seems a good first step. Glad to see things moving. I'll
> give my comments, but I hope Erik will step in so we can see what he's
> got.
>
> > I welcome comments.
>
> I agree with the general idea.
> First, http://www.boost.org/more/lib_guide.htm#Guidelines has coding
> guidelines. In general, your code looks slightly C-ish. The Boost
> habit is to use the ".hpp" extension for C++ headers. You attached a
> file "unicode.hpp" but talk about "Unicode.hpp": note that these are
> different names.
> I suggest we make a namespace "unicode" rather than prepending
> everything with "uni". The enums had probably better be put in
> structs.
>
> namespace unicode {
>     struct range {
>         enum type {
>             latin1_supplement,
>             latin_extended_a,
>             latin_extended_b,
>             ipa_extensions,
>             // ...
>         };
>     };
> }
Yes - it should be namespaced - I had omitted the namespace for clarity.
I still think that the uni prefix might be useful to remind programmers
who write 'using namespace unicode' that these are Unicode functions - but
I am happy to lose that argument.
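
As an illustration of the scoping point, here is a minimal sketch of the enum-in-struct idiom. The namespace name unicode_sketch and the helper first_codepoint are hypothetical and not part of the posted header; the block start values are the standard Unicode ones.

```cpp
namespace unicode_sketch {
    // Wrapping the enum in a struct scopes its enumerators, so call sites read
    // range::latin_extended_a even after a using-directive for the namespace.
    struct range {
        enum type {
            latin1_supplement,
            latin_extended_a,
            latin_extended_b,
            ipa_extensions
            // ... remaining blocks omitted
        };
    };

    // Hypothetical helper (not from the posted header): first code point of a block.
    inline unsigned long first_codepoint(range::type r) {
        switch (r) {
            case range::latin1_supplement: return 0x0080UL;
            case range::latin_extended_a:  return 0x0100UL;
            case range::latin_extended_b:  return 0x0180UL;
            case range::ipa_extensions:    return 0x0250UL;
        }
        return 0UL;  // only reached for values outside the enumerators above
    }
}
```

With this layout the enumerators keep the "range::" qualifier at every call site, which gives some of the reminder that the uni prefix was meant to provide.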
> The fact that I find "Hungarian notation" ugly and meaningless is
> probably irrelevant, but it's not the way it's generally done in
> Boost.
> char32_t is not yet a part of the C++ standard, I believe. I'm not
> sure, maybe we'd better call it "codepoint" anyway, and use #ifdef'ed
> typedef's.
> BOOL is not C++; it is spelled "bool". DWORD doesn't exist either; I
> believe you mean uint16_t (sic) for the collation data, if I
> understand correctly what the methods are doing. But I think collation
> should not be in this header yet, but rather be inserted later, when
> the string classes are defined.
Oops - caught - I was attempting to write it in such a way that it could
be used from C as well as C++ - hence BOOL not bool.
DWORD is actually uint32_t.
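
A minimal sketch of how the Windows-style types could map onto standard ones, assuming a compiler that provides <cstdint> (on older compilers <boost/cstdint.hpp> supplies an equivalent uint32_t). The names unicode_sketch, codepoint and is_scalar_value are hypothetical and used only for illustration.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // std::uint32_t

namespace unicode_sketch {
    // One Unicode code point; stands in for DWORD / char32_t in this sketch.
    typedef std::uint32_t codepoint;

    // Returns bool (not BOOL): true if c is a Unicode scalar value, i.e. at most
    // U+10FFFF and outside the surrogate range U+D800..U+DFFF.
    inline bool is_scalar_value(codepoint c) {
        return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
    }
}

// Code points run up to U+10FFFF, so at least 21 bits are required.
static_assert(sizeof(unicode_sketch::codepoint) * CHAR_BIT >= 21,
              "codepoint type is too narrow");
```

A C-compatible header would have to keep an integer return type instead of bool; the sketch above covers only the C++ side.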
I believe collation must be here as there will probably be several
containers with Unicode characteristics and this is a good level for
them to work on.
> Case conversion should probably take output iterators. That'll get rid
> of the complex/simple division. The methods should probably be
> templated as well, and take ranges rather than counts.
> template <class InputIterator, class OutputIterator>
> OutputIterator lowercase (InputIterator first, InputIterator last,
> OutputIterator result);
> template <class InputIterator, class OutputIterator>
> OutputIterator uppercase (InputIterator first, InputIterator last,
> OutputIterator result);
I like this, but we will still need a complex/simple division.
With iterators, however, the complex version can handle both cases, and
the simple version then becomes GetSimpleLowercase, for case conversion
that does not change length; it too can take an output iterator.
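
The complex/simple split can be sketched as follows. This is only an illustration under stated assumptions: the namespace unicode_sketch and the function names are hypothetical, and the mapping "table" covers just ASCII A-Z and U+0130, whose simple lowercase mapping is U+0069 and whose full lowercase mapping is U+0069 U+0307. A real implementation would be driven by the Unicode data files.

```cpp
#include <cstdint>

namespace unicode_sketch {
    typedef std::uint32_t codepoint;

    // One-to-one lowercase mapping for a single code point (toy table).
    inline codepoint to_simple_lower(codepoint c) {
        if (c >= 0x0041 && c <= 0x005A) return c + 0x20;  // ASCII A-Z
        if (c == 0x0130) return 0x0069;                   // simple mapping of U+0130
        return c;
    }

    // Simple lowercase over a range: writes exactly one code point per input.
    template <class InputIterator, class OutputIterator>
    OutputIterator simple_lowercase(InputIterator first, InputIterator last,
                                    OutputIterator result) {
        for (; first != last; ++first)
            *result++ = to_simple_lower(static_cast<codepoint>(*first));
        return result;
    }

    // Complex (full) lowercase: may write more than one code point per input,
    // which is why it needs an output iterator rather than an in-place buffer.
    template <class InputIterator, class OutputIterator>
    OutputIterator lowercase(InputIterator first, InputIterator last,
                             OutputIterator result) {
        for (; first != last; ++first) {
            const codepoint c = static_cast<codepoint>(*first);
            if (c == 0x0130) {        // full mapping expands to two code points
                *result++ = 0x0069;
                *result++ = 0x0307;
            } else {
                *result++ = to_simple_lower(c);
            }
        }
        return result;
    }
}
```

Both functions return the advanced output iterator, in the style of std::copy, so the caller can tell how many code points were written.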
> The break functions:
> Couldn't these take iterators as well? For all use cases I can think
> of, this would be a much easier version to use:
> template <class InputIterator>
> InputIterator advance_grapheme (InputIterator position,
> InputIterator last);
> (etc.)
When I did my original coding I coded each of the following:
GetStartOfGrapheme
GetPreviousGrapheme
GetNextGrapheme
I found that just by having IsStartOfGrapheme all these became really
simple routines.
I therefore believe extremely strongly that it is necessary to have
IsStartOfGrapheme and that the others like GetNextGrapheme or
advance_grapheme will then be simple/inline wrappers that use
IsStartOfGrapheme.
I also found that there was a coding hit if you have to test start and
end iterator positions when processing the grapheme, hence I was passing
in three DWORDs.
Having said that, allowing inline versions that take iterators to call
the core uint32_t [DWORD] functions would be a good thing, and I would
expect this to happen.
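
A minimal sketch of that layering: a core predicate on raw code points, with thin iterator wrappers on top. All names here are hypothetical, the predicate takes a single code point rather than the three-DWORD signature mentioned above, and the boundary rule is a gross simplification (only U+0300..U+036F, the Combining Diacritical Marks block, is treated as extending a grapheme). The real rules are considerably more involved.

```cpp
#include <cstdint>

namespace unicode_sketch {
    typedef std::uint32_t codepoint;

    // Toy rule: a code point in U+0300..U+036F continues the previous grapheme.
    inline bool is_grapheme_extend(codepoint c) {
        return c >= 0x0300 && c <= 0x036F;
    }

    // Core predicate: does a grapheme start at this code point?
    inline bool is_start_of_grapheme(codepoint current) {
        return !is_grapheme_extend(current);
    }

    // Wrapper: step from position past one grapheme; returns last if none remains.
    template <class InputIterator>
    InputIterator advance_grapheme(InputIterator position, InputIterator last) {
        if (position == last) return last;
        ++position;
        while (position != last && !is_start_of_grapheme(*position)) ++position;
        return position;
    }

    // Wrapper: move back from a dereferenceable position to the start of the
    // grapheme that contains it, never going before first.
    template <class BidirectionalIterator>
    BidirectionalIterator start_of_grapheme(BidirectionalIterator first,
                                            BidirectionalIterator position) {
        while (position != first && !is_start_of_grapheme(*position)) --position;
        return position;
    }
}
```

Both wrappers are only a few lines each once the predicate exists, which matches the experience described above.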
> Finally, just thinking out loud: both the case mappings and collation
> have default (non-locale-specific) and tailored modes. Shouldn't those
> best be represented by classes rather than free functions, and
> shouldn't there thus be a global variable "default" that provides
> default operations, and other objects for locale-specific operations?
Unicode case mappings are not locale-specific.
I do not intend to handle any code page conversions at this stage - that
can be added later. Those conversions would not be Unicode conversions,
and I believe they should be handled in a separate discussion at a later
date.
Yours,
Graham