unicode, C++, python 2.2

Sun Sep 11 16:21:44 EDT 2005

Trond Eivind Glomsrød wrote:
> I am currently writing a python interface to a C++ library.  Some of the
> functions in this library take unicode strings (UTF-8, mostly) as
> arguments.
> 
> However, when getting these data I run into problem on python 2.2
> (RHEL3) - while the data is all nice UCS4 in 2.3, in 2.2 it seems to be
> UTF-8 on top of UCS4.  UTF8 encoded in UCS4, meaning that 3 bytes of the
> UCS4 char is 0 and the first one contains a byte of the string encoding
> in UTF-8.
> 
> Is there a trick to get python 2.2 to do UCS4 more cleanly?

It's hard to tell from your message what your problem really is, as we
have no clue what "these data" are. How do you know they are "nice
UCS4" in 2.3? Are you looking at the internal representation at the
C level, or are you looking at something else? Do you use byte strings
or Unicode strings?
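
One way to answer these questions from the Python side is to report the
type, the repr() and the individual values of whatever your wrapper
returns. This is only a sketch: inspect_value is a hypothetical helper
name, and it is written for a current Python (on 2.2 you would test
against str and unicode instead of bytes).

```python
import sys

def inspect_value(value):
    """Return (type name, repr, list of byte values or code points)."""
    if isinstance(value, bytes):
        # Byte string: list the raw byte values.
        units = list(bytearray(value))
    else:
        # Unicode string: list the code point of each character.
        units = [ord(ch) for ch in value]
    return type(value).__name__, repr(value), units

# As far as I know, sys.maxunicode is above 0xFFFF only when the
# interpreter can hold full UCS4 code points in a Unicode string.
wide_build = sys.maxunicode > 0xFFFF
```

Posting the output of such a helper for the same input on 2.2 and 2.3
would show whether the two versions really differ.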

You tried to explain what "UTF8 encoded in UCS4" might be, but I'm
not sure I understand the explanation: what precise sequence of
statements did you use to create such a thing, and what precisely
does it look like (what exact byte is first, what is second, and so
on)?
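
My best guess at what "UTF8 encoded in UCS4" means is that the UTF-8
bytes were widened one byte per character instead of being decoded as
UTF-8. The sketch below reproduces that situation under this
assumption only; the sample text is made up, and the code is written
for a current Python rather than 2.2.

```python
# Hypothetical sample text containing one non-ASCII character.
text = u"Glomsr\u00f8d"

# What a correctly decoded Unicode string holds: one code point per character.
proper = [ord(ch) for ch in text]

# The UTF-8 encoding of the same text, as a byte string.
utf8_bytes = text.encode("utf-8")

# Assumed failure mode: each UTF-8 byte becomes its own character,
# which is what decoding the bytes as Latin-1 would produce.
widened = utf8_bytes.decode("latin-1")
mangled = [ord(ch) for ch in widened]

# If that is what happened, it can be undone by reversing the two steps.
repaired = widened.encode("latin-1").decode("utf-8")
```

If the values you see match "mangled" rather than "proper", then the
data was never decoded as UTF-8 in the first place.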

Regards,
Martin