[Python-Dev] String encoding

Fred L. Drake fdrake@acm.org
Tue, 23 May 2000 08:13:59 -0700 (PDT)


On Tue, 23 May 2000, M.-A. Lemburg wrote:
 > The problem is that "s" and "t" return C pointers to some
 > internal data structure of the object. It has to be assured
 > that this data remains intact at least as long as the object
 > itself exists.
 > 
 > AFAIK, this cannot be fixed without creating a memory leak.
 >  
 > The "es" parser marker uses a different strategy, BTW: the
 > data is copied into a buffer, thus detaching the object
 > from the data.
 > 
 > > > C APIs which want to support Unicode should be fixed to use
 > > > "es" or query the object directly and then apply proper, possibly
 > > > OS dependent conversion.
 > > 
 > > for convenience, it might be a good idea to have a "wide system
 > > encoding" too, and special parser markers for that purpose.
 > > 
 > > or can we assume that all wide system APIs use Unicode all the
 > > time?
 > 
 > At least in all references I've seen (e.g. ODBC, wchar_t
 > implementations, etc.) "wide" refers to Unicode.

  On Linux, wchar_t is 4 bytes, which is wider than 16-bit Unicode needs.
Doesn't ISO 10646 require a 32-bit space?
  I recall a fair bit of discussion about wchar_t when it was introduced
to ANSI C, and the character set and encoding were specifically not made
part of the specification.  Requiring that wchar_t be Unicode doesn't make
a lot of sense, and opens up potential portability issues.
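
  As a rough illustration of the portability point, here is a minimal C
sketch that asks the compiler what wchar_t actually is on a given platform.
The helper names (wchar_bits, wchar_is_iso10646, wchar_holds_any_code_point)
are made up for this example; the only standard pieces are sizeof, CHAR_BIT,
WCHAR_MAX, and the C99 macro __STDC_ISO_10646__, which an implementation
defines only if it promises that wchar_t values are ISO 10646 code points.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t, wchar_t */
#include <wchar.h>    /* WCHAR_MAX */

/* Hypothetical helper: width of wchar_t in bits on this platform.
   The C standard leaves this up to the implementation. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: 1 if the implementation states (via the C99
   macro __STDC_ISO_10646__) that wchar_t values are ISO 10646 code
   points, 0 if it makes no such promise. */
static int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 if a single wchar_t can represent every code
   point up to U+10FFFF, 0 if it cannot (e.g. a 16-bit wchar_t). */
static int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

  Calling these from a small main() would typically report 32 bits on
glibc-based Linux and 16 bits on Windows, which is exactly why code that
assumes one particular width or encoding for wchar_t is not portable.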

-1 on any assumption that wchar_t is usefully portable.


  -Fred


-- 
Fred L. Drake, Jr.  <fdrake at acm.org>