[Python-Dev] Python's Unicode width default (New Py_UNICODE doc)

Bob Ippolito bob at redivi.com
Sat May 14 21:39:17 CEST 2005


On May 14, 2005, at 3:05 PM, Shane Hathaway wrote:

> M.-A. Lemburg wrote:
>
>> It is important to be able to rely on a default that
>> is used when no special options are given. The decision
>> to use UCS2 or UCS4 is much too important to be
>> left to a configure script.
>>
>
> Should the choice be a runtime decision?  I think it should be.  That
> could mean two unicode types, a call similar to
> sys.setdefaultencoding(), a new unicode extension module, or
> something else.
>
> BTW, thanks for discussing these issues.  I tried to write a patch to
> the unicode API documentation, but it's hard to know just what to
> write.  I think I can say this: "sometimes your strings are UTF-16,
> so you're working with code units that are not necessarily complete
> code points; sometimes your strings are UCS4, so you're working with
> code units that are also complete code points.  The choice between
> UTF-16 and UCS4 is made at the time the Python interpreter is compiled
> and the default choice varies by operating system and configuration."
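
To make the code unit vs. code point distinction in the quoted text
concrete, here is a small sketch.  It uses present-day Python 3 syntax
and explicit encodings rather than any particular interpreter build,
and the helper names utf16_code_units and ucs4_code_units are made up
for illustration; nothing here is part of the unicode API.

```python
def utf16_code_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def ucs4_code_units(s):
    """Return the 32-bit code units of s when encoded as UCS4 (UTF-32)."""
    data = s.encode("utf-32-le")
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]


# 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP.
s = "a\U0001D11E"

units16 = utf16_code_units(s)   # three units: 'a' plus a surrogate pair
units32 = ucs4_code_units(s)    # two units: one per code point
```

In the UTF-16 form the second character occupies two code units (a
surrogate pair), so indexing by code unit can land in the middle of a
code point.  In the UCS4 form every code unit is a whole code point.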

Well, if you're going to make it runtime, you might as well do it  
right.  Take away the restriction that the unicode type backing store  
is forced to be a particular encoding (i.e. get rid of  
PyUnicode_AS_UNICODE) and give it more flexibility.

The implementation of NSString in OpenDarwin's libFoundation
<http://libfoundation.opendarwin.org/> (BSD license), or the CFString
implementation in Apple's CoreFoundation
<http://developer.apple.com/darwin/cflite.html> (APSL) would be an
excellent place to look for how this can be done.

Of course, for backwards compatibility reasons, this would have to be
a new type that descends from basestring; "text" would probably be a
good name for it.  This would be an abstract type, with concrete
subclasses that actually implement the various operations as
necessary.  For example, you could have text_ucs2, text_ucs4,
text_ascii, text_codec, etc.
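
A rough sketch of what such a hierarchy could look like, written in
present-day Python 3 syntax rather than the 2.x of this thread.  Only
the names text, text_ascii and text_ucs4 come from the message above;
code_point_at, make_text, and the byte-string backing stores are
invented here for illustration and are not an existing API.

```python
from abc import ABC, abstractmethod


class text(ABC):
    """Abstract text type; subclasses choose their own backing store."""

    @abstractmethod
    def __len__(self):
        """Return the number of code points."""

    @abstractmethod
    def code_point_at(self, index):
        """Return the code point at index as an int (hypothetical method)."""

    # Generic operations are written once, against the abstract interface.
    def __iter__(self):
        for i in range(len(self)):
            yield self.code_point_at(i)

    def __eq__(self, other):
        if not isinstance(other, text):
            return NotImplemented
        return len(self) == len(other) and all(
            a == b for a, b in zip(self, other))

    def __hash__(self):
        return hash(tuple(self))


class text_ascii(text):
    """One byte per character; rejects anything outside ASCII."""

    def __init__(self, s):
        self._data = s.encode("ascii")  # raises UnicodeEncodeError otherwise

    def __len__(self):
        return len(self._data)

    def code_point_at(self, index):
        return self._data[index]


class text_ucs4(text):
    """Four bytes per character; can hold any code point."""

    def __init__(self, s):
        self._data = s.encode("utf-32-le")

    def __len__(self):
        return len(self._data) // 4

    def code_point_at(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return int.from_bytes(self._data[4 * index:4 * index + 4], "little")


def make_text(s):
    """Pick the most compact concrete subclass that can hold s."""
    try:
        return text_ascii(s)
    except UnicodeEncodeError:
        return text_ucs4(s)
```

Callers only see the abstract interface, so text_ascii("abc") and
text_ucs4("abc") compare equal even though one stores 3 bytes and the
other 12.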

The bonus here is you can get people to shut up about space-efficient
representations, because each concrete subclass can use whatever
representation makes sense.

-bob


