[Python-Dev] str() vs. unicode()

Guido van Rossum guido@python.org
Tue, 25 Sep 2001 01:17:07 -0400


> Ok, let's remove the buffer API from unicode(). Should it still be
> maintained for unicode(obj, encoding, errors) ?

I think so, yes.

> Hmm, perhaps we do need a __unicode__/tp_unicode slot after all. 
> It would certainly help clarify the communication between the 
> interpreter and the object.

Would you settle for a __unicode__ method but no tp_unicode slot?
It's easy enough to define a C method named __unicode__ if the need
arises.  This should always be tried first, not just for classic
instances.  Adding a slot is a bit painful now that there are so many
new slots already (adding it to the end means you have to add tons of
zeros, adding it to the middle means I have to edit every file).

> > To convert one of these to Unicode given an encoding, shouldn't their
> > decode() method be used?
> 
> Right... perhaps we don't need __unicode__ after all: the .decode()
> method already provides this functionality (on strings at least).

So maybe we should deprecate unicode(obj, encoding[, error]) and
recommend obj.decode(encoding[, error]) instead.  But this means that
objects with a buffer API but no decode() method cannot be decoded
efficiently.  That's what unicode(obj, encoding[, error]) was good for.
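The per-object decode() spelling discussed here survives in modern Python, where bytes objects carry a decode() method; a small sketch of what that recommendation looks like today (Python 3 analogue, not the 2001 API):

```python
# bytes.decode(encoding[, errors]) is the object-level equivalent of
# the old unicode(obj, encoding[, error]) constructor call.
raw = b"caf\xc3\xa9"                 # UTF-8 encoded bytes
text = raw.decode("utf-8")           # decode to a text string
assert text == "caf\u00e9"

# The optional error-handling argument maps onto the [, error] parameter:
lossy = b"caf\xff".decode("utf-8", errors="replace")
assert lossy == "caf\ufffd"          # invalid byte replaced with U+FFFD
```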

To decide, we need to know how useful it is in practice to be able to
decode buffers -- I doubt it is very useful, since most types
supporting the buffer API are not text but raw data like memory-mapped
files, arrays, PIL images.

> > Really, this is such an incredible morass of APIs that I wonder if we
> > shouldn't start over...  There are altogether too many places in the
> > code where PyUnicode_Check() is used.  I wish there was a better
> > way...
> 
> Ideally, we'd need a new base class for strings and then have 8-bit 
> and Unicode be subclasses of this base class. There are several
> problems with this approach though; one certainly being the different
> memory allocation mechanisms used (strings store the value in the
> object, Unicode references an external buffer), the other
> being the different nature: strings don't carry meta-information
> while Unicode is in many ways restricted in use.

I've thought of defining an abstract base class "string" from which
both str and unicode derive.  Since str and unicode don't share
representation, they shouldn't share implementation, but they could
still share interface.  Certainly conceptually this is how we think of
strings.
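The shared-interface-without-shared-representation idea can be sketched with an abstract base class in modern Python; every name below is hypothetical, chosen only to illustrate the design, not anything CPython ever shipped:

```python
from abc import ABC, abstractmethod

class AbstractString(ABC):
    """Shared interface; concrete subclasses choose their own storage."""

    @abstractmethod
    def upper(self):
        """Return an upper-cased text version of the contents."""

    def shout(self):
        # Shared behaviour written purely against the interface.
        return self.upper() + "!"

class ByteString(AbstractString):
    """Stores its value as raw bytes (like the old 8-bit str)."""
    def __init__(self, data: bytes):
        self.data = data
    def upper(self):
        return self.data.upper().decode("ascii")

class TextString(AbstractString):
    """Stores its value as text (like unicode)."""
    def __init__(self, data: str):
        self.data = data
    def upper(self):
        return self.data.upper()

assert ByteString(b"abc").shout() == "ABC!"
assert TextString("abc").shout() == "ABC!"
```

The two subclasses share no representation at all, yet code written against AbstractString works with either, which is the conceptual point of the proposal.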

Useless thought: the string class would have unbound methods that are
almost the same as the functions defined in the string module, e.g.
string.split(s) and string.strip(s) could be made to call s.split()
and s.strip(), just like the module.  The class could have data
attributes for string.whitespace etc.  But string.join() would have a
different signature: the class method is join(s, list) while the
function is join(list, s).  So we can't quite make the module an alias
for the class. :-(
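The unbound-method idea, including the join() signature clash, can be demonstrated with today's str type (a Python 3 illustration of the 2001 thought, not the string module itself):

```python
# Calling str methods as plain functions mirrors the old string-module
# functions: str.split(s) behaves like string.split(s).
assert str.split("one two three") == ["one", "two", "three"]
assert str.strip("  padded  ") == "padded"

# The signature mismatch: the method form is sep.join(seq), i.e.
# str.join(sep, seq), while the old string.join() was join(seq, sep) --
# so the module could never be a straight alias for the class.
assert str.join(", ", ["a", "b"]) == "a, b"
```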

> I would like to boil this down to one API if possible which then
> implements unicode(obj) and unicode(obj, encoding, errors) -- if
> no encoding is given, the semantics of PyObject_Str() are closely
> followed; with an encoding, the semantics of PyUnicode_FromEncodedObject()
> are used as before (with the buffer interface logic removed).

I would actually recommend using two different C level APIs:
PyObject_Unicode() to implement unicode(obj), which should follow
str(obj), and PyUnicode_FromEncodedObject() to implement unicode(obj,
encoding[, error]), which should use the buffer API on obj.

> In a first step, I'd use the tp_str/__str__ for unicode(obj) as
> well. Later we can add a tp_unicode/__unicode__ lookup before
> trying tp_str/__str__ as fallback.

I would add __unicode__ support without tp_unicode right away.  I
would use tp_str without even looking at __str__.
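The lookup order described here can be sketched at the Python level; to_unicode() and the example classes are hypothetical names for illustration only, and looking the hook up on type(obj) stands in for the C-level type-slot lookup:

```python
def to_unicode(obj):
    # Try a __unicode__ method first (looked up on the type, mirroring
    # slot lookup rather than instance-dict lookup)...
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    # ...then fall back to the str conversion (tp_str at the C level).
    return str(obj)

class WithHook:
    def __unicode__(self):
        return "converted via __unicode__"

class WithoutHook:
    def __str__(self):
        return "converted via __str__"

assert to_unicode(WithHook()) == "converted via __unicode__"
assert to_unicode(WithoutHook()) == "converted via __str__"
```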

> If this sounds reasonable, I'll give it a go...

Yes.

--Guido van Rossum (home page: http://www.python.org/~guido/)