[Python-Dev] Re: Type-converting functions, esp. unicode() vs. unistr()

M.-A. Lemburg mal@lemburg.com
Fri, 19 Jan 2001 10:58:08 +0100

Guido van Rossum wrote:
> > Ka-Ping Yee wrote:
> > >
> > > On Thu, 18 Jan 2001, Ka-Ping Yee wrote:
> > > >     str() looks for __str__
> > >
> > > Oops.  I forgot that
> > >
> > >       str() looks for __str__, then tries __repr__
> > >
> > > So, presumably,
> > >
> > >       unicode() should look for __unicode__, then __str__, then __repr__
> >
> > Not quite... str() does this:
> >
> > 1. strings are passed back as-is
> > 2. the type slot tp_str is tried
> > 3. the method __str__ is tried
> > 4. Unicode returns are converted to strings
> > 5. anything other than a string return value is rejected
> >
> > unistr() does the same, but makes sure that the return
> > value is a Unicode object.
> >
> > unicode() does the following:
> >
> > 1. for instances, __str__ is called
> > 2. Unicode objects are returned as-is
> > 3. string objects or character buffers are used as basis for decoding
> > 4. decoding is applied to the character buffer and the results
> >    are returned
> >
> > I think we should perhaps merge the two approaches into one
> > which then applies all of the above in unicode() (and then
> > forget about unistr()). This might hide some type errors,
> > but since all other generic constructors behave more or less
> > in the same way, I think unicode() should too.
> Yes, I would like to see these merged.  I noticed that e.g. there is
> special code to compare Unicode strings in the comparison code (I
> think I *could* get rid of this now we have rich comparisons, but I
> decided to put that off), and when I looked at it it uses the same set
> of conversions as unicode().  Some of these seem questionable to me --
> why do you try so many ways to get a string out of an object?  (On the
> other hand the merge of unicode() and unistr() might have this effect
> anyway...)

... because there are so many ways to get at string
representations of objects in Python at C level.

If we agree to merge the semantics of the two APIs, then str()
would have to change too: is this desirable ? (IMHO, yes)

Here's what we could do:

a) merge the semantics of unistr() into unicode()
b) apply the same semantics in str()
c) remove unistr() -- how's that for a short-lived builtin ;)

About the semantics:

These should be backward compatible with str() in that everything
that worked before should continue to work after the merge.

A strawman for processing str() and unicode():

1. strings/Unicode objects are passed back as-is
2. tp_str is tried
3. the method __str__ is tried
4. the PyObject_AsCharBuffer() API is tried (bf_getcharbuffer)
5. for str(): Unicode return values are converted to strings using
              the default encoding
   for unicode(): Unicode return values are passed back as-is;
              string return values are decoded according to the
              encoding parameter
6. the return object is type-checked: str() will always return
   a string object, unicode() always a Unicode object

Note that passing back Unicode is only allowed in case no encoding
was given. Otherwise an exception is raised: you can't decode Unicode.
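
To make the strawman easier to discuss, here is a rough model of it in
present-day Python 3. It is only a sketch under several assumptions:
bytes stands in for the string type and str for the Unicode type, the
names model_str(), model_unicode() and _convert() are made up for this
illustration, the default encoding is taken to be ASCII, and raising
TypeError when none of the checks applies is my own choice (the list
above does not say what should happen then). tp_str and __str__ cannot
be told apart from Python code, so steps 2 and 3 collapse into one.

```python
# Model of the strawman: bytes plays the "string" type, str the "Unicode" type.
# All names below are hypothetical and exist only for this illustration.

DEFAULT_ENCODING = "ascii"  # assumption: the interpreter's default encoding


def _convert(obj):
    """Steps 1-4: obtain a raw value (ideally bytes or str) from obj."""
    # 1. strings/Unicode objects are passed back as-is
    if isinstance(obj, (bytes, str)):
        return obj
    # 2./3. tp_str and __str__: only used if the type overrides the default
    str_method = type(obj).__str__
    if str_method is not object.__str__:
        return str_method(obj)
    # 4. character buffer access (modelled here with the buffer protocol)
    try:
        return bytes(memoryview(obj))
    except TypeError:
        # Assumption of this sketch: nothing applied, so reject the object.
        raise TypeError(
            "cannot convert %s object" % type(obj).__name__) from None


def model_str(obj):
    """Proposed str(): always returns a string (bytes in this model)."""
    value = _convert(obj)
    # 5. Unicode return values are converted using the default encoding
    if isinstance(value, str):
        value = value.encode(DEFAULT_ENCODING)
    # 6. type check on the result
    if not isinstance(value, bytes):
        raise TypeError("conversion did not return a string")
    return value


def model_unicode(obj, encoding=None, errors="strict"):
    """Proposed unicode(): always returns Unicode (str in this model)."""
    value = _convert(obj)
    if isinstance(value, str):
        # 5. Unicode is passed back as-is, but only if no encoding was given
        if encoding is not None:
            raise TypeError("decoding Unicode is not supported")
        return value
    if isinstance(value, bytes):
        # 5. string return values are decoded using the encoding parameter
        return value.decode(encoding or DEFAULT_ENCODING, errors)
    # 6. type check on the result
    raise TypeError("conversion did not return a string or Unicode")


class Greeting:
    """Example object whose __str__ hands back Unicode."""

    def __str__(self):
        return "hello"


print(model_str(Greeting()))      # a bytes object
print(model_unicode(Greeting()))  # a str object
```

Both functions share steps 1-4 and differ only in how the raw value is
coerced at the end, which is the point of merging the semantics.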

As an extension, we could add encoding and error parameters to str()
as well. The result would be either an encoding of Unicode objects
passed back by tp_str or __str__ or a recoding of string objects
returned by checks 2, 3 or 4.
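
A possible reading of that extension, again as a Python 3 sketch with
bytes standing in for strings and str for Unicode. The helper name
finish_str() is hypothetical, the default encoding is assumed to be
ASCII, and "recoding" is interpreted here as decoding with the default
encoding and re-encoding with the requested one; other readings are
possible.

```python
DEFAULT_ENCODING = "ascii"  # assumption: the interpreter's default encoding


def finish_str(value, encoding=None, errors="strict"):
    """Final step of an extended str() (hypothetical helper).

    value is whatever tp_str, __str__ or the character buffer returned.
    """
    if isinstance(value, str):
        # Unicode result: encode it with the requested (or default) encoding.
        return value.encode(encoding or DEFAULT_ENCODING, errors)
    if isinstance(value, bytes):
        if encoding is None:
            # No encoding requested: the string is passed back unchanged.
            return value
        # Assumed meaning of "recoding": default encoding -> target encoding.
        return value.decode(DEFAULT_ENCODING, errors).encode(encoding, errors)
    raise TypeError("conversion did not return a string or Unicode")


print(finish_str("caf\u00e9", "utf-8"))   # Unicode encoded as UTF-8
print(finish_str(b"abc", "utf-16-le"))    # ASCII string recoded as UTF-16-LE
```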

If we agree to take this approach, then we should remove the
unistr() Python API before the alpha ships.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/