[Python-Dev] Re: Type-converting functions, esp. unicode() vs. unistr()

Fri, 19 Jan 2001 10:58:08 +0100

Guido van Rossum wrote:
> 
> > Ka-Ping Yee wrote:
> > >
> > > On Thu, 18 Jan 2001, Ka-Ping Yee wrote:
> > > >     str() looks for __str__
> > >
> > > Oops.  I forgot that
> > >
> > >       str() looks for __str__, then tries __repr__
> > >
> > > So, presumably,
> > >
> > >       unicode() should look for __unicode__, then __str__, then __repr__
> >
> > Not quite... str() does this:
> >
> > 1. strings are passed back as-is
> > 2. the type slot tp_str is tried
> > 3. the method __str__ is tried
> > 4. Unicode returns are converted to strings
> > 5. anything other than a string return value is rejected
> >
> > unistr() does the same, but makes sure that the return
> > value is a Unicode object.
> >
> > unicode() does the following:
> >
> > 1. for instances, __str__ is called
> > 2. Unicode objects are returned as-is
> > 3. string objects or character buffers are used as basis for decoding
> > 4. decoding is applied to the character buffer and the results
> >    are returned
> >
> > I think we should perhaps merge the two approaches into one
> > which then applies all of the above in unicode() (and then
> > forget about unistr()). This might hide some type errors,
> > but since all other generic constructors behave more or less
> > in the same way, I think unicode() should too.
> 
> Yes, I would like to see these merged.  I noticed that e.g. there is
> special code to compare Unicode strings in the comparison code (I
> think I *could* get rid of this now we have rich comparisons, but I
> decided to put that off), and when I looked at it it uses the same set
> of conversions as unicode().  Some of these seem questionable to me --
> why do you try so many ways to get a string out of an object?  (On the
> other hand the merge of unicode() and unistr() might have this effect
> anyway...)

... because there are so many ways to get at string
representations of objects in Python at C level.

If we agree to merge the semantics of the two APIs, then str()
would have to change too: is this desirable? (IMHO, yes)

Here's what we could do:

a) merge the semantics of unistr() into unicode()
b) apply the same semantics in str()
c) remove unistr() -- how's that for a short-lived builtin ;)

About the semantics:

These should be backward compatible with str() in that everything
that worked before should continue to work after the merge.

A strawman for processing str() and unicode():

1. strings and Unicode objects are passed back as-is
2. tp_str is tried
3. the method __str__ is tried
4. the PyObject_AsCharBuffer() API is tried (bf_getcharbuffer)
5. for str(): Unicode return values are converted to strings using
              the default encoding
   for unicode(): Unicode return values are passed back as-is;
              string return values are decoded according to the
              encoding parameter
6. the return object is type-checked: str() will always return
   a string object, unicode() always a Unicode object

Note that passing back Unicode is only allowed in case no encoding
was given. Otherwise an exception is raised: you can't decode
Unicode.
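
The strawman above can be sketched roughly as follows. This is only
an illustrative model, not the C implementation: it is written in
Python 3 terms, where bytes stands in for the 8-bit string type and
str for Unicode. The names unicode_model(), str_model(), _raw_value()
and DEFAULT_ENCODING are made up for the example, "ascii" as default
encoding is an assumption, tp_str and __str__ are folded into one
lookup because they cannot be told apart from Python code, and
memoryview() stands in for PyObject_AsCharBuffer().

```python
import array

DEFAULT_ENCODING = "ascii"  # assumption: stands in for the default encoding


def _raw_value(obj):
    """Steps 1-4: obtain a bytes or str value from obj (hypothetical helper)."""
    if isinstance(obj, (bytes, str)):            # step 1: passed back as-is
        return obj
    str_method = type(obj).__str__
    if str_method is not object.__str__:         # steps 2 and 3: tp_str / __str__
        return str_method(obj)
    try:
        return bytes(memoryview(obj))            # step 4: character buffer
    except TypeError:
        # The strawman does not say where a repr() fallback would fit,
        # so this sketch simply gives up here.
        raise TypeError("no string value for %s" % type(obj).__name__) from None


def unicode_model(obj, encoding=None, errors="strict"):
    """Steps 5 and 6 for unicode(): always returns str (Unicode)."""
    value = _raw_value(obj)
    if isinstance(value, str):
        if encoding is not None:                 # you can't decode Unicode
            raise TypeError("decoding Unicode is not supported")
        return value
    if isinstance(value, bytes):
        return value.decode(encoding or DEFAULT_ENCODING, errors)
    raise TypeError("expected a string or Unicode return value")


def str_model(obj):
    """Steps 5 and 6 for str(): always returns bytes (8-bit string)."""
    value = _raw_value(obj)
    if isinstance(value, str):
        return value.encode(DEFAULT_ENCODING)    # Unicode -> default encoding
    if isinstance(value, bytes):
        return value
    raise TypeError("expected a string or Unicode return value")


class Greeting:
    """Example object whose __str__ hands back Unicode."""
    def __str__(self):
        return "hello"


buf = array.array("B", [97, 98, 99])             # exposes a buffer, no __str__
print(unicode_model(Greeting()), str_model(Greeting()), unicode_model(buf))
```

Calling unicode_model("abc", "utf-8") raises TypeError in this model,
which mirrors the rule that Unicode may only be passed back when no
encoding is given.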

As an extension we could add encoding and error parameters to str()
as well. The result would be either an encoding of Unicode objects
passed back by tp_str or __str__ or a recoding of string objects
returned by checks 2, 3 or 4.
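
One possible reading of this extension, again as a hedged Python 3
sketch with bytes for 8-bit strings and str for Unicode. finish_str()
and DEFAULT_ENCODING are hypothetical names, and the sketch assumes
that "recoding" means decoding the string from the default encoding
and re-encoding it to the requested one; the proposal itself does not
spell out the source encoding.

```python
DEFAULT_ENCODING = "ascii"  # assumption: stands in for the default encoding


def finish_str(value, encoding=None, errors="strict"):
    """Turn a value returned by checks 2, 3 or 4 into an 8-bit string."""
    if isinstance(value, str):
        # Unicode return value: encode it with the requested encoding.
        return value.encode(encoding or DEFAULT_ENCODING, errors)
    if isinstance(value, bytes):
        if encoding is None:
            return value                         # nothing to recode
        # Assumed recoding path: default encoding -> Unicode -> target.
        text = value.decode(DEFAULT_ENCODING, errors)
        return text.encode(encoding, errors)
    raise TypeError("expected a string or Unicode return value")


print(finish_str("caf\u00e9", "utf-8"))
print(finish_str(b"abc", "utf-16-le"))
```

With errors="replace", characters that the target encoding cannot
represent are substituted instead of raising, which is the usual
behaviour of the codec error handlers.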

If we agree to take this approach, then we should remove the
unistr() Python API before the alpha ships.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/