[Python-Dev] Re: Type-converting functions, esp. unicode() vs. unistr()

M.-A. Lemburg mal@lemburg.com
Fri, 19 Jan 2001 22:32:34 +0100

Guido van Rossum wrote:
> > If we agree to merge the semantics of the two APIs, then str()
> > would have to change too: is this desirable ? (IMHO, yes)
> Not clear.  Which is why I'm backing off from my initial support for
> merging the two.
> I believe unicode() (which is really just an interface to
> PyUnicode_FromEncodedObject()) currently already does too much.  In
> particular this whole business with calling __str__ on instances seems
> to me to be unnecessary.  I think it should *only* bother to look for
> something that supports the buffer interface (checking for regular
> strings only as a tiny optimization), or existing unicode objects.

Hmm, unicode() should (just like str()) take an object and
convert it to a Unicode string. Since many objects don't
support the tp_str slot (instances don't for some reason -- just
like they don't support tp_call), I had to add some special cases
to make Python instances compatible with Unicode in the same way
str() does.

What I think is really needed is a concept for "stringification"
in Python. We currently have these schemes:

1. tp_str
2. method __str__ (not only of Python instances, but any object)
3. character buffer interface

These three could easily be unified into the tp_str slot:
e.g. tp_str could do the necessary magic to call __str__
or the buffer interface.
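
A rough sketch of what such a unified entry point could look like,
written in present-day Python rather than C. The name stringify() is
made up for illustration, memoryview() stands in for the character
buffer interface, and the ASCII decoding is an assumption, not
something the interpreter does today:

```python
import types


def stringify(obj):
    """Hypothetical single entry point standing in for a unified tp_str."""
    # Scheme 2: a __str__ method written in Python on the object's class.
    method = getattr(type(obj), "__str__", None)
    if isinstance(method, types.FunctionType):
        return method(obj)
    # Scheme 3: the character buffer interface, modeled with memoryview().
    try:
        view = memoryview(obj)
    except TypeError:
        # Scheme 1: fall back to whatever the type's own slot provides.
        return str(obj)
    # Assumption: buffer contents are plain ASCII text.
    return view.tobytes().decode("ascii")


class Point:
    def __str__(self):
        return "point"


print(stringify(Point()))   # handled by the Python-level __str__
print(stringify(b"abc"))    # handled by the buffer branch
print(stringify(42))        # handled by the type's own conversion
```

The caller only ever sees one function; which of the three schemes
produced the result is an implementation detail.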

Note that the same is true for e.g. tp_call -- the special
cases we have in ceval.c for the different builtin callable
objects would not be necessary if those objects implemented tp_call.
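
To illustrate the point, here is a toy model in Python (not the actual
ceval.c code). The classes Builtin and BoundMethod and the method name
tp_call are hypothetical stand-ins for the C-level objects and slot:

```python
class Builtin:
    """Hypothetical stand-in for a builtin function object."""

    def __init__(self, func):
        self.func = func

    def tp_call(self, *args):
        return self.func(*args)


class BoundMethod:
    """Hypothetical stand-in for a method bound to an instance."""

    def __init__(self, func, instance):
        self.func = func
        self.instance = instance

    def tp_call(self, *args):
        return self.func(self.instance, *args)


def call_special_cased(obj, *args):
    # The caller has to know every kind of callable object.
    if isinstance(obj, Builtin):
        return obj.func(*args)
    if isinstance(obj, BoundMethod):
        return obj.func(obj.instance, *args)
    raise TypeError("object is not callable")


def call_via_slot(obj, *args):
    # The caller only asks for the slot; each type supplies its own logic.
    slot = getattr(obj, "tp_call", None)
    if slot is None:
        raise TypeError("object is not callable")
    return slot(*args)
```

Both dispatchers give the same results, but only the second one keeps
working unchanged when a new callable type is added.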

> > Here's what we could do:
> >
> > a) merge the semantics of unistr() into unicode()
> > b) apply the same semantics in str()
> > c) remove unistr() -- how's that for a short-lived builtin ;)
> >
> > About the semantics:
> >
> > These should be backward compatible with str() in that everything
> > that worked before should continue to work after the merge.
> >
> > A strawman for processing str() and unicode():
> >
> > 1. strings/Unicode is passed back as-is
> I hope you mean str() passes 8-bit strings back as-is, unicode()
> passes Unicode strings back as-is, right?

> > 2. tp_str is tried
> > 3. the method __str__ is tried
> Shouldn't have to -- instances should define tp_str and all the magic
> for calling __str__ should be there.  I don't understand why it's not
> done that way, probably just for historical reasons.  I also don't
> think __str__ should be tried for non-instance types.

> But, more seriously, I believe tp_str or __str__ shouldn't be tried at
> all by unicode().

Hmm, but how would you implement generic conversion to Unicode 
then ? 

We'll need some way for instances (and other types) to
provide a conversion to Unicode. Some time ago we discussed this
issue and came to the conclusion that tp_str should be allowed
to return Unicode data instead of inventing a new tp_unicode
slot for this purpose.

> > 4. the PyObject_AsCharBuffer() API is tried (bf_getcharbuffer)
> > 5. for str(): Unicode return values are converted to strings using
> >               the default encoding
> >    for unicode(): Unicode return values are passed back as-is;
> >               string return values are decoded according to the
> >               encoding parameter
> > 6. the return object is type-checked: str() will always return
> >    a string object, unicode() always a Unicode object
> >
> > Note that passing back Unicode is only allowed in case no encoding
> > was given. Otherwise an exception is raised: you can't decode
> > Unicode.
> >
> > As an extension, we could add encoding and error parameters to str()
> > as well. The result would be either an encoding of Unicode objects
> > passed back by tp_str or __str__ or a recoding of string objects
> > returned by checks 2, 3 or 4.
> Naaaah!

That would be nice for symmetry, and useful in light of making
Unicode the only string type in Py4k ;-)
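
For reference, the return-value handling from steps 5 and 6 of the
strawman, together with the "you can't decode Unicode" rule, could be
modeled roughly like this. This is a sketch in present-day Python where
bytes plays the role of the 8-bit string and str the role of Unicode;
the helper names and the "ascii" default encoding are assumptions made
for the example:

```python
DEFAULT_ENCODING = "ascii"  # assumption: stands in for the default encoding


def finish_str(result):
    """Steps 5 and 6 for str(): always hand back an 8-bit string."""
    if isinstance(result, str):
        # Unicode return values are converted using the default encoding.
        result = result.encode(DEFAULT_ENCODING)
    if not isinstance(result, bytes):
        raise TypeError("conversion did not return a string")
    return result


def finish_unicode(result, encoding=None, errors="strict"):
    """Steps 5 and 6 for unicode(): always hand back a Unicode object."""
    if isinstance(result, str):
        # Passing back Unicode is only allowed when no encoding was given.
        if encoding is not None:
            raise TypeError("decoding Unicode is not supported")
        return result
    if isinstance(result, bytes):
        # String return values are decoded according to the encoding.
        return result.decode(encoding or DEFAULT_ENCODING, errors)
    raise TypeError("conversion did not return a string or Unicode")
```

With encoding and errors parameters on str() as well, finish_str()
would take the same two extra arguments and use them in the encode
call, which is the symmetry mentioned above.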

> > If we agree to take this approach, then we should remove the
> > unistr() Python API before the alpha ships.
> Frankly, I believe we need more time to sort this out, and therefore I
> propose to remove the unistr() built-in before the release.  Marc,
> would you do the honors?


I'll remove the builtin and the docs, but will leave the
PyObject_Unicode() API enabled.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/