[Python-Dev] Re: Type-converting functions, esp. unicode() vs. unistr()

Guido van Rossum guido@digicool.com
Fri, 19 Jan 2001 15:44:53 -0500

> If we agree to merge the semantics of the two APIs, then str()
> would have to change too: is this desirable ? (IMHO, yes)

Not clear.  Which is why I'm backing off from my initial support for
merging the two.

I believe unicode() (which is really just an interface to
PyUnicode_FromEncodedObject()) currently already does too much.  In
particular this whole business with calling __str__ on instances seems
to me to be unnecessary.  I think it should *only* bother to look for
something that supports the buffer interface (checking for regular
strings only as a tiny optimization), or existing unicode objects.
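
A minimal sketch of that narrower behaviour, written in modern Python 3
terms where str stands in for Unicode objects and bytes for 8-bit
strings.  The name narrow_unicode() and the ASCII default encoding are
assumptions made for illustration; this is not the actual
PyUnicode_FromEncodedObject() code.

```python
def narrow_unicode(obj, encoding="ascii", errors="strict"):
    """Hypothetical model of a unicode() that never calls __str__."""
    # Existing Unicode objects are passed back as-is.  (How an explicit
    # encoding should interact with Unicode input is left out here.)
    if isinstance(obj, str):
        return obj
    # Regular 8-bit strings are checked first, purely as an optimization.
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors)
    # Anything else must support the buffer interface.
    try:
        view = memoryview(obj)
    except TypeError:
        raise TypeError(
            "expected a Unicode object or a buffer, got %s"
            % type(obj).__name__
        ) from None
    return view.tobytes().decode(encoding, errors)
```

Under this model an instance that only defines __str__ is rejected with
a TypeError instead of being converted implicitly.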

> Here's what we could do:
> a) merge the semantics of unistr() into unicode()
> b) apply the same semantics in str()
> c) remove unistr() -- how's that for a short-living builtin ;)
> About the semantics:
> These should be backward compatible to str() in that everything
> that worked before should continue to work after the merge.
> A strawman for processing str() and unicode():
> 1. strings/Unicode is passed back as-is

I hope you mean str() passes 8-bit strings back as-is, unicode()
passes Unicode strings back as-is, right?

> 2. tp_str is tried
> 3. the method __str__ is tried

Shouldn't have to -- instances should define tp_str and all the magic
for calling __str__ should be there.  I don't understand why it's not
done that way, probably just for historical reasons.  I also don't
think __str__ should be tried for non-instance types.

But, more seriously, I believe tp_str or __str__ shouldn't be tried at
all by unicode().

> 4. the PyObject_AsCharBuffer() API is tried (bf_getcharbuffer)
> 5. for str(): Unicode return values are converted to strings using
>               the default encoding
>    for unicode(): Unicode return values are passed back as-is;
>               string return values are decoded according to the
>               encoding parameter
> 6. the return object is type-checked: str() will always return
>    a string object, unicode() always a Unicode object
> Note that passing back Unicode is only allowed in case no encoding
> was given. Otherwise an exception is raised: you can't decode
> Unicode.
> As an extension we could add encoding and error parameters to str()
> as well. The result would be either an encoding of Unicode objects
> passed back by tp_str or __str__ or a recoding of string objects
> returned by checks 2, 3 or 4.
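
Steps 5 and 6 of the strawman, together with the note about decoding
Unicode, can be sketched roughly as follows.  This is again in modern
Python 3 terms (str for Unicode objects, bytes for 8-bit strings); the
names finish_str(), finish_unicode() and DEFAULT_ENCODING are
hypothetical, and the ASCII default is an assumption.

```python
DEFAULT_ENCODING = "ascii"  # assumed default encoding


def finish_unicode(result, encoding=None, errors="strict"):
    """Post-process whatever steps 1-4 produced, for unicode()."""
    if isinstance(result, str):
        # Passing back Unicode is only allowed when no encoding was given.
        if encoding is not None:
            raise TypeError("decoding Unicode is not supported")
        return result
    if isinstance(result, bytes):
        # String return values are decoded using the encoding parameter.
        return result.decode(encoding or DEFAULT_ENCODING, errors)
    # Step 6: the return object is type-checked.
    raise TypeError("expected a string or Unicode result")


def finish_str(result):
    """Post-process whatever steps 1-4 produced, for str()."""
    if isinstance(result, bytes):
        return result
    if isinstance(result, str):
        # Unicode return values are converted with the default encoding.
        return result.encode(DEFAULT_ENCODING)
    # Step 6: the return object is type-checked.
    raise TypeError("expected a string or Unicode result")
```

With these helpers, finish_str() always hands back an 8-bit string and
finish_unicode() always hands back a Unicode object, or raises.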


> If we agree to take this approach, then we should remove the
> unistr() Python API before the alpha ships.

Frankly, I believe we need more time to sort this out, and therefore I
propose to remove the unistr() built-in before the release.  Marc,
would you do the honors?

--Guido van Rossum (home page: http://www.python.org/~guido/)