[Python-Dev] Re: Type-converting functions, esp. unicode() vs. unistr()
Guido van Rossum
guido@digicool.com
Fri, 19 Jan 2001 15:44:53 -0500
> If we agree to merge the semantics of the two APIs, then str()
> would have to change too: is this desirable ? (IMHO, yes)
Not clear. Which is why I'm backing off from my initial support for
merging the two.
I believe unicode() (which is really just an interface to
PyUnicode_FromEncodedObject()) currently already does too much. In
particular this whole business with calling __str__ on instances seems
to me to be unnecessary. I think it should *only* bother to look for
something that supports the buffer interface (checking for regular
strings only as a tiny optimization), or existing unicode objects.
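To illustrate the narrower behaviour I have in mind, here is a rough sketch in present-day Python 3 terms, where str plays the role of the Unicode type and memoryview() stands in for the buffer interface. The name narrow_unicode and the ASCII default are assumptions for illustration only, not the actual implementation of PyUnicode_FromEncodedObject().

```python
def narrow_unicode(obj, encoding="ascii", errors="strict"):
    """Hypothetical model: accept text or buffer objects, never call __str__."""
    # Existing text objects are passed back unchanged.
    if isinstance(obj, str):
        return obj
    # Anything else must expose its raw bytes through the buffer protocol.
    try:
        view = memoryview(obj)
    except TypeError:
        raise TypeError(
            "expected text or a buffer-supporting object, got %s"
            % type(obj).__name__
        ) from None
    # Decode the raw bytes; __str__ is deliberately not consulted.
    return view.tobytes().decode(encoding, errors)
```

With this model, an instance that only defines __str__ is rejected with a TypeError instead of being silently converted.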
> Here's what we could do:
>
> a) merge the semantics of unistr() into unicode()
> b) apply the same semantics in str()
> c) remove unistr() -- how's that for a short-lived builtin ;)
>
> About the semantics:
>
> These should be backward compatible to str() in that everything
> that worked before should continue to work after the merge.
>
> A strawman for processing str() and unicode():
>
> 1. strings/Unicode is passed back as-is
I hope you mean str() passes 8-bit strings back as-is, unicode()
passes Unicode strings back as-is, right?
> 2. tp_str is tried
> 3. the method __str__ is tried
Shouldn't have to -- instances should define tp_str and all the magic
for calling __str__ should be there. I don't understand why it's not
done that way, probably just for historical reasons. I also don't
think __str__ should be tried for non-instance types.
But, more seriously, I believe tp_str or __str__ shouldn't be tried at
all by unicode().
> 4. the PyObject_AsCharBuffer() API is tried (bf_getcharbuffer)
> 5. for str(): Unicode return values are converted to strings using
> the default encoding
> for unicode(): Unicode return values are passed back as-is;
> string return values are decoded according to the
> encoding parameter
> 6. the return object is type-checked: str() will always return
> a string object, unicode() always a Unicode object
>
> Note that passing back Unicode is only allowed in case no encoding
> was given. Otherwise an exception is raised: you can't decode
> Unicode.
>
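To make steps 5 and 6 and the note above concrete, here is a rough sketch in present-day Python 3 terms, where bytes stands in for the 8-bit string type and str for the Unicode type. The helper names finish_str and finish_unicode and the DEFAULT_ENCODING constant are assumptions for illustration, not part of the proposal itself.

```python
DEFAULT_ENCODING = "ascii"  # assumption: models the interpreter-wide default


def finish_str(result):
    """Hypothetical steps 5-6 for str(): always hand back an 8-bit string."""
    if isinstance(result, str):
        # Unicode return values are encoded with the default encoding.
        result = result.encode(DEFAULT_ENCODING)
    if not isinstance(result, bytes):
        raise TypeError("str() must produce a string object")
    return result


def finish_unicode(result, encoding=None, errors="strict"):
    """Hypothetical steps 5-6 for unicode(): always hand back Unicode."""
    if isinstance(result, str):
        # Passing Unicode back is only allowed when no encoding was given.
        if encoding is not None:
            raise TypeError("decoding Unicode is not supported")
        return result
    if isinstance(result, bytes):
        # 8-bit string return values are decoded per the encoding parameter.
        return result.decode(encoding or DEFAULT_ENCODING, errors)
    raise TypeError("unicode() must produce a Unicode object")
```

In this model, finish_unicode(b"abc") decodes using the default encoding, while finish_unicode("abc", "utf-8") raises TypeError because there is nothing left to decode.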
> As an extension, we could add encoding and error parameters to str()
> as well. The result would be either an encoding of Unicode objects
> passed back by tp_str or __str__ or a recoding of string objects
> returned by checks 2, 3 or 4.
Naaaah!
> If we agree to take this approach, then we should remove the
> unistr() Python API before the alpha ships.
Frankly, I believe we need more time to sort this out, and therefore I
propose to remove the unistr() built-in before the release. Marc,
would you do the honors?
--Guido van Rossum (home page: http://www.python.org/~guido/)