[Python-Dev] Re: Type-converting functions, esp. unicode() vs. unistr()
M.-A. Lemburg
mal@lemburg.com
Fri, 19 Jan 2001 22:32:34 +0100
Guido van Rossum wrote:
>
> > If we agree to merge the semantics of the two APIs, then str()
> > would have to change too: is this desirable ? (IMHO, yes)
>
> Not clear. Which is why I'm backing off from my initial support for
> merging the two.
>
> I believe unicode() (which is really just an interface to
> PyUnicode_FromEncodedObject()) currently already does too much. In
> particular this whole business with calling __str__ on instances seems
> to me to be unnecessary. I think it should *only* bother to look for
> something that supports the buffer interface (checking for regular
> strings only as a tiny optimization), or existing unicode objects.
Hmm, unicode() should (just like str()) take an object and
convert it to a Unicode string. Since many objects don't
support the tp_str slot (instances don't for some reason -- just
like they don't support tp_call), I had to add some special cases
to make unicode() handle Python instances in the same way
str() does.
What I think is really needed is a concept for "stringification"
in Python. We currently have these schemes:
1. tp_str
2. method __str__ (not only of Python instances, but any object)
3. character buffer interface
These three could easily be unified into the tp_str slot:
e.g. tp_str could do the necessary magic to call __str__
or the buffer interface.
Note that the same is true for e.g. tp_call -- the special
cases we have in ceval.c for the different builtin callable
objects would not be necessary if they implemented tp_call.
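
The unification of the three stringification schemes described above
can be sketched in present-day Python roughly as follows. This is only
an illustration, not the CPython implementation: the function name
stringify() and the Point class are hypothetical, bytes stands in for
the 8-bit string type, and memoryview() stands in for the character
buffer interface. In current Python, type(obj).__str__ already covers
both the tp_str slot and a __str__ method, so schemes 1 and 2 collapse
into one lookup here.

```python
import array


def stringify(obj):
    """Hypothetical single entry point standing in for a unified tp_str."""
    # 8-bit strings (bytes) and Unicode strings (str) need no conversion.
    if isinstance(obj, (str, bytes)):
        return obj
    # Schemes 1 and 2: a __str__ defined somewhere other than on object.
    hook = type(obj).__str__
    if hook is not object.__str__:
        return hook(obj)
    # Scheme 3: fall back to the buffer interface.
    try:
        return bytes(memoryview(obj))
    except TypeError:
        raise TypeError(
            "cannot stringify %s object" % type(obj).__name__
        ) from None


class Point:
    """Hypothetical instance type that only provides __str__."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        return "(%d, %d)" % (self.x, self.y)


print(stringify(Point(1, 2)))
print(stringify(array.array("B", b"abc")))
```

Callers see one function; which of the schemes produced the result is
an internal detail, which is the point of folding them into one slot.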
> > Here's what we could do:
> >
> > a) merge the semantics of unistr() into unicode()
> > b) apply the same semantics in str()
> > c) remove unistr() -- how's that for a short-lived builtin ;)
> >
> > About the semantics:
> >
> > These should be backward compatible with str() in that everything
> > that worked before should continue to work after the merge.
> >
> > A strawman for processing str() and unicode():
> >
> > 1. strings/Unicode is passed back as-is
>
> I hope you mean str() passes 8-bit strings back as-is, unicode()
> passes Unicode strings back as-is, right?
Right.
> > 2. tp_str is tried
> > 3. the method __str__ is tried
>
> Shouldn't have to -- instances should define tp_str and all the magic
> for calling __str__ should be there. I don't understand why it's not
> done that way, probably just for historical reasons. I also don't
> think __str__ should be tried for non-instance types.
Ok.
> But, more seriously, I believe tp_str or __str__ shouldn't be tried at
> all by unicode().
Hmm, but how would you implement generic conversion to Unicode
then ?
We'll need some way for instances (and other types) to
provide a conversion to Unicode. Some time ago we discussed this
issue and came to the conclusion that tp_str should be allowed
to return Unicode data instead of inventing a new tp_unicode
slot for this purpose.
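
A rough model of that conclusion in present-day Python, assuming a
single string hook that may hand back either Unicode or 8-bit data.
The names to_unicode(), Name and Legacy are hypothetical, bytes stands
in for the 8-bit string type, and the "ascii" default is an assumption
made for this sketch.

```python
def to_unicode(obj, encoding="ascii", errors="strict"):
    """Hypothetical generic conversion built on one string hook."""
    if isinstance(obj, (str, bytes)):
        result = obj
    else:
        # Call the hook directly so it may return either str or bytes;
        # the built-in str() would reject a bytes return value.
        result = type(obj).__str__(obj)
    if isinstance(result, str):
        # Unicode data from the hook is passed back unchanged.
        return result
    if isinstance(result, bytes):
        # 8-bit data from the hook is decoded.
        return result.decode(encoding, errors)
    raise TypeError(
        "string hook returned %s, expected str or bytes"
        % type(result).__name__
    )


class Name:
    """Hypothetical type whose hook returns Unicode directly."""

    def __str__(self):
        return "Zo\u00eb"


class Legacy:
    """Hypothetical type whose hook returns UTF-8 encoded 8-bit data."""

    def __str__(self):
        return b"Zo\xc3\xab"


print(to_unicode(Name()) == to_unicode(Legacy(), "utf-8"))
```

Both classes use the same hook, so no second slot is needed; only the
caller has to cope with two possible return types.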
> > 4. the PyObject_AsCharBuffer() API is tried (bf_getcharbuffer)
> > 5. for str(): Unicode return values are converted to strings using
> > the default encoding
> > for unicode(): Unicode return values are passed back as-is;
> > string return values are decoded according to the
> > encoding parameter
> > 6. the return object is type-checked: str() will always return
> > a string object, unicode() always a Unicode object
> >
> > Note that passing back Unicode is only allowed in case no encoding
> > was given. Otherwise an exception is raised: you can't decode
> > Unicode.
> >
> > As an extension we could add encoding and error parameters to str()
> > as well. The result would be either an encoding of Unicode objects
> > passed back by tp_str or __str__ or a recoding of string objects
> > returned by checks 2, 3 or 4.
>
> Naaaah!
It would be nice for symmetry, and useful in light of making
Unicode the only string type in Py4k ;-)
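
For illustration, a str() with encoding and error parameters might
behave like the following present-day Python sketch. Everything here is
an assumption made for the example: the names to_bytes() and
DEFAULT_ENCODING, the use of bytes as the 8-bit string type, and the
"ascii" default. Recoding of 8-bit results is not spelled out in the
proposal, so this sketch returns them unchanged.

```python
DEFAULT_ENCODING = "ascii"  # assumption for this sketch


def to_bytes(obj, encoding=None, errors="strict"):
    """Hypothetical str() variant that always returns an 8-bit string."""
    if isinstance(obj, (str, bytes)):
        result = obj
    else:
        # The hook may return Unicode or 8-bit data.
        result = type(obj).__str__(obj)
    if isinstance(result, str):
        # Unicode results are encoded with the requested encoding,
        # falling back to the default encoding.
        return result.encode(encoding or DEFAULT_ENCODING, errors)
    if isinstance(result, bytes):
        # 8-bit results are passed back as-is (no recoding here).
        return result
    raise TypeError(
        "string hook returned %s, expected str or bytes"
        % type(result).__name__
    )


print(to_bytes("caf\u00e9", "utf-8"))
print(to_bytes("caf\u00e9", "ascii", "replace"))
```

The parameters mirror those of the decoding direction, which is the
symmetry argument: the same call shape works whichever string type is
the target.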
> > If we agree to take this approach, then we should remove the
> > unistr() Python API before the alpha ships.
>
> Frankly, I believe we need more time to sort this out, and therefore I
> propose to remove the unistr() built-in before the release. Marc,
> would you do the honors?
Ok.
I'll remove the builtin and the docs, but will leave the
PyObject_Unicode() API enabled.
--
Marc-Andre Lemburg
______________________________________________________________________
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/