[Python-Dev] Type-converting functions, esp. unicode() vs. unistr()
Ka-Ping Yee
ping@lfw.org
Thu, 18 Jan 2001 02:14:19 -0800 (PST)
I hope you don't mind that i'm taking this over to python-dev,
because it led me to discover a more general issue (see below).
For the others on python-dev, here's the background: MAL was
about to check in the unistr() function, described as follows:
> This patch adds a utility function unistr() which works just like
> the standard builtin str() -- only that the return value will
> always be a Unicode object.
>
> The patch also adds a new object level C API PyObject_Unicode()
> which complements PyObject_Str().
I responded:
> Why are unistr() and unicode() two separate functions?
>
> str() performs one task: convert to string. It can convert anything,
> including strings or Unicode strings, numbers, instances, etc.
>
> The other type-named functions e.g. int(), long(), float(), list(),
> tuple() are similar in intent.
>
> Why have unicode() just for converting strings to Unicode strings,
> and unistr() for converting everything else to a Unicode string?
> What does unistr(x) do differently from unicode(x) if x is a string?
MAL responded:
> unistr() is meant to complement str() very closely. unicode()
> works as constructor for Unicode objects which can also take
> care of decoding encoded data. str() and unistr() don't provide
> this capability but instead always assume the default encoding.
>
> There's also a subtle difference in that str() and unistr()
> try the tp_str slot which unicode() doesn't. unicode()
> supports any character buffer which str() and unistr() don't.
Okay, given this explanation, i still feel fairly confident
that unicode() should subsume unistr(). Many of the other
type-named functions try various slots:
int() looks for __int__
float() looks for __float__
long() looks for __long__
str() looks for __str__
In testing this i also discovered the following:
>>> class Foo:
... def __int__(self):
... return 3
...
>>> f = Foo()
>>> int(f)
3
>>> long(f)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: Foo instance has no attribute '__long__'
>>> float(f)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: Foo instance has no attribute '__float__'
This is kind of surprising. How about:
int() looks for __int__
float() looks for __float__, then tries __int__
long() looks for __long__, then tries __int__
str() looks for __str__
unicode() looks for __unicode__, then tries __str__
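The fallback step in the table above is easy to sketch. This is a hypothetical helper showing the proposed lookup order for float(), not the actual builtin machinery:

```python
def to_float(obj):
    # Proposed lookup order: try __float__ first, then fall back
    # to __int__.  (Hypothetical sketch of the proposal, not how
    # the real builtin float() is implemented.)
    for name in ('__float__', '__int__'):
        method = getattr(type(obj), name, None)
        if method is not None:
            return float(method(obj))
    raise TypeError('cannot convert %r to float' % (obj,))

class Foo:
    def __int__(self):
        return 3

print(to_float(Foo()))   # 3.0 under the proposed fallback,
                         # instead of the AttributeError above
```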
The extra parameter to unicode() is very similar to the extra
parameter to int(), so i think there is a natural parallel here.
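For illustration (using Python 3 names, where str() plays unicode()'s role), the parallel reads:

```python
# int() takes an optional base; unicode() (str() in Python 3) takes
# an optional encoding.  Both parameters give the conversion extra
# context about how to interpret the input.
print(int('ff', 16))                 # 255
print(str(b'caf\xc3\xa9', 'utf-8'))  # café
```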
Hmm... what about the other types?
Wow!! __complex__ can produce a segfault!
>>> complex
<built-in function complex>
>>> class Foo:
... def __complex__(self): return 3
...
>>> Foo()
<__main__.Foo instance at 0x81e8684>
>>> f = _
>>> complex(f)
Segmentation fault (core dumped)
This happens because builtin_complex first retrieves and saves
the PyNumberMethods of the argument (in this case, from the
instance), then tries to call __complex__ (in this case, returning 3),
and THEN coerces the result using nbr->nb_float if the result is
not complex! (This calls the instance's nb_float method on the
integer object 3!!)
I think complex() should probably look for __complex__, then
__float__, then __int__.
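That order can be sketched as follows. The crucial point is that the returned value is coerced with the plain complex() constructor, never with the original object's number slots, which is what crashed builtin_complex above. (Hypothetical sketch of the proposal, not the actual CPython fix.)

```python
def to_complex(obj):
    # Proposed safe order: __complex__, then __float__, then __int__.
    # Coerce the *result* with the builtin complex(), so a method that
    # returns an int (like Foo.__complex__ above) is handled correctly.
    for name in ('__complex__', '__float__', '__int__'):
        method = getattr(type(obj), name, None)
        if method is not None:
            return complex(method(obj))
    raise TypeError('cannot convert %r to complex' % (obj,))

class Foo:
    def __complex__(self):
        return 3   # returns an int, not a complex

print(to_complex(Foo()))   # (3+0j), instead of a segfault
```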
One could argue for __list__, __tuple__, or __dict__, but that
seems much weaker; the Pythonic way has always been to implement
__getitem__ instead. There is no built-in dict(); if it existed,
i suppose it would do the opposite of x.items(). Again, a weak
argument, though i might have found such a function useful once
or twice.
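For the record, such a hypothetical dict() builtin is only a few lines:

```python
def make_dict(pairs):
    # Hypothetical dict() builtin: the inverse of x.items(),
    # building a dictionary from a sequence of (key, value) pairs.
    d = {}
    for key, value in pairs:
        d[key] = value
    return d

x = {'a': 1, 'b': 2}
print(make_dict(x.items()) == x)   # True
```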
And that about covers the built-in types for data.
-- ?!ng