[Python-Dev] Type-converting functions, esp. unicode() vs. unistr()

Ka-Ping Yee ping@lfw.org
Thu, 18 Jan 2001 02:14:19 -0800 (PST)


I hope you don't mind that i'm taking this over to python-dev,
because it led me to discover a more general issue (see below).

For the others on python-dev, here's the background: MAL was
about to check in the unistr() function, described as follows:

> This patch adds a utility function unistr() which works just like
> the standard builtin str()  -- only that the return value will
> always be a Unicode object.
> 
> The patch also adds a new object level C API PyObject_Unicode()
> which complements PyObject_Str().

I responded:
> Why are unistr() and unicode() two separate functions?
> 
> str() performs one task: convert to string.  It can convert anything,
> including strings or Unicode strings, numbers, instances, etc.
> 
> The other type-named functions e.g. int(), long(), float(), list(),
> tuple() are similar in intent.
> 
> Why have unicode() just for converting strings to Unicode strings,
> and unistr() for converting everything else to a Unicode string?
> What does unistr(x) do differently from unicode(x) if x is a string?

MAL responded:
> unistr() is meant to complement str() very closely. unicode()
> works as constructor for Unicode objects which can also take
> care of decoding encoded data. str() and unistr() don't provide
> this capability but instead always assume the default encoding.
> 
> There's also a subtle difference in that str() and unistr() 
> try the tp_str slot which unicode() doesn't. unicode()
> supports any character buffer which str() and unistr() don't.

Okay, given this explanation, i still feel fairly confident
that unicode() should subsume unistr().  Many of the other
type-named functions try various slots:

    int() looks for __int__
    float() looks for __float__
    long() looks for __long__
    str() looks for __str__

In testing this i also discovered the following:

    >>> class Foo:
    ...     def __int__(self):
    ...         return 3
    ... 
    >>> f = Foo()
    >>> int(f)
    3
    >>> long(f) 
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: Foo instance has no attribute '__long__'
    >>> float(f)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: Foo instance has no attribute '__float__'

This is kind of surprising: int() accepts the instance, but long()
and float() do not fall back to __int__.  How about:

    int() looks for __int__
    float() looks for __float__, then tries __int__
    long() looks for __long__, then tries __int__
    str() looks for __str__
    unicode() looks for __unicode__, then tries __str__
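
The proposed lookup order can be sketched as a small table plus a
helper.  This is only an illustration, not how the builtins are
implemented: FALLBACKS, find_hook and proposed_float are hypothetical
names, and the sketch assumes a current Python 3 interpreter, where
long() and unicode() no longer exist as builtins.

```python
# Hypothetical table of the proposed lookup order:
# conversion name -> special methods to try, in order.
FALLBACKS = {
    "int": ("__int__",),
    "float": ("__float__", "__int__"),
    "long": ("__long__", "__int__"),
    "str": ("__str__",),
    "unicode": ("__unicode__", "__str__"),
}


def find_hook(obj, target):
    """Return the name of the first hook in the chain that obj's class defines."""
    for name in FALLBACKS[target]:
        if getattr(type(obj), name, None) is not None:
            return name
    raise TypeError("no conversion to %s for %s" % (target, type(obj).__name__))


def proposed_float(obj):
    """float() as proposed: use __float__ if present, else fall back to __int__."""
    hook = getattr(type(obj), find_hook(obj, "float"))
    return float(hook(obj))


class Foo:
    def __int__(self):
        return 3


f = Foo()
```

Under this sketch proposed_float(f) gives 3.0 through the __int__
fallback instead of raising AttributeError.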

The extra parameter to unicode() is very similar to the extra
parameter to int(), so i think there is a natural parallel here.
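
To make the parallel concrete: in both cases the optional second
argument says how to interpret the first.  The sketch below assumes a
current Python 3 interpreter as a stand-in, since unicode() is not a
builtin there; str(data, encoding) plays the role that
unicode(data, encoding) plays in this thread.

```python
# Second argument = the base in which to read the digits.
number = int("ff", 16)

# Second argument = the encoding with which to decode the bytes.
# In Python 3, str(bytes, encoding) stands in for unicode(string, encoding).
text = str(b"caf\xc3\xa9", "utf-8")
```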

Hmm... what about the other types?

Wow!!  __complex__ can produce a segfault!

    >>> complex
    <built-in function complex>
    >>> class Foo:
    ...   def __complex__(self): return 3
    ... 
    >>> Foo()
    <__main__.Foo instance at 0x81e8684>
    >>> f = _
    >>> complex(f)
    Segmentation fault (core dumped)

This happens because builtin_complex first retrieves and saves
the PyNumberMethods of the argument (in this case, from the
instance), then tries to call __complex__ (in this case, returning 3),
and THEN coerces the result using nbr->nb_float if the result is
not complex!  (This calls the instance's nb_float method on the
integer object 3!!)
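
The order of operations described above can be modelled in pure
Python.  This is a rough sketch, not the C source of builtin_complex:
buggy_complex and fixed_complex are hypothetical names, type(obj)
stands in for the saved PyNumberMethods pointer, and Foo is given a
__float__ method (an addition for this example) so that the wrong
call has something to land on.

```python
class Foo:
    def __init__(self):
        self.value = 1.5

    def __complex__(self):
        return 3  # deliberately not a complex number

    def __float__(self):
        return self.value  # only valid when self really is a Foo


def buggy_complex(obj):
    nbr = type(obj)  # saved before the call, like the saved method table
    result = nbr.__complex__(obj)
    if not isinstance(result, complex):
        # Wrong: nbr belongs to the original argument, not to the result.
        result = complex(nbr.__float__(result))
    return result


def fixed_complex(obj):
    result = type(obj).__complex__(obj)
    if not isinstance(result, complex):
        # Look up the float conversion on the result itself.
        result = complex(type(result).__float__(result))
    return result
```

In this model buggy_complex(Foo()) ends up running Foo.__float__ on
the integer 3 and fails with an AttributeError; C has no such check,
which is consistent with the crash shown above.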

I think complex() should probably look for __complex__, then
__float__, then __int__.

One could argue for __list__, __tuple__, or __dict__, but that
seems much weaker; the Pythonic way has always been to implement
__getitem__ instead.  There is no built-in dict(); if it existed
i suppose it would do the opposite of x.items(); again a weak
argument, though i might have found such a function useful once
or twice.
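
A minimal sketch of what such a function might do, assuming it simply
inverts x.items().  dict_from_items is a hypothetical name, not an
existing builtin.

```python
def dict_from_items(pairs):
    """Hypothetical stand-in for a dict() builtin: the inverse of x.items()."""
    result = {}
    for key, value in pairs:
        result[key] = value
    return result


original = {"spam": 1, "eggs": 2}
rebuilt = dict_from_items(original.items())
```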

And that about covers the built-in types for data.


-- ?!ng