[Python-Dev] Type-converting functions, esp. unicode() vs. unistr()

Guido van Rossum guido@digicool.com
Thu, 18 Jan 2001 21:17:36 -0500


> I hope you don't mind that i'm taking this over to python-dev,
> because it led me to discover a more general issue (see below).

No -- in fact I wanted to see this here!  (My mail backlog seems to be
clearing -- or maybe it was only a temporary unclogging... :-)

> For the others on python-dev, here's the background: MAL was
> about to check in the unistr() function, described as follows:
> 
> > This patch adds a utility function unistr() which works just like
> > the standard builtin str()  -- only that the return value will
> > always be a Unicode object.
> > 
> > The patch also adds a new object level C API PyObject_Unicode()
> > which complements PyObject_Str().
> 
> I responded:
> > Why are unistr() and unicode() two separate functions?
> > 
> > str() performs one task: convert to string.  It can convert anything,
> > including strings or Unicode strings, numbers, instances, etc.
> > 
> > The other type-named functions e.g. int(), long(), float(), list(),
> > tuple() are similar in intent.
> > 
> > Why have unicode() just for converting strings to Unicode strings,
> > and unistr() for converting everything else to a Unicode string?
> > What does unistr(x) do differently from unicode(x) if x is a string?
> 
> MAL responded:
> > unistr() is meant to complement str() very closely. unicode()
> > works as constructor for Unicode objects which can also take
> > care of decoding encoded data. str() and unistr() don't provide
> > this capability but instead always assume the default encoding.
> > 
> > There's also a subtle difference in that str() and unistr() 
> > try the tp_str slot which unicode() doesn't. unicode()
> > supports any character buffer which str() and unistr() don't.
> 
> Okay, given this explanation, i still feel fairly confident
> that unicode() should subsume unistr().  Many of the other
> type-named functions try various slots:
> 
>     int() looks for __int__
>     float() looks for __float__
>     long() looks for __long__
>     str() looks for __str__
> 
> In testing this i also discovered the following:
> 
>     >>> class Foo:
>     ...     def __int__(self):
>     ...         return 3
>     ... 
>     >>> f = Foo()
>     >>> int(f)
>     3
>     >>> long(f) 
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     AttributeError: Foo instance has no attribute '__long__'
>     >>> float(f)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     AttributeError: Foo instance has no attribute '__float__'
> 
> This is kind of surprising.  How about:
> 
>     int() looks for __int__
>     float() looks for __float__, then tries __int__
>     long() looks for __long__, then tries __int__
>     str() looks for __str__
>     unicode() looks for __unicode__, then tries __str__

For the numeric types this could perhaps be done by calling
PyNumber_Long() from PyNumber_Float(), and PyNumber_Int() from
PyNumber_Long().  Complex is a bit of an exception -- there's no
PyNumber_Complex(), just because I felt that nobody would need it. :-)
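
The chained fallback can be sketched in pure Python.  This is only an
illustration in current Python syntax, not the C change itself:
as_int(), as_long() and as_float() are hypothetical stand-ins for
PyNumber_Int(), PyNumber_Long() and PyNumber_Float(), and since current
Python has no separate long type, as_long() simply returns an int.

```python
def as_int(obj):
    """Hypothetical stand-in for PyNumber_Int(): only __int__ is tried."""
    method = getattr(type(obj), "__int__", None)
    if method is None:
        raise TypeError("no __int__ on %s" % type(obj).__name__)
    return method(obj)


def as_long(obj):
    """Stand-in for PyNumber_Long(): try __long__, then fall back to as_int()."""
    method = getattr(type(obj), "__long__", None)
    if method is not None:
        return method(obj)
    return as_int(obj)


def as_float(obj):
    """Stand-in for PyNumber_Float(): try __float__, then fall back to as_long()."""
    method = getattr(type(obj), "__float__", None)
    if method is not None:
        return method(obj)
    return float(as_long(obj))


class Foo:
    def __int__(self):
        return 3


f = Foo()
print(as_int(f), as_long(f), as_float(f))  # prints: 3 3 3.0
```

With this chaining, a class that defines only __int__ converts through
all three helpers instead of raising AttributeError for two of them.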

> The extra parameter to unicode() is very similar to the extra
> parameter to int(), so i think there is a natural parallel here.

Makes sense.
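
The parallel can be shown side by side.  The sketch below uses current
Python, where str() plays the role that unicode() plays in this thread;
that substitution is an assumption made so the example runs today, and
the variable names are illustrative only.

```python
# int() takes an optional base that says how to read the digits.
number = int("ff", 16)

# str() takes an optional encoding that says how to decode the bytes.
text = str(b"caf\xc3\xa9", "utf-8")

print(number, len(text))  # prints: 255 4
```

In both calls the second argument describes how to interpret the raw
input, not what type to produce.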

> Hmm... what about the other types?
> 
> Wow!!  __complex__ can produce a segfault!
> 
>     >>> complex
>     <built-in function complex>
>     >>> class Foo:
>     ...   def __complex__(self): return 3
>     ... 
>     >>> Foo()
>     <__main__.Foo instance at 0x81e8684>
>     >>> f = _
>     >>> complex(f)
>     Segmentation fault (core dumped)
> 
> This happens because builtin_complex first retrieves and saves
> the PyNumberMethods of the argument (in this case, from the
> instance), then tries to call __complex__ (in this case, returning 3),
> and THEN coerces the result using nbr->nb_float if the result is
> not complex!  (This calls the instance's nb_float method on the
> integer object 3!!)

Thanks!  Fixed now in CVS.
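
The pattern behind the crash can be imitated in pure Python.  This is a
rough sketch, not the builtin_complex code: buggy_complex() and
fixed_complex() are hypothetical helpers, and Foo is given a __float__
method (the class in the report had none) so that the misapplied
conversion has something to call.  Python raises an exception where the
C code crashed.

```python
class Foo:
    """Hypothetical class whose __complex__ returns a non-complex value."""

    def __init__(self):
        self.value = 1.5

    def __float__(self):
        return self.value          # only makes sense when self is a Foo

    def __complex__(self):
        return 3                   # an int, not a complex


def buggy_complex(obj):
    saved_float = type(obj).__float__     # saved from the argument
    result = obj.__complex__()
    if not isinstance(result, complex):
        # Wrong: applies the argument's method to the result (the int 3).
        result = complex(saved_float(result))
    return result


def fixed_complex(obj):
    result = obj.__complex__()
    if not isinstance(result, complex):
        # Right: convert the result using the result's own conversion.
        result = complex(float(result))
    return result


try:
    buggy_complex(Foo())
except AttributeError as exc:
    print("buggy:", exc)

print("fixed:", fixed_complex(Foo()))  # prints: fixed: (3+0j)
```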

> I think complex() should probably look for __complex__, then
> __float__, then __int__.

I made it call PyNumber_Float(), which could be made smarter as
explained above.

> One could argue for __list__, __tuple__, or __dict__, but that
> seems much weaker; the Pythonic way has always been to implement
> __getitem__ instead.

Yes -- since __list__ etc. aren't used, let's not add them.
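
A small sketch of why __getitem__ is enough: list() and tuple() fall
back to calling __getitem__ with 0, 1, 2, ... until IndexError is
raised.  Countdown is a hypothetical class made up for this example.

```python
class Countdown:
    """Hypothetical sequence-like class that defines only __getitem__."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        if not 0 <= index < self.start:
            raise IndexError(index)    # signals the end of the sequence
        return self.start - index


c = Countdown(3)
print(list(c), tuple(c))  # prints: [3, 2, 1] (3, 2, 1)
```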

> There is no built-in dict(); if it existed
> i suppose it would do the opposite of x.items(); again a weak
> argument, though i might have found such a function useful once
> or twice.

Yeah, it's not very common.  Dict comprehensions anyone?

    d = {k:v for k,v in zip(range(10), range(10))}    # :-)
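
In current Python both ideas can be tried directly: a dict() builtin
accepts a sequence of (key, value) pairs, which makes it the opposite
of x.items(), and the comprehension is valid syntax.  The sketch below
assumes such a version; the names pairs and rebuilt are illustrative.

```python
# Build a dictionary with a comprehension.
d = {k: v for k, v in zip(range(10), range(10))}

# items() yields (key, value) pairs; dict() turns them back into a dict.
pairs = list(d.items())
rebuilt = dict(pairs)

print(rebuilt == d)  # prints: True
```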

> And that about covers the built-in types for data.

Thanks!

--Guido van Rossum (home page: http://www.python.org/~guido/)