unicode() vs. s.decode()

Wed Aug 5 11:53:56 EDT 2009

On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> Both of these expressions are equivalent, but which is faster, or should
> one be preferred for some other reason?
>
> u = unicode(s,'utf-8')
> 
> u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python.  I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:

        >>> import dis
        >>> dis.dis(lambda s: s.decode('utf-8'))
          1           0 LOAD_FAST                0 (s)
                      3 LOAD_ATTR                0 (decode)
                      6 LOAD_CONST               0 ('utf-8')
                      9 CALL_FUNCTION            1
                     12 RETURN_VALUE        
        >>> dis.dis(lambda s: unicode(s, 'utf-8'))
          1           0 LOAD_GLOBAL              0 (unicode)
                      3 LOAD_FAST                0 (s)
                      6 LOAD_CONST               0 ('utf-8')
                      9 CALL_FUNCTION            2
                     12 RETURN_VALUE      
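
If reading two listings side by side gets tedious, the opcode names can
also be collected and compared programmatically.  This is only a sketch,
and it assumes Python 3.4 or later for dis.get_instructions(); opcode names
differ between interpreter versions, so the output will not match the
Python 2 listings above exactly.  The helper name opnames is made up for
illustration.

```python
import dis


def opnames(func):
    """Return the list of opcode names in func's bytecode."""
    return [ins.opname for ins in dis.get_instructions(func)]


# The lambdas are only compiled, never called, so the missing
# unicode builtin on Python 3 does not raise a NameError here.
method_ops = opnames(lambda s: s.decode('utf-8'))
constructor_ops = opnames(lambda s: unicode(s, 'utf-8'))

# Opcodes that appear in one form but not in the other.
only_in_method = sorted(set(method_ops) - set(constructor_ops))
only_in_constructor = sorted(set(constructor_ops) - set(method_ops))

print("only in s.decode(...):", only_in_method)
print("only in unicode(...): ", only_in_constructor)
```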

The presence of LOAD_ATTR in the first form hints that it is probably
going to be slower, although the second form pays for a LOAD_GLOBAL
instead, so the bytecode alone is not conclusive.  Next, actually try it:

        >>> import timeit
        >>> timeit.timeit('"foobarbaz".decode("utf-8")')
        1.698289155960083
        >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
        0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster, by roughly a factor of three in
this run.
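
A single timeit call can be noisy, so a slightly more careful (but still
naive) version repeats each measurement and keeps the fastest run.  The
sketch below is written to run on both Python 2 and Python 3; text_type is
a hypothetical alias (unicode on Python 2, str on Python 3), and best_time
and compare are names invented for this example.  Absolute numbers, and
possibly even the ranking, will vary with the interpreter version and the
machine.

```python
import timeit

# Setup shared by both statements.  text_type is an illustrative alias:
# unicode on Python 2, str on Python 3 (where unicode no longer exists).
SETUP = """
data = b"foobarbaz"
try:
    text_type = unicode
except NameError:
    text_type = str
"""


def best_time(stmt, number=100000, repeat=5):
    """Return the fastest of `repeat` runs, each executing stmt `number` times."""
    return min(timeit.repeat(stmt, setup=SETUP, repeat=repeat, number=number))


def compare(number=100000, repeat=5):
    """Time the method form and the constructor form; return both results."""
    method = best_time('data.decode("utf-8")', number, repeat)
    constructor = best_time('text_type(data, "utf-8")', number, repeat)
    return method, constructor


if __name__ == "__main__":
    method, constructor = compare()
    print("data.decode('utf-8'):     %.4f s" % method)
    print("text_type(data, 'utf-8'): %.4f s" % constructor)
```

Taking the minimum rather than the mean is the usual choice here, since
slower runs mostly reflect interference from other processes.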

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because, as you pointed out, it is cleaner and more readable.
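
To put the tight-loop remark in numbers, the timeit totals above can be
turned into a per-call cost.  This sketch assumes timeit's default of
1,000,000 iterations per call, which the session above did not override;
the helper name extra_seconds is made up for illustration.

```python
# Totals reported by the two timeit.timeit() calls above.
DECODE_TOTAL = 1.698289155960083
UNICODE_TOTAL = 0.53305888175964355

# timeit.timeit() runs the statement 1,000,000 times unless told otherwise.
NUMBER = 1000000

# Extra cost of one s.decode('utf-8') call compared with unicode(s, 'utf-8').
per_call_diff = (DECODE_TOTAL - UNICODE_TOTAL) / NUMBER


def extra_seconds(calls):
    """Estimate the extra time spent by using the slower form `calls` times."""
    return calls * per_call_diff


print("%.2f us extra per call" % (per_call_diff * 1e6))        # 1.17 us
print("%.3f s extra for 50000 calls" % extra_seconds(50000))   # 0.058 s
```

At a little over a microsecond per call on the machine measured above, the
difference only becomes noticeable once the call count is very large.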

Cheers,
Jason.