unicode() vs. s.decode()

Wed Aug 5 21:31:56 EDT 2009

Jason Tackaberry <tack <at> urandom.ca> writes:
> On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> > Both of these expressions are equivalent, but which is faster, or is there
> > any reason to prefer one?
> > u = unicode(s,'utf-8')
> > u = s.decode('utf-8') # looks nicer
> 
> It is sometimes non-obvious which constructs are faster than others in
> Python.  I also regularly have these questions, but it's pretty easy to
> run quick (albeit naive) benchmarks to see.
> 
> The first thing to try is to have a look at the bytecode for each:
[snip] 
> The presence of LOAD_ATTR in the first form hints that this is probably
> going to be slower.   Next, actually try it:
> 
>         >>> import timeit
>         >>> timeit.timeit('"foobarbaz".decode("utf-8")')
>         1.698289155960083
>         >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
>         0.53305888175964355
> 
> So indeed, unicode(s, 'utf-8') is faster by a fair margin.

Faster by an enormous margin (more than 3x); attributing this to the cost of an
attribute lookup seems implausible.
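
One rough way to check that claim is to time the bare attribute lookup on its
own and compare it with the full method call. The sketch below is only an
illustration: the names SETUP and lookup_vs_call are made up here, and the
bytes literal is used so that the same code also runs on Python 3. If the
lookup turns out to be a small fraction of the call, it cannot account for the
gap by itself.

```python
import timeit

# Hypothetical helper for this sketch; not taken from the thread.
SETUP = "s = b'foobarbaz'"


def lookup_vs_call(number=200000):
    """Return (lookup_seconds, call_seconds) for s.decode."""
    # Only fetch the bound method (LOAD_ATTR), without calling it.
    lookup = timeit.timeit("s.decode", setup=SETUP, number=number)
    # Fetch the bound method and call it, as in the quoted benchmark.
    call = timeit.timeit("s.decode('utf-8')", setup=SETUP, number=number)
    return lookup, call


if __name__ == "__main__":
    lookup, call = lookup_vs_call()
    print("lookup only: %.3f  full call: %.3f" % (lookup, call))
```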

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c
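
Suggestion (1) could be sketched roughly as follows. The names ENCODINGS, CTOR
and time_encodings are invented for this example, and it differs from the
quoted benchmark in that the byte string is created once in the timeit setup
rather than as a literal in the timed statement. On Python 2 it times
unicode(); on Python 3, where unicode() does not exist, it falls back to
str(bytes, encoding), so those numbers say nothing about Python 2 behaviour.
If the gap between the two forms shrinks or disappears for the alternate
spellings, that would point at the exact encoding name mattering rather than
the attribute lookup.

```python
import timeit

# Hypothetical names for this sketch; none of them come from the thread.
ENCODINGS = ["utf-8", "utf8", "utf_8", "cp1252"]

# On Python 2, bytes is an alias of str, so this picks unicode();
# on Python 3 the closest equivalent is str(bytes, encoding).
CTOR = "unicode" if str is bytes else "str"

SETUP = "s = b'foobarbaz'"


def time_encodings(encodings=ENCODINGS, number=200000):
    """Return {encoding: (method_seconds, constructor_seconds)}."""
    results = {}
    for enc in encodings:
        # Method form: s.decode('<enc>')
        method = timeit.timeit("s.decode(%r)" % enc,
                               setup=SETUP, number=number)
        # Constructor form: unicode(s, '<enc>') or str(s, '<enc>')
        ctor = timeit.timeit("%s(s, %r)" % (CTOR, enc),
                             setup=SETUP, number=number)
        results[enc] = (method, ctor)
    return results


if __name__ == "__main__":
    for enc, (method, ctor) in sorted(time_encodings().items()):
        print("%-8s s.decode: %.3f  %s(): %.3f" % (enc, method, CTOR, ctor))
```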

HTH,
John