unicode() vs. s.decode()
John Machin
sjmachin at lexicon.net
Wed Aug 5 21:31:56 EDT 2009
Jason Tackaberry <tack <at> urandom.ca> writes:
> On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> > Both of these expressions are equivalent, but which is faster, or should
> > one be preferred for some other reason?
> > u = unicode(s,'utf-8')
> > u = s.decode('utf-8') # looks nicer
>
> It is sometimes non-obvious which constructs are faster than others in
> Python. I also regularly have these questions, but it's pretty easy to
> run quick (albeit naive) benchmarks to see.
>
> The first thing to try is to have a look at the bytecode for each:
[snip]
> The presence of LOAD_ATTR in the s.decode() form hints that it is probably
> going to be slower. Next, actually try it:
>
> >>> import timeit
> >>> timeit.timeit('"foobarbaz".decode("utf-8")')
> 1.698289155960083
> >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
> 0.53305888175964355
>
> So indeed, unicode(s, 'utf-8') is faster by a fair margin.
Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.
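One rough way to check this is to time the attribute lookup on its own and compare it with the full call. The following is only a sketch: the names SETUP and N are mine, the iteration count is arbitrary, and the absolute numbers will vary by machine and Python version.

```python
import timeit

# Illustrative values, not from the original thread.
SETUP = 's = b"foobarbaz"'
N = 200000

# Look up the bound method but never call it.
lookup_only = timeit.timeit('s.decode', setup=SETUP, number=N)

# Look up the method and call it, which adds the actual decoding work.
lookup_and_call = timeit.timeit('s.decode("utf-8")', setup=SETUP, number=N)

print("lookup only:     %.4fs" % lookup_only)
print("lookup and call: %.4fs" % lookup_and_call)
```

If the lookup-only figure is a small fraction of the gap between the two forms, the lookup cannot be the main cause.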
Suggested further avenues of investigation:
(1) Try the timing again with "cp1252" and "utf8" and "utf_8"
(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c
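For (1), something along these lines would do. It is a sketch, not a careful benchmark: time_spellings, DATA and text_type are names I made up, and each call goes through a lambda, which adds the same overhead to both forms. text_type resolves to unicode on Python 2 and to str on Python 3, so the script runs on either.

```python
import timeit

# The text type on either major version: `unicode` on 2.x, `str` on 3.x.
text_type = type(u"")
DATA = b"foobarbaz"


def time_spellings(names=("utf-8", "utf8", "utf_8", "cp1252"), number=100000):
    """Return {encoding_name: (method_seconds, constructor_seconds)}."""
    results = {}
    for name in names:
        # Method form: s.decode(name)
        method = timeit.timeit(lambda: DATA.decode(name), number=number)
        # Constructor form: unicode(s, name) on 2.x, str(s, name) on 3.x
        ctor = timeit.timeit(lambda: text_type(DATA, name), number=number)
        results[name] = (method, ctor)
    return results


if __name__ == "__main__":
    for name, (method, ctor) in sorted(time_spellings().items()):
        print("%-7s s.decode: %.3fs   constructor: %.3fs" % (name, method, ctor))
```

Compare how the gap between the two columns changes from one spelling or codec to the next.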
HTH,
John