unicode() vs. s.decode()
Jason Tackaberry
tack at urandom.ca
Wed Aug 5 11:53:56 EDT 2009
On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> These both expressions are equivalent but which is faster or should be used
> for any reason?
>
> u = unicode(s,'utf-8')
>
> u = s.decode('utf-8') # looks nicer
It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.
The first thing to try is to have a look at the bytecode for each:
>>> import dis
>>> dis.dis(lambda s: s.decode('utf-8'))
  1           0 LOAD_FAST                0 (s)
              3 LOAD_ATTR                0 (decode)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            1
             12 RETURN_VALUE
>>> dis.dis(lambda s: unicode(s, 'utf-8'))
  1           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (s)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            2
             12 RETURN_VALUE
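The same comparison can be scripted instead of read off by eye. This is only a rough sketch and rests on assumptions not in the session above: it targets Python 3.4+ (for dis.get_instructions), it uses str(b, 'utf-8') as a stand-in for unicode(s, 'utf-8'), and the helper name opnames is made up for illustration. Opcode names differ between interpreter versions, so the output will not match the listing above exactly.

```python
import dis


def opnames(func):
    """Return the opcode names of func's bytecode, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]


# Method form first, constructor form second, as in the session above.
# str(b, "utf-8") is the assumed Python 3 counterpart of unicode(s, "utf-8").
method_form = lambda b: b.decode("utf-8")
constructor_form = lambda b: str(b, "utf-8")

print(opnames(method_form))
print(opnames(constructor_form))
```

On any version the method form should show an attribute or method lookup, and the constructor form a global lookup.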
The presence of LOAD_ATTR in the first form hints that this is probably
going to be slower. Next, actually try it:
>>> import timeit
>>> timeit.timeit('"foobarbaz".decode("utf-8")')
1.698289155960083
>>> timeit.timeit('unicode("foobarbaz", "utf-8")')
0.53305888175964355
So indeed, unicode(s, 'utf-8') is faster by a fair margin (roughly three
times in this run).
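A single timeit call is noisy, so a slightly less naive version repeats the measurement and keeps the best run. The following is a sketch under a few assumptions: the names text_type, via_constructor, via_method and best_of are made up here, and the version check is only there so that the script also runs on Python 3, where str(b, 'utf-8') takes the place of unicode(s, 'utf-8'). Wrapping each form in a function adds call overhead to both, so the absolute numbers are not comparable with the ones above.

```python
import sys
import timeit

# unicode on Python 2, str on Python 3. The conditional is lazy, so the
# name `unicode` is never looked up on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

DATA = b"foobarbaz"  # same payload as the session above


def via_constructor(raw=DATA):
    """Decode with the type constructor: unicode(s, 'utf-8')."""
    return text_type(raw, "utf-8")


def via_method(raw=DATA):
    """Decode with the method: s.decode('utf-8')."""
    return raw.decode("utf-8")


def best_of(func, repeat=5, number=100000):
    """Return the fastest of `repeat` runs, each timing `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


if __name__ == "__main__":
    print("constructor: %.4f s" % best_of(via_constructor))
    print("method:      %.4f s" % best_of(via_method))
```

Taking the minimum rather than the mean is the usual choice here, since slower runs mostly reflect other activity on the machine.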
On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because it's, as you pointed out, cleaner and more readable code.
Cheers,
Jason.