unicode() vs. s.decode()

Sat Aug 8 12:02:48 EDT 2009

Michael Ströder wrote:
 > >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
 > 17.23644495010376
 > >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
 > 72.087096929550171
 >
 > That is significant! So the winner is:
 >
 > unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :-)

unicode() becomes *slower* than str.decode() when you spell the encoding
"UTF-8" in uppercase, or use an entirely different codec, say "cp1252":

   >>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(1000000)
   2.5777881145477295
   >>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(1000000)
   1.8430399894714355
   >>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(1000000)
   2.3622498512268066
   >>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(1000000)
   1.7812771797180176

The reason seems to be that unicode() bypasses codecs.lookup() if the
encoding is spelled exactly "utf-8", "latin-1", "mbcs", or "ascii". OTOH,
str.decode() always calls codecs.lookup().
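
As a rough illustration of what the lookup step does, the sketch below asks
codecs.lookup() for a few spellings of the same encoding and prints the
canonical name it resolves to. The helper name canonical_name() is made up
for this example; it only shows the name normalization, not the shortcut
inside unicode() itself.

```python
import codecs

def canonical_name(spelling):
    """Hypothetical helper: return the codec name that lookup() resolves to."""
    return codecs.lookup(spelling).name

# Different spellings end up at the same codec, but only after
# codecs.lookup() has normalized the name and searched the registry.
for spelling in ("utf-8", "UTF-8", "utf8"):
    print("%-6s -> %s" % (spelling, canonical_name(spelling)))
```

All three spellings should resolve to the same codec, which fits the
observation that only the spelling changes the timing, not the result.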

If speed is your primary concern, this will give you even better 
performance than unicode():

   import codecs

   # Look up the codec once and reuse its decode function.
   decoder = codecs.lookup("utf-8").decode
   for i in xrange(1000000):
       decoder("äöüÄÖÜß")[0]
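
If you want to measure this on your own machine, something along these lines
might do. It is only a sketch: compare() and DATA are names invented for the
example, the iteration count is arbitrary, and the lambda wrappers add the
same call overhead to both variants. The byte string is written with escapes
so the snippet does not depend on the encoding of the source file.

```python
import codecs
import timeit

# UTF-8 bytes for the sample string used in the timings above.
DATA = b"\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f"

# Look the codec up once; the function returns (text, bytes_consumed).
decoder = codecs.lookup("utf-8").decode

def compare(number=100000):
    """Hypothetical helper: time both variants, return seconds for each."""
    cached = timeit.Timer(lambda: decoder(DATA)[0]).timeit(number)
    method = timeit.Timer(lambda: DATA.decode("utf-8")).timeit(number)
    return {"cached decoder": cached, ".decode() method": method}

if __name__ == "__main__":
    for label, seconds in sorted(compare().items()):
        print("%-18s %.3f s" % (label, seconds))
```

The absolute numbers will differ between interpreters and machines, so treat
them as a relative comparison only.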

However, there's also a functional difference between unicode() and 
str.decode():

unicode() always raises an exception (TypeError) when you pass it a unicode
object together with an encoding. Calling .decode() on a unicode object, in
contrast, first tries to encode the object using the default encoding
(usually "ascii"), which might or might not work.
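
A small sketch of that difference follows. The names text_type,
constructor_error() and method_error() are invented for the example. The
behaviour described above is that of Python 2; so that the snippet also runs
on Python 3, where unicode() does not exist and text objects have no
.decode() method, it falls back to str and reports the AttributeError
instead.

```python
import sys

# 'unicode' exists only on Python 2; on Python 3 the text type is 'str'.
text_type = unicode if sys.version_info[0] == 2 else str

def constructor_error(obj, encoding="utf-8"):
    """Return the name of the exception raised by text_type(obj, encoding)."""
    try:
        text_type(obj, encoding)
    except TypeError as exc:
        return type(exc).__name__
    return None

def method_error(obj, encoding="utf-8"):
    """Return the name of the exception raised by obj.decode(encoding)."""
    try:
        obj.decode(encoding)
    except (UnicodeError, AttributeError) as exc:
        return type(exc).__name__
    return None

# The constructor refuses text input outright.
print(constructor_error(u"abc"))
# On Python 2 the method works for pure ASCII text but fails for u"\xe4".
print(method_error(u"abc"), method_error(u"\xe4"))
```

On Python 2 with the default "ascii" encoding, the pure-ASCII case goes
through the method silently, which is exactly why the difference is easy to
miss until non-ASCII data shows up.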

Kind Regards,
M.F.