unicode() vs. s.decode()
sjmachin at lexicon.net
Fri Aug 7 04:01:24 CEST 2009
Jason Tackaberry <tack <at> urandom.ca> writes:
> On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:
> > Suggested further avenues of investigation:
> > (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
> > (2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c
> Very pedagogical of you. :) Indeed, it looks like the bigger player in the
> performance difference is the fact that the code path for unicode(s,
> enc) short-circuits the codec registry for common encodings (which
> includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
> consults the codec registry.
So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:
Why does consulting the codec registry take so long,
and can this be improved?
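
For anyone who wants to measure this on their own build, here is a rough
sketch, not a definitive benchmark. It times the constructor form, the
method form, and a bare codecs.lookup() for several spellings of the
encoding name. The names to_text, DATA, SPELLINGS, time_call and compare
are made up for illustration, and the fallback to str is an assumption so
that the script also runs on Python 3, where unicode() no longer exists.

```python
import codecs
import timeit

try:
    to_text = unicode   # Python 2.x, the version discussed in this thread
except NameError:
    to_text = str       # assumption: on Python 3, str(bytes_obj, enc) is the closest equivalent

# Hypothetical sample input; any byte string valid in all tested encodings will do.
DATA = b"hello world " * 10
SPELLINGS = ["utf-8", "utf8", "utf_8", "cp1252"]


def time_call(func, number=100000):
    """Return the best-of-three total time in seconds for `number` calls."""
    return min(timeit.repeat(func, repeat=3, number=number))


def compare(enc, number=100000):
    """Time the constructor form, the method form, and a bare registry lookup."""
    return {
        "constructor": time_call(lambda: to_text(DATA, enc), number),
        "decode": time_call(lambda: DATA.decode(enc), number),
        "lookup": time_call(lambda: codecs.lookup(enc), number),
    }


if __name__ == "__main__":
    for enc in SPELLINGS:
        t = compare(enc)
        print("%-8s constructor=%.4f decode=%.4f lookup=%.4f"
              % (enc, t["constructor"], t["decode"], t["lookup"]))
```

Each timed call goes through a lambda, so the absolute numbers include
that overhead; only the differences between columns and between spellings
are meaningful. The "lookup" column gives a rough idea of how much of the
decode() time is spent in the registry.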