unicode() vs. s.decode()

Thu Aug 6 22:01:24 EDT 2009

Jason Tackaberry <tack <at> urandom.ca> writes:

> On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:

> > Suggested further avenues of investigation:
> > 
> > (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
> > 
> > (2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c
> 
> Very pedagogical of you. :)  Indeed, it looks like the bigger player in
> the performance difference is the fact that the code path for unicode(s,
> enc) short-circuits the codec registry for common encodings (which
> includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
> consults the codec registry.
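
For anyone who wants to reproduce the comparison, here is a rough timing
sketch, not a definitive benchmark. It times both spellings for each of the
encoding names suggested in (1). The sample data, the helper name time_both
and the iteration count are made up for illustration. On Python 3 there is
no unicode(), so the sketch falls back to str(s, enc) as a stand-in; that
fallback is an assumption and is not the Python 2.x setup discussed here.
Absolute numbers will vary with interpreter version and machine.

```python
import timeit

try:
    text_type = unicode      # Python 2.x: the constructor under discussion
except NameError:
    text_type = str          # Python 3.x stand-in (assumption, not the original setup)

DATA = b"some ascii-only sample text"   # hypothetical sample input


def time_both(encoding, number=100000):
    """Return (constructor_seconds, method_seconds) for one encoding name."""
    ctor = timeit.timeit(lambda: text_type(DATA, encoding), number=number)
    meth = timeit.timeit(lambda: DATA.decode(encoding), number=number)
    return ctor, meth


if __name__ == "__main__":
    # Same encoding spelled three ways, plus one non-UTF-8 codec for contrast.
    for name in ("utf-8", "utf8", "utf_8", "cp1252"):
        ctor, meth = time_both(name)
        print("%-7s unicode(s, enc): %.4fs   s.decode(enc): %.4fs"
              % (name, ctor, meth))
```

If the short-circuit only matches particular spellings, the three UTF-8
aliases need not time the same, which is what suggestion (1) is probing.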

So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:

    Why does consulting the codec registry take so long,
    and can this be improved?
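
One rough way to start on that question is to time the registry lookup on
its own, separately from the decoding work. The sketch below uses
codecs.lookup() and codecs.utf_8_decode(), both of which are in the standard
library; the sample data and the helper names via_registry and direct are
hypothetical. Note that this measures the Python-level lookup, which is only
an approximation of what s.decode() does internally in C.

```python
import codecs
import timeit

DATA = b"some ascii-only sample text"   # hypothetical sample input


def via_registry(data, encoding="utf-8"):
    # Ask the codec registry for the codec, then call its decode function.
    return codecs.lookup(encoding).decode(data)[0]


def direct(data):
    # Call the UTF-8 decoder directly, with no registry lookup at all.
    return codecs.utf_8_decode(data)[0]


if __name__ == "__main__":
    n = 100000
    lookup_only = timeit.timeit(lambda: codecs.lookup("utf-8"), number=n)
    with_lookup = timeit.timeit(lambda: via_registry(DATA), number=n)
    no_lookup = timeit.timeit(lambda: direct(DATA), number=n)
    print("lookup only:       %.4fs" % lookup_only)
    print("lookup + decode:   %.4fs" % with_lookup)
    print("direct decode:     %.4fs" % no_lookup)
```

If "lookup only" accounts for most of the gap between the other two figures,
the registry itself is the place to look for savings.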