[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

Thu Feb 24 17:35:39 CET 2011

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky at users.sourceforge.net> added the comment:
> 
> On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
> <report at bugs.python.org> wrote:
> ..
>> On this ticket, we're discussing just one application area: that
>> of the builtin short cuts.
>>
> Fair enough.  I was hoping to close this ticket by simply committing
> the posted patch, but it looks like people want to do more.  I don't
> think we'll get measurable performance gains but may improve code
> understandability.
> 
>> To have more encoding name variants benefit from the optimization,
>> we might want to enhance that particular normalization function
>> to avoid having to compare against "utf8" and "utf-8" in the
>> encode/decode functions.
> 
> Which function are you talking about?
> 
> 1. normalize_encoding() in unicodeobject.c
> 2. normalizestring() in codecs.c

The first one, since that's being used by the shortcuts.

> The first is s.lower().replace('-', '_') and the second is

It does this: s.lower().replace('_', '-')

> s.lower().replace(' ', '_'). (Note space vs. dash difference.)
> 
> Why do we need both?  And why should they be different?

Because the first is specifically used for the shortcuts
(which can do more without breaking anything, since it's
only used internally) and the second prepares the encoding
names for lookup in the codec registry (which has a PEP 100
defined behavior we cannot easily change).
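
To make the difference concrete, here is a rough Python sketch of the
two normalizations as described in this thread. The function names and
the SHORTCUT_NAMES set are made up for illustration; the real code is C
in unicodeobject.c and codecs.c and may differ in detail.

```python
# Hypothetical Python equivalents of the two C helpers discussed above.
# Names and the shortcut set are illustrative assumptions, not the real API.

def normalize_for_shortcut(name: str) -> str:
    """Mimics normalize_encoding() in unicodeobject.c as described here:
    lower-case the name and turn underscores into dashes."""
    return name.lower().replace('_', '-')


def normalize_for_registry(name: str) -> str:
    """Mimics normalizestring() in codecs.c as described here:
    lower-case the name and turn spaces into underscores."""
    return name.lower().replace(' ', '_')


# Illustrative subset only; the actual list of shortcut names lives in C.
SHORTCUT_NAMES = {"utf-8", "latin-1", "ascii"}


def takes_shortcut(name: str) -> bool:
    """True if the normalized name matches one of the shortcut spellings."""
    return normalize_for_shortcut(name) in SHORTCUT_NAMES


if __name__ == "__main__":
    for spelling in ("latin-1", "Latin_1", "latin1", "UTF_8", "utf8"):
        print(spelling, "->", normalize_for_shortcut(spelling),
              "shortcut" if takes_shortcut(spelling) else "registry lookup")
```

With this sketch, "Latin_1" and "UTF_8" normalize to a shortcut
spelling, while "latin1" and "utf8" do not and would fall through to
the slower registry lookup, which is the behavior the ticket title
complains about.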

----------
title: b'x'.decode('latin1') is much slower	than	b'x'.decode('latin-1') -> b'x'.decode('latin1') is much slower than	b'x'.decode('latin-1')

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue11303>
_______________________________________