[Python-Dev] bytes / unicode

Robert Collins robertc at robertcollins.net
Wed Jun 23 02:57:48 CEST 2010


On Wed, Jun 23, 2010 at 12:25 PM, Glyph Lefkowitz
<glyph at twistedmatrix.com> wrote:
> I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is significant overhead.  I believe that encoding _will_ be a significant overhead for some applications (and actually I think it will be very significant for some applications that I work on), but optimizations should really be implemented once that's been demonstrated, so that there's a better understanding of what the overhead is, exactly.  Is memory a big deal?  Is CPU?  Is it both?  Do you want to tune for the tradeoff?  etc, etc.  Clever data-structures seem premature until someone has a good idea of all those things.

bzr has a cache of decoded strings in it precisely because decode is
slow. We accept slowness when encoding to the user's locale because
that's typically much less data to examine than we've examined while
generating the commit/diff/whatever. We also face memory pressure on a
regular basis, and that has been, at least partly, due to UCS4 - our
translation cache helps there because we have fewer duplicate UCS4
strings.

You're welcome to dig deeper into this, but I don't have more detail
paged into my head at the moment.

-Rob
