Glyph Lefkowitz wrote:
On Jun 25, 2010, at 5:02 PM, Guido van Rossum wrote:
But you'd still have to validate it, right? You wouldn't want to go on using what you thought was wrapped UTF-8 if it wasn't actually valid UTF-8 (or you'd be worse off than in Python 2). So you're really just worried about space consumption.
So, yes, I am mainly worried about memory consumption, but don't underestimate the pure CPU cost of doing all the copying. It's quite a bit faster to simply scan through a string than to scan it while simultaneously faulting out the L2 cache by writing the copy to some other area of memory.
Yes, but you are already talking about optimizations that might be significant for large-ish strings (where "large-ish" depends on exactly where Moore's Law is currently delivering computational performance). The cache consumed by a ten-byte string will slip by unnoticed, but megabytes would effectively flush an L2 cache.
Plus, if I am decoding with the surrogateescape error handler (or its effective equivalent), then no, I don't need to validate it in advance; interpretation can be done lazily as necessary. I realize that this is just GIGO, but I wouldn't be doing this on data that didn't have an explicitly declared or required encoding in the first place.
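For readers unfamiliar with the handler Glyph is referring to, a minimal sketch of why surrogateescape lets you defer validation: invalid bytes are smuggled through as lone surrogates rather than raising an exception, and they round-trip back to the original bytes exactly.

```python
# Bytes that are not valid UTF-8 (\xff and \xfe can never appear
# in well-formed UTF-8).
raw = b"valid text \xff\xfe not UTF-8"

# "Decode" without validating: the bad bytes become U+DCFF, U+DCFE
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in text

# Encoding with the same handler recovers the original bytes exactly,
# so nothing is lost by deferring interpretation.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

The GIGO caveat stands: the surrogates only blow up later if something tries to encode them with a strict handler.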
I'd like to see a lot of hard memory profiling data before I got overly worried about that.
I know of several Python applications that are already constrained by memory. I don't have a lot of hard memory profiling data, but in an environment where you're spawning as many processes as you can in order to consume _all_ the physically available RAM for string processing, it stands to reason that properly decoding everything and thereby exploding everything out into 4x as much data (or 2x, if you're lucky) would result in a commensurate decrease in throughput.
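The 4x figure can be illustrated with a rough back-of-envelope check (a sketch, not profiling data; exact numbers depend on the interpreter build). On the wide Unicode builds current when this thread was written, decoding ASCII bytes to str cost roughly four bytes per character; CPython 3.3's flexible representation (PEP 393) later removed that penalty for ASCII data.

```python
import sys

# A megabyte of raw ASCII bytes versus the same data decoded to str.
# On a pre-3.3 UCS-4 build the decoded object would be roughly 4x
# larger; on modern CPython ASCII text is stored one byte per char.
data = b"x" * 1_000_000
decoded = data.decode("ascii")

print(sys.getsizeof(data), sys.getsizeof(decoded))
```

Either way, decoding always produces a second, separate object, so "decode everything up front" at minimum doubles the resident data while both copies are alive.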
Yes, UCS-4's impact does seem like it could be horrible for these use cases. But "knowing of several Python applications that are already constrained by memory" doesn't mean that it's a bad general decision. Most users will never notice the difference, so we should try to accommodate those who do notice a difference without inconveniencing the rest too much.
I don't think I could even reasonably _propose_ that such a project stop treating textual data as bytes, because there's no optimization strategy once that sort of architecture has been put into place. If your function says "this takes unicode", then you just have to bite the bullet and decode it, or rewrite it again to have a different requirement.
That has always been my understanding. I regard it as a sort of intellectual tax on the United States (and its Western collaborators) for being too dim to realise that eventually they would end up selling computers to people with more than 256 characters in their alphabet. Sorry guys, but your computers are only as fast as you think they are when you only talk to each other.
So, right now, I don't know where I'd get the data to make the argument in the first place :). If there were some abstraction in the core's treatment of strings, though, I could decode things and note their encoding without immediately paying this cost (or, alternatively, pay the cost to see if it's really so bad, with the option of managing or optimizing it separately). This is why I'm asking for a way to implement my own string type, and not for a change of behavior or an optimization in the stdlib itself: I could be wrong (I don't have a particularly high level of certainty in my performance estimates), but I think that my concerns are realistic enough that I don't want to embark on a big re-architecture of text-handling only to have it become a performance nightmare that needs to be reverted.
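To make the "my own string type" idea concrete, here is a hypothetical sketch (the class name and API are invented for illustration, not an existing or proposed interface) of a wrapper that notes the encoding up front but defers the decode until the text is actually used as characters:

```python
# Hypothetical lazy string type: holds the raw bytes plus their
# declared encoding, and only pays the decoding cost on first use.
class LazyText:
    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self._raw = raw
        self._encoding = encoding
        self._decoded = None  # populated on first character access

    def __bytes__(self):
        # No decode, no copy: cheap for code that only shuttles bytes.
        return self._raw

    def __str__(self):
        if self._decoded is None:
            # Validation and the memory cost both land here, lazily.
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

s = LazyText(b"hello")
assert bytes(s) == b"hello"   # still no decode has happened
assert s._decoded is None
assert str(s) == "hello"      # decoded once, then cached
```

Whether something like this could actually be dropped in where the core expects str is exactly the abstraction question being raised here.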
Recent experience with the thoroughness of the Python 3 release preparations leads me to believe that *anything* new needs to prove its worth outside the stdlib for a while.
As Robert Collins pointed out, they already have performance issues related to encoding in Bazaar. I know they've done a lot of profiling in that area, so I hope eventually someone from that project will show up with some data to demonstrate it :). And I've definitely heard many, many anecdotes (some of them in this thread) about people distorting their data structures in various ways to avoid paying decoding cost in the ASCII/latin1 case, whether it's *actually* a significant performance issue or not. I would very much like to tell those people "Just call .decode(), and if it turns out to actually be a performance issue, you can always deal with it later, with a custom string type." I'm confident that in *most* cases, it would not be.
Well that would be a nice win.
Anyway, this may be a serious issue, but I increasingly feel like I'm veering into python-ideas territory, so perhaps I'll just have to burn this bridge when I come to it. Hopefully after the moratorium.
Sounds like it's worth pursuing, though. I mean after all, we don't want to leave *all* the bit-twiddling to the low-level language users ;-).
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" - Ian Dury, 1942-2000