
On Fri, Jun 25, 2010 at 4:02 PM, Guido van Rossum <guido@python.org> wrote:
> On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz wrote:
>> I'd like a version of 'decode' which would give me a type that was, in every respect, unicode, and responded to all protocols exactly as other unicode objects (or "str objects", if you prefer py3 nomenclature ;-)) do, but wouldn't actually copy any of that memory unless it really needed to (for example, to pass to a C API that expected native wide characters), and that would hold on to the original bytes so that it could produce them on demand if encoded to the same encoding again. So, as others in this thread have mentioned, the 'ABC' really implies some stuff about C APIs as well.
>>
>> I'm not sure about the exact performance impact of such a class, which is why I'd like the ability to implement it *outside* of the stdlib, see how it works on a project, and return with a proposal along with some data. There are also different ways to implement this, and other optimizations (like ropes) which might be better. You can almost do this today, but the lack of things like a hypothetical "__rcontains__" makes it impossible to be totally transparent about it.
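In rough terms, the round-tripping half of that idea can be sketched in pure Python today; the zero-copy decode itself cannot, which is part of the C-API point above. The DecodedStr class below is purely illustrative (the name and every detail are assumptions, not Glyph's actual design): it decodes eagerly, but caches the original bytes and hands them back verbatim when re-encoded to the same codec.

    import codecs

    class DecodedStr(str):
        """A str that remembers the bytes it was decoded from."""

        def __new__(cls, raw, encoding='utf-8'):
            # Eager decode: this also validates the bytes up front.
            self = super().__new__(cls, raw.decode(encoding))
            self._raw = raw
            self._codec = codecs.lookup(encoding).name  # canonical codec name
            return self

        def encode(self, encoding='utf-8', errors='strict'):
            # Encoding back to the original codec returns the cached bytes.
            if errors == 'strict' and codecs.lookup(encoding).name == self._codec:
                return self._raw
            return super().encode(encoding, errors)

    s = DecodedStr(b'caf\xc3\xa9')
    assert s == 'caf\xe9'                     # behaves like any other str
    assert s.encode('utf-8') is s._raw        # round-trip reuses the original bytes
    assert s.encode('latin-1') == b'caf\xe9'  # other codecs still work normally

Keying the cache on codecs.lookup(...).name rather than the raw encoding string means aliases such as 'UTF8' and 'utf-8' round-trip through the same cached bytes.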
> But you'd still have to validate it, right? You wouldn't want to go on using what you thought was wrapped UTF-8 if it wasn't actually valid UTF-8 (or you'd be worse off than in Python 2). So you're really just worried about space consumption. I'd like to see a lot of hard memory profiling data before I got overly worried about that.
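That validation point is easy to see in an interpreter session (illustrative, Python 3): a wrapper that skipped the check would silently propagate bogus bytes, so wrapping has to fail exactly the way an eager decode does.

    >>> b'\xc3\x28'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: invalid continuation byte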
It wasn't my profiling, but I seem to recall that Fredrik Lundh specifically benchmarked ElementTree with all-unicode strings versus sometimes-ASCII bytes, and found that using Python 2 strs in some cases provided notable advantages. I know Stefan copied ElementTree in this regard in lxml; maybe he also did a benchmark, or knows of one?

--
Ian Bicking | http://blog.ianbicking.org