
Ian Bicking, 26.06.2010 00:26:
On Fri, Jun 25, 2010 at 4:02 PM, Guido van Rossum wrote:
On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz wrote:
I'd like a version of 'decode' which would give me a type that was, in every respect, unicode, and responded to all protocols exactly as other unicode objects (or "str objects", if you prefer py3 nomenclature ;-)) do, but wouldn't actually copy any of that memory unless it really needed to (for example, to pass to a C API that expected native wide characters), and that would hold on to the original bytes so that it could produce them on demand if encoded to the same encoding again. So, as others in this thread have mentioned, the 'ABC' really implies some stuff about C APIs as well.
Well, there's the buffer API, so you can already create something that refers to an existing C buffer. However, with respect to a string, you will have to make sure the underlying buffer doesn't get freed while the string is still in use. That will be hard and sometimes impossible to do at the C-API level, even if the string is allowed to keep a reference to something that holds the buffer. At least in lxml, such a feature would be completely worthless, as text is never held by any ref-counted Python wrapper object. It's only part of the XML tree, which is allowed to change at (more or less) any time, so the underlying char* buffer could just get freed without further notice. Adding a guard against that would likely have a larger impact on the performance than the decoding operations.
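The lifetime problem can be seen in miniature from pure Python. This is only a sketch: a bytearray stands in for a buffer owned by some other structure (the XML tree, in the lxml case), and the variable names are made up for illustration.

```python
# A bytearray stands in for a mutable buffer owned by some other structure.
tree_buffer = bytearray(b"some text content")

# A memoryview refers to the same memory without copying it.
view = memoryview(tree_buffer)

# While the view is alive, the owner is not allowed to resize its buffer.
try:
    tree_buffer.extend(b" and more")
    resized = True
except BufferError:
    resized = False

# In-place changes that keep the size are allowed and show through the view.
tree_buffer[0] = ord("S")
seen_through_view = bytes(view[:4])

# Once the view is released, the owner is free to resize again.
view.release()
tree_buffer.extend(b" and more")
```

The guard works here only because both sides cooperate through the buffer protocol. A C library that frees or reallocates its own char* buffers knows nothing about such exports, which is where the difficulty described above comes from.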
I'm not sure about the exact performance impact of such a class, which is why I'd like the ability to implement it *outside* of the stdlib and see how it works on a project, and return with a proposal along with some data. There are also different ways to implement this, and other optimizations (like ropes) which might be better. You can almost do this today, but the lack of things like the hypothetical "__rcontains__" does make it impossible to be totally transparent about it.
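A rough sketch of what such an outside-the-stdlib class might look like, and where transparency breaks down. `LazyText` is a hypothetical name, not an existing type, and the sketch covers only a tiny part of the str protocol. Note that it validates lazily, on first decode, rather than up front.

```python
class LazyText:
    """Hypothetical wrapper: keeps the original bytes, decodes only on demand."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._encoding = encoding
        self._decoded = None  # filled in on first use

    def __str__(self):
        if self._decoded is None:
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

    def encode(self, encoding="utf-8"):
        # Same encoding as the source: hand back the original bytes unchanged.
        if encoding == self._encoding:
            return self._raw
        return str(self).encode(encoding)

    def __contains__(self, item):
        # "x in lazy" works, because Python asks the right-hand operand.
        return str(item) in str(self)


lazy = LazyText("na\u00efve".encode("utf-8"))
same_object = lazy.encode("utf-8") is lazy._raw
left_ok = "\u00ef" in lazy

# "lazy in some_str" asks str.__contains__, which rejects non-str operands.
# There is no reflected hook the wrapper could implement to step in.
try:
    lazy in "a na\u00efve example"
    right_ok = True
except TypeError:
    right_ok = False
```

The last case is the gap that a hypothetical "__rcontains__" would fill: arithmetic operators have reflected variants such as `__radd__`, but the membership test does not.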
But you'd still have to validate it, right? You wouldn't want to go on using what you thought was wrapped UTF-8 if it wasn't actually valid UTF-8 (or you'd be worse off than in Python 2). So you're really just worried about space consumption. I'd like to see a lot of hard memory profiling data before I got overly worried about that.
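To make the validation point concrete: checking that bytes are valid UTF-8 takes a full pass over the data even if the decoded result is thrown away. A minimal sketch, with `is_valid_utf8` as a hypothetical helper name:

```python
def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")  # error handling is 'strict' by default
    except UnicodeDecodeError:
        return False
    return True


good = "caf\u00e9".encode("utf-8")  # ends in a two-byte sequence
bad = good[:-1]                     # cut off in the middle of that sequence
```

This version still builds the decoded string temporarily. A wrapper that validated without decoding would skip that allocation but not the scan itself, so what remains to be saved is space rather than time.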
It wasn't my profiling, but I seem to recall that Fredrik Lundh specifically benchmarked ElementTree with all-unicode and sometimes-ascii-bytes, and found that using Python 2 strs in some cases provided notable advantages. I know Stefan copied ElementTree in this regard in lxml, maybe he also did a benchmark or knows of one?
Actually, bytes vs. unicode doesn't make that big a difference in Py2 for lxml. ElementTree is a lot older, so I guess it made a larger difference when its code was written (and I even think I recall seeing numbers for lxml where it seemed to make a notable difference).

In lxml, text content is stored in the C tree of libxml2 as UTF-8 encoded char* text. On request, lxml creates a string object from it and returns it. In Py2, it checks for plain ASCII content first and returns a byte string for that. Only non-ASCII strings are returned as decoded unicode strings. In Py3, it always returns unicode strings.

When I run a little benchmark on lxml in Py2.6.5 that just reads some short text content from an Element object, I only see a tiny difference between unicode strings and byte strings. The gap obviously increases when the text gets longer, e.g. when I serialise the complete text content of an XML document to either a byte string or a unicode string. But even for documents in the megabyte range we are still talking about single milliseconds here, and the difference stays well below 10%. It's seriously hard to make that the performance bottleneck in an XML application.

Also, since the string objects are only instantiated on request, memory isn't an issue either. That's different for (c)ElementTree again, where string content is stored as Python objects. Four times the size even for plain ASCII strings (e.g. numbers, IDs or even trailing whitespace!) can well become a problem there, and can easily dominate the overall size of the in-memory tree. Plain ASCII content is surprisingly common in XML documents.

Stefan
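For illustration, the Py2 strategy described above could be sketched roughly like this. It is written for Python 3 so that it runs today, `text_from_utf8` is a hypothetical helper and not lxml's actual code, and the size comparison assumes that "four times" refers to a wide unicode build storing four bytes per character.

```python
def text_from_utf8(raw):
    """Hypothetical helper mimicking the ASCII-first strategy."""
    if raw.isascii():           # bytes.isascii() needs Python 3.7+
        return raw              # plain ASCII: hand back the byte string as-is
    return raw.decode("utf-8")  # anything else: decode to a unicode string


ascii_result = text_from_utf8(b"12345")
unicode_result = text_from_utf8("Stra\u00dfe".encode("utf-8"))

# ASCII text stored with four bytes per character, as assumed above:
# the UTF-32 form (without BOM) takes exactly four times the bytes.
ascii_text = b"id-00042   "
wide_size = len(ascii_text.decode("ascii").encode("utf-32-le"))
```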