
Guido van Rossum <guido@python.org> wrote:
On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
On Jun 24, 2010, at 4:59 PM, Guido van Rossum wrote:
Regarding the proposal of a String ABC, I hope this isn't going to become a backdoor to reintroduce the Python 2 madness of allowing equivalency between text and bytes for *some* strings of bytes and not others.
I never actually replied to this... Absolutely right, which is why you might really want another kind of string, rather than a way to treat some bytes values as strings in some places. Both Python 2 and Python 3 are missing one of the three types. Python 1 and 2 didn't have "bytes", and this caused problems because "str" was pressed into use to hold arbitrary byte sequences. (Python 2 "str" has other problems as well, like losing track of the encoding.) Python 3 doesn't have Python 2's "str" (encoded string), and bytes are being pressed into use for that. Each of these uses is an ad hoc hijack of an inappropriate type, and additional frameworks not directly supported by the Python language are being jury-rigged to try to support the uses. On the other hand, this is all in the eye of the beholder. Both byte sequences and strings are horrible formless things; they remind me of BLISS. You seldom really have a byte sequence; what you have is an XDR float or an encoded string or an IP header or an email message. Similarly for strings; they are really file names or city names or English sentences or URIs or other things with significant semantic constraints not captured by the typical type system. So, yes, there *is* an inescapable equivalency between text and bytes for *some* sequences of bytes (those that represent encoded strings) and not others (those that contain the XDR float, for instance). Creating a separate encoded string type would be one way to keep that straight.
For my part, what I want out of a string ABC is simply the ability to do application-specific optimizations. There are many applications where all input and output is text, but _must_ be UTF-8. Even GTK uses UTF-8 as its native text representation, so "output" could just be display. Right now, in Python 3, the only way to be "correct" about this is to copy every byte of input into 4 bytes of output, then copy each code point *back* into a single byte of output. If all your application does is rewrite the occasional XML attribute, for example, this cost can be significant, if not overwhelming. I'd like a version of 'decode' which would give me a type that was, in every respect, unicode, and responded to all protocols exactly as other unicode objects (or "str objects", if you prefer py3 nomenclature ;-)) do, but wouldn't actually copy any of that memory unless it really needed to (for example, to pass to a C API that expected native wide characters), and that would hold on to the original bytes so that it could produce them on demand if encoded to the same encoding again. So, as others in this thread have mentioned, the 'ABC' really implies some stuff about C APIs as well.
Seems like it.
I'm not sure about the exact performance impact of such a class, which is why I'd like the ability to implement it *outside* of the stdlib and see how it works on a project, and return with a proposal along with some data.
Yes, exactly.
There are also different ways to implement this, and other optimizations (like ropes) which might be better. You can almost do this today, but the lack of things like the hypothetical "__rcontains__" does make it impossible to be totally transparent about it.
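For illustration, here is a minimal sketch of the kind of wrapper described above. It is not a proposal: the class name LazyUTF8 and its attributes are hypothetical, it trusts its input, and it covers only a few protocols. It shows the two points under discussion: the original bytes come back untouched when encoding to the same encoding, and containment cannot be made transparent when the wrapper is the left operand.

```python
import codecs


class LazyUTF8:
    """Hypothetical str look-alike that keeps the original UTF-8 bytes.

    Illustration only: it trusts its input and covers a handful of protocols.
    """

    def __init__(self, raw: bytes):
        self._raw = raw      # original encoded bytes, kept as-is
        self._text = None    # decoded str, built only when first needed

    def _decoded(self) -> str:
        if self._text is None:
            self._text = self._raw.decode("utf-8")
        return self._text

    def encode(self, encoding: str = "utf-8", errors: str = "strict") -> bytes:
        # Same encoding as the source: hand back the original bytes unchanged.
        if codecs.lookup(encoding).name == "utf-8":
            return self._raw
        return self._decoded().encode(encoding, errors)

    def __str__(self) -> str:
        return self._decoded()

    def __len__(self) -> int:
        return len(self._decoded())

    def __eq__(self, other) -> bool:
        if isinstance(other, (str, LazyUTF8)):
            return self._decoded() == str(other)
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self._decoded())

    def __contains__(self, item) -> bool:
        # Handles `"x" in wrapper`; nothing handles `wrapper in "xyz"`.
        return str(item) in self._decoded()


w = LazyUTF8("h\u00e9llo".encode("utf-8"))
print("\u00e9" in w)        # wrapper on the right: __contains__ is called
print("h\u00e9llo" == w)    # equality falls back to the reflected __eq__
try:
    w in "h\u00e9llo world"  # wrapper on the left: str.__contains__ rejects it
except TypeError as exc:
    print("TypeError:", exc)
```

Equality works from either side because Python retries a comparison with the operands swapped, but there is no such fallback for "in", which is the gap the hypothetical "__rcontains__" would fill.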
But you'd still have to validate it, right? You wouldn't want to go on using what you thought was wrapped UTF-8 if it wasn't actually valid UTF-8 (or you'd be worse off than in Python 2).
Yes, but there are different ways to validate it that have different performance impacts. Simply trusting the source of the string, for example, would be appropriate in some cases.
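As a rough sketch of two such validation policies, assuming hypothetical helper names (is_valid_utf8, accept_utf8): one checks the bytes in fixed-size chunks with the standard incremental decoder, so no full-size decoded copy is ever held, and the other simply trusts the source and does no work at all.

```python
import codecs


def is_valid_utf8(raw: bytes, chunk_size: int = 65536) -> bool:
    """Check validity chunk by chunk, discarding the decoded text."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for start in range(0, len(raw), chunk_size):
            # The decoder buffers a multi-byte sequence split across chunks.
            decoder.decode(raw[start:start + chunk_size])
        # final=True makes a truncated trailing sequence an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def accept_utf8(raw: bytes, trusted: bool = False) -> bytes:
    """Hypothetical entry point: skip the check when the source is trusted."""
    if not trusted and not is_valid_utf8(raw):
        raise ValueError("input is not valid UTF-8")
    return raw
```

The chunked check still reads every byte, so it saves memory rather than time; only the trusted path avoids the cost entirely.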
So you're really just worried about space consumption. I'd like to see a lot of hard memory profiling data before I got overly worried about that.
While I've seen some big Web pages, I think the email folks, who often have to process messages with attachments measuring in the tens of megabytes, have the more serious problems here, and I think speed may be more important than memory. I've built both a Web server and an IMAP server in Python, and the IMAP server is where the issues of storage management really prevail. If you have to convert a 20 MB encoded string into a Unicode string just to look at the headers as strings, you have issues. (The Python email package doesn't do that, by the way.)
Bill
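To make the header case concrete, here is a simplified sketch that decodes only the header block and leaves the body as an uncopied view of the original bytes. The function name split_headers is hypothetical, the blank-line rule is a simplification of real message parsing, and this is not a description of how the email package works.

```python
def split_headers(raw_message: bytes):
    """Return (header text, body view) without decoding or copying the body."""
    end = raw_message.find(b"\r\n\r\n")   # blank line ends the header block
    if end == -1:
        end = len(raw_message)            # no body: everything is headers
    # Only the (small) header block is decoded into a str.
    header_text = raw_message[:end].decode("ascii", errors="replace")
    # memoryview slices share the underlying buffer instead of copying it.
    body = memoryview(raw_message)[end + 4:]
    return header_text, body


raw = b"Subject: hi\r\nFrom: a@example.org\r\n\r\n" + b"\x00" * 100
headers, body = split_headers(raw)
print(headers.split("\r\n"))
print(len(body))
```

With a 20 MB message the decode step touches only the few kilobytes before the first blank line, whatever the size of the attachments that follow.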