[Python-Dev] Maintenance burden of str.swapcase

Stephen J. Turnbull stephen at xemacs.org
Thu Sep 8 04:46:42 CEST 2011


Glyph Lefkowitz writes:
 > On Sep 7, 2011, at 10:26 AM, Stephen J. Turnbull wrote:
 > 
 > > How about "title"?
 > 
 > >>> 'content-length'.title()
 > 'Content-Length'
 > 
 > You might say that the protocol "has" to be case-insensitive so
 > this is a silly frill:

Not me, sir.  My whole point about the "bytes should be more like str"
controversy is the dual of that: you don't know what will be coming at
you, so the regularities and (normally allowable) fuzziness of text
processing are inadmissible.

 > there are definitely enough case-sensitive crappy bits of network
 > middleware out there that this function is critically important for
 > an HTTP server.

"Critically important" is surely an overstatement.  You could always
title-case the literal strings containing field names in the source.
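
To illustrate the literal-strings approach, a minimal sketch.  The
constant names and the format_header helper are hypothetical, invented
for this example; only bytes.title() is an existing method.

```python
# Field names spelled in canonical title case directly in the source,
# so no call to .title() is needed at run time.
CONTENT_LENGTH = b"Content-Length"   # hypothetical constant names
CONTENT_TYPE = b"Content-Type"


def format_header(name, value):
    """Join a field name and value into one wire-format header line."""
    return name + b": " + value + b"\r\n"


line = format_header(CONTENT_LENGTH, b"42")
print(line)
```

The literal gives the same bytes that b"content-length".title() would
produce, without needing a text method on bytes.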

The problem with having lots of str-like features on bytes is that you
lose TOOWDTI or, worse, for many performance-happy coders, use of bytes
becomes TOOWDTI "because none of the characters[sic] I'm planning to
process myself are non-ASCII".  This is the road to Babel; it's
workable for one-off scripts but it's asking for long-term trouble in
multi-module applications.  The choice of decoding to str and
processing in that form should be made as attractive as possible.
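
A sketch of the decode-then-process style, assuming the header line is
expected to be pure ASCII.  The sample input and variable names are
made up for illustration.

```python
raw = b"content-length: 42\r\n"   # sample input, assumed for this sketch

# Decode once at the boundary.  The strict "ascii" codec raises
# UnicodeDecodeError on any byte >= 0x80, so surprises surface here.
text = raw.decode("ascii")

# From here on the full suite of str methods is available.
name, _, value = text.partition(":")
name = name.strip().title()
value = value.strip()

# Encode once on the way back out.
wire = "{}: {}\r\n".format(name, value).encode("ascii")
print(wire)
```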

On the other hand, it is undeniably useful for protocol tokens to have
mnemonic representations even in binary protocols.  Textual
manipulations on those tokens should be convenient.

It seems to me that what might be an improvement over the current
situation (maybe for Py4k only, though) is for bytes and
(PEP-393-style) str to share representation, and have a "cast" method
which would convert from one to the other, validating that the range
constraints on the representation are satisfied.  The problem I see is
that this either sanctions the practice of using latin-1 as "ASCII
plus anything", which is an unpleasant hack, or you'd need to check in
text methods that nothing is done with non-ASCII values other than
checks for set membership (including equality comparison, of course).
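
A rough sketch of what such a validating "cast" could look like in
today's Python.  Both function names and the allow_latin1 flag are
hypothetical, and this version copies the data rather than sharing the
representation, so it only illustrates the range check, not the
efficiency idea.

```python
def cast_to_str(data, allow_latin1=False):
    """Hypothetical cast: bytes -> str, validating the value range."""
    if not allow_latin1 and any(b > 0x7F for b in data):
        raise ValueError("byte outside the ASCII range")
    # latin-1 maps each byte 0..255 to the code point of the same value.
    return data.decode("latin-1")


def cast_to_bytes(text, allow_latin1=False):
    """Hypothetical cast: str -> bytes, validating the value range."""
    limit = 0xFF if allow_latin1 else 0x7F
    if any(ord(ch) > limit for ch in text):
        raise ValueError("code point outside the permitted range")
    return text.encode("latin-1")


token = cast_to_str(b"Content-Length")
print(token, cast_to_bytes(token))
```

With allow_latin1=True this is exactly the "ASCII plus anything" use
of latin-1; with the default it is the strict ASCII variant.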

OTOH, AFAICS, Antoine's claim that inserting a non-latin-1 character
in a str that happens to contain only ASCII values would convert the
representation to multioctets (true), and therefore this doesn't give
the desired efficiency properties, is beside the point.  Just don't do
that!  You *can't* do that in a bytes object, anyway; use of str in
this way is a "consenting adults" issue.  You trade off the
convenience of the full suite of text tools vs. the possibility that
somebody might insert such a character -- but for the algorithms
they're going to be using, they shouldn't be doing that anyway.
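
The widening can be observed on an implementation with PEP 393 strings
(CPython 3.3 or later is assumed here; exact sizes vary by version, so
none are shown).  The variable names are illustrative only.

```python
import sys

ascii_only = "Content-Length" * 10
# EURO SIGN (U+20AC) does not fit in latin-1, so under PEP 393 the
# whole string is stored with wider code units.
widened = ascii_only + "\u20ac"

print(sys.getsizeof(ascii_only), sys.getsizeof(widened))

# A bytes object cannot hold such a value in the first place.
try:
    bytes([0x20AC])
    bytes_accepted = True
except ValueError:
    bytes_accepted = False
print(bytes_accepted)
```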


