[Python-Dev] Maintenance burden of str.swapcase

Thu Sep 8 00:29:33 CEST 2011

On Thu, Sep 8, 2011 at 3:51 AM, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
> On Sep 7, 2011, at 10:26 AM, Stephen J. Turnbull wrote:
>
> How about "title"?
>
> >>> 'content-length'.title()
> 'Content-Length'
> You might say that the protocol "has" to be case-insensitive so this is a
> silly frill: there are definitely enough case-sensitive crappy bits of
> network middleware out there that this function is critically important for
> an HTTP server.

Actually, the HTTP header case occurred to me as well shortly after
sending my last message, so I think it's a legitimate reason to keep
the methods around on bytes and bytearray.

So, putting my "practicality beats purity" hat back on, I would
describe the status quo as follows:

1. Binary data is not text, so bytes and bytearray are deliberately
conceptualised as arrays of arbitrary integers in the range 0-255
rather than as arrays of 8-bit 'characters'. This distinction is one
of the core design principles separating Python 3 from Python 2.

2. However, the use of ASCII words and characters is a common feature
of many existing wire protocols, so it is useful to be able to
manipulate binary sequences that contain data in an ASCII-compatible
format without having to convert them to text first. Retaining
additional ASCII-based methods also eases the transition to Python 3
for code that manipulates binary data using the 2.x str type.

3. ASCII whitespace characters are used as delimiters in many formats.
Thus, various methods such as split(), partition(), strip() and their
variants, retain their "ASCII whitespace" default arguments and
expandtabs() is also retained.

4. Padding values out to fill fields of a certain size is needed for
some formats. Thus, center(), ljust(), rjust() and zfill() are retained
(again retaining their ASCII space default fill character in the case
of the first three methods).

5. Identifying ASCII alphanumeric data is important for some formats.
Thus, isalnum(), isalpha() and isdigit() are retained.

6. Case-insensitive ASCII comparisons are important for some formats
(e.g. RFC 822 headers, HTTP headers). Thus, upper(), lower(),
isupper() and islower() are retained.

7. Even correct mixed-case ASCII can be important for some formats
(e.g. HTTP headers). Thus, capitalize(), title() and istitle() are
retained.

8. A valid use for swapcase() on binary data has not been identified,
but once all the other ASCII-based methods are being kept around for
the various reasons given above, it doesn't seem worth the effort to
get rid of this one (despite the additional implementation effort
needed for alternative implementations).

9. Algorithms that operate purely on binary data or purely on text can
just use literals of the appropriate type (if they use literals at
all). Algorithms that are designed to operate on either kind of data
may want to adopt an implicit decode/encode approach to handle binary
inputs (this allows assumptions regarding the input encoding to be
made explicit).
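
As a rough sketch of the implicit decode/encode approach in point 9
(the helper name and the ASCII default are purely illustrative
assumptions, not an existing API):

```python
def normalise_header_name(name, encoding="ascii"):
    """Title-case a header name supplied as either str or bytes.

    Hypothetical helper: binary input is decoded up front, handled as
    text, and encoded again on the way out, so the assumption about the
    input encoding is stated in exactly one place.
    """
    is_binary = isinstance(name, (bytes, bytearray))
    text = name.decode(encoding) if is_binary else name
    result = text.title()
    # Binary input (bytes or bytearray) comes back as bytes.
    return result.encode(encoding) if is_binary else result
```

With the default encoding, input that isn't valid ASCII fails at the
decode step with a UnicodeDecodeError rather than being silently
mangled.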

I'm actually fairly happy with that rationalisation for the current
Python 3 setup. I'd been thinking recently that we would have been
better off if more of the methods that rely on the data using an ASCII
compatible encoding scheme had been removed from bytes and bytearray,
but swapcase() is really the only one we can't give a decent
justification for beyond "it was there in 2.x".
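
For anyone who wants to poke at this, here is a small illustration of
the retained methods applied directly to bytes (the header line is
made up for the example; only bytes in the ASCII range are affected by
the case methods):

```python
# A made-up header line, handled without ever converting to str.
header = b"content-length: 42\r\n"

# partition() and strip() work on binary data; strip() defaults to
# removing ASCII whitespace, which includes the trailing CRLF.
name, _, value = header.partition(b":")
value = value.strip()

canonical = name.title()     # b'Content-Length'
shouted = name.upper()       # b'CONTENT-LENGTH'
numeric = value.isdigit()    # True
padded = value.zfill(6)      # b'000042'

# swapcase() is there too, even without an obvious use for it.
swapped = canonical.swapcase()   # b'cONTENT-lENGTH'

# Bytes outside the ASCII range are passed through untouched.
mixed = b"\xe9A".lower()     # b'\xe9a'
```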

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia