[Python-Dev] Maintenance burden of str.swapcase
Stephen J. Turnbull
stephen at xemacs.org
Wed Sep 7 19:26:00 CEST 2011
Antoine Pitrou writes:
> You could also point out UTF-16 or EBCDIC, but I fail to see how that's
> relevant. Do you have problems with ISO 2022 when parsing, say, e-mail
> headers?
Yes, of course! Especially when it's, say, packed EUC not encapsulated
in MIME words. I think Mailman now handles that without crashing, but
it took 10 years. Most Emacs MUAs still blow chunks on that. My
procmail recipes and my employer's virus checker both occasionally punt.
The point about ISO 2022 is that it allows arbitrary binary crap in
the stream, delimited by appropriate well-defined constructs. Just
like the ASCII-like tokens in the protocols you talk about. But
parsing full-bore ISO 2022 is non-trivial, especially if you're going
to try to provide error-handling that's useful to the user. Nobody
ever really took it seriously as a solution to the problem of
internationalization in the 15 years or so when it was the only
solution, and even less so once it became clear that UCSes were going
to get traction.
> > > not arbitrary "arrays of bytes". And making indexing of bytes
> > > objects return ints was IMHO a mistake.
> >
> > Bytes objects are not ASCII strings, even though they can be used to
> > represent them.
>
> I'm talking about practice,
So am I, and so is Nick.
> not some idealistic view of the world.
> In many use cases (XML, HTML, e-mail headers, many other text-based
> protocols), you can get a mixture of ASCII "commands", and opaque
> binary stuff (which will or will not, depending on these "commands",
> have a meaningful unicode decoding).
Yeah, so what? Those protocol tokens are deliberately chosen to
resemble ASCII text, but you need to parse them out of the binary
sludge somehow, and the surrounding content remains binary sludge
until deserialized or (for text) decoded. How is having b[0] return a
bytes object, rather than an integer, going to help in that?
Especially if the value is not in the ASCII range?
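To make the indexing question concrete, here is a minimal sketch of what indexing and slicing a bytes object give you in Python 3. The request line and the variable names are illustrative assumptions, not taken from any particular protocol implementation.

```python
# Hypothetical protocol line; any ASCII-like token would do.
data = b"GET /index.html HTTP/1.1\r\n"

first_index = data[0]      # indexing yields an int
first_slice = data[0:1]    # slicing yields a bytes object of length 1

# Slices plus case conversion are enough to compare an ASCII-like token.
method = data[:3]
is_get = method.upper() == b"GET"

# Outside the ASCII range the value has no obvious textual meaning,
# whichever form it comes back in.
high = b"\xe9abc"
high_index = high[0]       # still just an int
high_slice = high[0:1]     # still just an opaque byte
```

Either way the token comparison goes through a slice, so the int returned by indexing never gets in the way of it.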
> > AFAICS, anything that should be done with ASCII-punned magic numbers
> > ("protocol tokens", if you prefer) can be done with slices and (ta-da!)
> > case conversion.
>
> So, basically, you're saying that we should remove useful functionality
No, that *was* Nick's position; I specifically opposed the suggestion
that "lower" and "upper" be removed, and he concurred after a bit of
thought. And remember, he's talking about removing "swapcase". Which
RFC defines a protocol where that would be useful? How about "title"?
> and tell people to reimplement an ad hoc version of it when they
> need it.
Of course not; I'm with Michael Foord on that: nobody should ever be
asked to reimplement swapcase! My position is simply that bytes are
not text, and the occasional reminder (such as b[0] returning an
integer, not a bytes object) is good. My experience has been that it
makes a lot of sense to layer these things, for example transforming a
protocol stream serialized as octets into a more structured object
composed of protocol tokens and payloads. It's *not* text, and the
relevant techniques are different.
It's like the old saw about "aha, I'll use regexps to solve this
problem!" and now you have *two* problems.
I don't advocate getting rid of regexps, and I don't advocate removing
methods from bytes (although I do dream about it occasionally). I do
advocate that people think twice before implementing complex text-like
algorithms on binary protocol streams. If the stream really is
text-like, then transform it into text of a known, well-behaved
encoding, and then apply the powerful text-processing facilities
provided for str. If it's not, then transform to a token stream or
whatever makes sense. In both cases, do as little "text processing"
on bytes objects as possible, and put more structure on the content as
soon as possible.
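As a rough sketch of that layering, assuming a simplified "Name: value" header line: the names Header, parse_header and decode_payload are hypothetical, and real e-mail headers need far more care than this (folding, MIME words, and so on).

```python
from typing import NamedTuple


class Header(NamedTuple):
    name: str      # protocol token, decoded as ASCII
    value: bytes   # payload, left as bytes until its encoding is known


def parse_header(line: bytes) -> Header:
    """Split one b"Name: value" line into a token and an opaque payload."""
    raw_name, sep, raw_value = line.rstrip(b"\r\n").partition(b":")
    if not sep:
        raise ValueError("no ':' separator in header line")
    # The token is ASCII by definition in this sketch, so a strict decode
    # fails loudly on anything else instead of guessing.
    name = raw_name.strip().decode("ascii").lower()
    return Header(name, raw_value.strip())


def decode_payload(header: Header, encoding: str) -> str:
    """Decode the payload only once the caller knows the encoding."""
    return header.value.decode(encoding)


# Assumed input: a UTF-8 payload behind an ASCII token.
header = parse_header(b"Subject: caf\xc3\xa9 menu\r\n")
subject = decode_payload(header, "utf-8")
```

Once subject is a str, the usual text methods apply to it; the only operations performed on the bytes were a partition and a strip.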
If you really need the efficiency, then do what you need to do. As I
say, I don't have any practical objection to keeping your tools for
that case. But such applications, although important (I guess), are a
minority.
> That sounds obnoxious.
Good advice almost always sounds obnoxious to the recipient.