[Python-Dev] Deprecating bytes.swapcase and friends [was: Maintenance burden of str.swapcase]

Stephen J. Turnbull stephen at xemacs.org
Wed Sep 7 06:36:26 CEST 2011


This is all speculation, with no hint of implementation at this point,
so I'm redirecting this subthread to Python-Ideas.  Reply-To set accordingly.

Nick Coghlan writes:

 > Heh, I knew as soon as I sent that message that someone would be able
 > to point out a counter example. I agree that RFC 822 (and
 > case-insensitive ASCII comparison in general) is enough to save
 > lower() and upper() and co, but what about this even further reduced
 > list of text-specific methods:
 > 
 >  'capitalize'
 >  'istitle'
 >  'swapcase'
 >  'title'
 > 
 > While case-insensitive comparison makes sense for wire level data,
 > where do these methods fit in, even when embedded ASCII text fragments
 > are involved?

Well, 'capitalize' could theoretically be used to "beautify" RFC 822
field names, but realistically, to me they're a litmus test for
packages I probably don't want on my system.<0.5 wink>
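
For concreteness, here is roughly what that "beautifying" would look
like on bytes. This is only a sketch; the field name and the variable
names are made up for the example. Note that 'capitalize' alone touches
only the first letter, so 'title' is what actually produces the
conventional header spelling.

```python
# Hypothetical wire-level field name, invented for this illustration.
raw_name = b"content-transfer-encoding"

# bytes.title() uppercases the first ASCII letter of every run of
# letters and lowercases the rest; hyphens act as word separators.
pretty = raw_name.title()
print(pretty)        # b'Content-Transfer-Encoding'

# bytes.capitalize() only uppercases the very first byte.
first_only = raw_name.capitalize()
print(first_only)    # b'Content-transfer-encoding'
```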

I don't know if it's worth the effort to deprecate them, though.
There is a school of thought (represented on python-dev by Phillip Eby
and Antoine Pitrou, among others, I would say) that says that text
with an implicit encoding is still text if you can figure out what the
encoding is, and the syntactically important tokens are invariably
ASCII, which often is enough information to do the work.  So if you
can do some operation without first converting to str, let's save the
cycles and the bytes (especially in bit-shoveling applications like
WSGI)!  I disagree, but "consenting adults" and all that.

It occurs to me that the bit-shoveling applications would generally be
sufficiently well-served with a special "codec" that just stuffs the
data pointer in a bytes object into the latin1 member of the data
pointer union in a PEP 393 Unicode object, and marks the Unicode
object as "ascii-compatible", i.e., anything ASCII can be manipulated as
text, but anything non-ASCII is like a private character that Python
doesn't know anything about, and can't do anything useful with, except
delete or pass through verbatim (perhaps as a slice).
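
The behavior described here (ASCII usable as text, everything else
opaque but passed through verbatim) can be approximated today with the
existing "surrogateescape" error handler from PEP 383. That is a
swapped-in stand-in, not the proposal itself: it copies the data rather
than sharing the buffer, so it shows only the semantics, not the
savings. The header line below is invented for the example.

```python
# Hypothetical header line with two non-ASCII bytes in the value.
raw = b"content-type: text/plain; name=\xff\xfe"

# Decode as ASCII; each non-ASCII byte becomes a lone surrogate
# (U+DC80..U+DCFF) that Python treats as an opaque character.
text = raw.decode("ascii", errors="surrogateescape")

# The ASCII part can be manipulated as ordinary text.
name, sep, value = text.partition(":")
text2 = name.title() + sep + value

# Encoding with the same handler restores the opaque bytes verbatim.
out = text2.encode("ascii", errors="surrogateescape")
print(out)  # b'Content-Type: text/plain; name=\xff\xfe'
```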

This may be nonsense; I don't know enough about Python internals to be
sure.  And it would be a change to PEP 393, since the encoding of the
8-bit representation would no longer be Unicode.  I wouldn't blame
Martin one bit if he hated the idea in principle!  On the other hand,
the "Latin-1 can be used to decode any binary content" end-around
makes that point moot IMO.  This would give a somewhat safer way of
doing that.
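
To spell out the Latin-1 end-around and why something safer might be
wanted, here is a minimal sketch (the sample values are made up).
Decoding as Latin-1 never fails and round-trips exactly, but Python then
believes the non-ASCII bytes are real Latin-1 characters and will
happily case-map them.

```python
# Every possible byte value, as a stand-in for arbitrary binary content.
payload = bytes(range(256))

# Latin-1 maps byte N to code point N, so this cannot fail and the
# round trip restores the original bytes exactly.
as_text = payload.decode("latin-1")
round_trip = as_text.encode("latin-1")
print(round_trip == payload)   # True

# The hazard: a text method treats byte 0xE9 as the letter e-acute
# and changes it, silently altering the binary data.
mangled = b"\xe9".decode("latin-1").upper().encode("latin-1")
print(mangled)                 # b'\xc9'
```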

But if it is feasible and a Pythonic implementation could be devised, that
would take much of the wind out of the sails of the "implicitly it's
ASCII text" crowd.  The whole "it's inefficient in time and space to
work with 'str'" argument goes away, leaving them with "it's verbose"
as the only reason for not doing the conversion.

I don't know if there would be any use case left for bytes at that
point ... but that's clearly a py4k discussion.

