Deprecating bytes.swapcase and friends [was: Maintenance burden of str.swapcase]

This is all speculation and no hint of implementation at this point ... redirecting this subthread to Python-Ideas. Reply-To set accordingly.

Nick Coghlan writes:

Well, 'capitalize' could theoretically be used to "beautify" RFC 822 field names, but realistically, to me they're a litmus test for packages I probably don't want on my system. <0.5 wink>

I don't know if it's worth the effort to deprecate them, though.

There is a school of thought (represented on python-dev by Philip Eby and Antoine Pitrou, among others, I would say) that says that text with an implicit encoding is still text if you can figure out what the encoding is, and the syntactically important tokens are invariably ASCII, which often is enough information to do the work. So if you can do some operation without first converting to str, let's save the cycles and the bytes (especially in bit-shoveling applications like WSGI)! I disagree, but "consenting adults" and all that.

It occurs to me that the bit-shoveling applications would generally be sufficiently well-served with a special "codec" that just stuffs the data pointer in a bytes object into the latin1 member of the data pointer union in a PEP 393 Unicode object, and marks the Unicode object as "ascii-compatible", ie, anything ASCII can be manipulated as text, but anything non-ASCII is like a private character that Python doesn't know anything about, and can't do anything useful with, except delete or pass through verbatim (perhaps as a slice).

This may be nonsense; I don't know enough about Python internals to be sure. And it would be a change to PEP 393, since the encoding of the 8-bit representation would no longer be Unicode. I wouldn't blame Martin one bit if he hated the idea in principle! On the other hand, the "Latin-1 can be used to decode any binary content" end-around makes that point moot IMO. This would give a somewhat safer way of doing that.

But if feasible and a Pythonic implementation could be devised, that would take much of the wind out of the sails of the "implicitly it's ASCII text" crowd. The whole "it's inefficient in time and space to work with 'str'" argument goes away, leaving them with "it's verbose" as the only reason for not doing the conversion.

I don't know if there would be any use case left for bytes at that point ... but that's clearly a py4k discussion.
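For readers following along, here is a minimal sketch of the behaviour under discussion. It is illustrative only and not part of the original message; the variable names are made up for the example. It shows the four bytes methods acting on an RFC 822-style field name, and how the "decode as Latin-1" workaround differs from operating on bytes directly once a non-ASCII byte is involved.

```python
# The four methods discussed in this thread, applied to an ASCII field name.
field = b"content-type"

capitalized = field.capitalize()   # only the first byte is upper-cased
titled = field.title()             # each run of ASCII letters is capitalized
swapped = field.swapcase()         # ASCII letters flip case
is_title = titled.istitle()

# bytes methods only know about ASCII letters; other bytes pass through.
raw = b"caf\xe9"                   # 0xE9 would be "é" in Latin-1
bytes_swapped = raw.swapcase()

# Decoding as Latin-1 never fails and round-trips every byte value ...
every_byte = bytes(range(256))
round_trip = every_byte.decode("latin-1").encode("latin-1")

# ... but str methods then treat 0xE9 as a real letter and change it.
str_swapped = raw.decode("latin-1").swapcase().encode("latin-1")

print(capitalized, titled, swapped, is_title)
print(bytes_swapped, str_swapped, round_trip == every_byte)
```

The last comparison is the point of the "ascii-compatible" idea above: with plain Latin-1 decoding, text operations can silently alter bytes whose real encoding is unknown, whereas the bytes methods leave them alone.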

On Wed, Sep 7, 2011 at 2:36 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
> I don't know if it's worth the effort to deprecate them, though.

I could live with a purely documentation based deprecation, although I'd prefer to *actually* deprecate at least those four methods on bytes and bytearray objects (since we switched mailing lists, reproducing the list for reference: 'capitalize', 'istitle', 'swapcase', 'title').

FWIW, I actually used to be in that school myself, *until* I took on the task of making more of the urllib.parse APIs take a polymorphic bytes-in-bytes-out, str-in-str-out approach for 3.2. The difference in complexity between the "right" way (i.e. decoding with the ascii codec, manipulating as Unicode, encoding back to bytes with the ascii codec) and a hackier approach that tried to manipulate the bytes directly was such that I didn't even end up benchmarking the two approaches to decide between them - I ended up having zero interest in attempting to maintain the latter version, so the implicit decode/encode is the version that went into the release.

That experience pushed me solidly in the direction of arbitrary fast ASCII text manipulation without encoding/decoding overhead in Python 3 being a task for a third party type - neither bytes nor str fit the bill. To be really effective, such a type either needs algorithms dedicated to using it, so that all the associated 'literals' are predefined as objects of the relevant type and don't need to worry about handling actual strings being passed in, or else it needs to transparently interoperate with builtin str objects.

The potential viability and utility of such a tagged string type, however, isn't a particularly strong argument for anything relating to the bytes API - it's pretty clear that Guido's plan to break the 8-bit-data-as-text paradigm in Python 3 has succeeded to that extent.

Regards,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
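
As a rough illustration of the "right" way described above (decode with the ascii codec, manipulate as str, encode back), here is a minimal sketch. It is not the urllib.parse implementation; `_coerce` and `lower_scheme` are hypothetical names invented for this example.

```python
def _coerce(arg):
    """Return (text, restore) for a str or bytes-like argument.

    Hypothetical helper: `text` is always a str, and `restore` converts a
    str result back to the caller's original type.
    """
    if isinstance(arg, str):
        return arg, lambda s: s
    # Strict ascii decoding: non-ASCII input raises UnicodeDecodeError
    # instead of being silently misinterpreted.
    return arg.decode("ascii"), lambda s: s.encode("ascii")


def lower_scheme(url):
    """Lower-case the part before the first ':' (str in, str out; bytes in, bytes out)."""
    text, restore = _coerce(url)
    scheme, sep, rest = text.partition(":")
    return restore(scheme.lower() + sep + rest)


print(lower_scheme("HTTP://Example.com/A"))
print(lower_scheme(b"HTTP://Example.com/A"))
```

The manipulation itself is written once, against str; only the boundary handling knows about bytes. That is the maintenance trade-off being described, at the cost of a decode/encode pair per call.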

TBH, your experience showed that trying to write "polymorphic" code manipulating either-str-or-bytes-meaning-text is too ugly to care. I don't know if the same is true if one were to just set out to manipulate bytes-meaning-text.

FWIW, I haven't changed my mind on swapcase -- I regret it, but (despite acknowledging your experience) value the consistency more than the cost of implementing it. I could live with deprecating it across the board, if only to ease life for PyPy and others.

--Guido

On Tue, Sep 6, 2011 at 10:26 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:

--
--Guido van Rossum (python.org/~guido)

Participants (3): Guido van Rossum, Nick Coghlan, Stephen J. Turnbull