Is this really a good idea?  PEP 460 proposes semantics for bytes.format and the bytes % operator that are rather different from the str versions.  I think this is going to be both confusing and a continuous target for "further improvement" until the two implementations converge.

Nick Coghlan writes:
 > I still don't think the 2.x bytestring is inherently evil, it's just
 > the wrong type to use as the core text type because of the problems
 > it has with silently creating mojibake and also with multi-byte
 > codecs and slicing.  The current python-ideas thread is close to
 > convincing me even a stripped down version isn't a good idea,
 > though :P
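(The slicing problem Nick mentions is easy to demonstrate in current Python: indexing a bytes object by byte position can cut a multi-byte UTF-8 sequence in half.  A minimal sketch:)

```python
# Slicing bytes can split a multi-byte UTF-8 sequence, silently
# producing invalid data -- the mojibake risk Nick refers to.
data = "café".encode("utf-8")   # 5 bytes for 4 characters
head = data[:4]                 # cuts the 2-byte 'é' in half
try:
    head.decode("utf-8")
except UnicodeDecodeError:
    print("truncated multi-byte sequence")
```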
Lack of it is obviously a major pain point for many developers, but -- it is inherently evil.  It's a structured data type passed around as an unstructured blob of memory, with no way for one part of the program to determine what (if anything) another part of the program thinks it's doing.  It's the Python equivalent of the pointer type aliasing that gcc likes to whine about.

Given that most wire protocols that benefit from this kind of thing are based on ASCII-coded commands and parameters, I think there's a better alternative to either adding 2.x bytestrings as a separate type or to PEP 460.  That is to add a (minimal) structure we could call an "ASCII-compatible byte array" to the current set of Unicode representations.  The detailed proposal is on -ideas (where I call it the "7-bit representation", but that name has already caused misunderstanding).

This representation would treat non-ASCII bytes the way the current representations treat bytes encoded as surrogates.  It would be produced only by a special "ascii-compatible" codec (which implies surrogateescape-like behavior).  It has the following advantages for bytestring-type processing:

 - double encoding/decoding is not possible
 - uninterpreted bytes are marked as such -- they can be compared for
   equality, but other character manipulations are no-ops
 - the representation is efficient
 - output via the 'ascii-compatible' codec is just memcpy
 - input via the 'ascii-compatible' codec is reasonably efficient (in
   the posted proposal detection of non-ASCII bytes is required, so it
   cannot be just memcpy)
 - str operations are all available; compared to str, additional
   overhead is imposed only on I/O

There's one other possible advantage that I haven't thought through yet: compatibility with 2.x literals (e.g., "inputstring.find('To:')" instead of "inputbytes.find(b'To:')").
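(The proposed codec doesn't exist, but today's surrogateescape error handler already gives the intended semantics, so the idea can be sketched with it: non-ASCII bytes survive as lone surrogates, str operations work on the ASCII-coded parts, and re-encoding is a lossless round trip.)

```python
# Emulating the proposed "ascii-compatible" behavior with the
# existing surrogateescape error handler.
raw = b"To: alice\r\n\xff\xfe payload"
text = raw.decode("ascii", errors="surrogateescape")

print(text.find("To:"))   # 2.x-style search with a str literal
back = text.encode("ascii", errors="surrogateescape")
print(back == raw)        # uninterpreted bytes round-trip unchanged
```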
It probably does impose overhead compared to bytes, especially with the restricted functionality Victor proposes for .format() on bytes, but, as Victor points out, so does any full-featured string-style processing vs. low-level operations like .join().  I suppose it would be acceptable, except possibly for the extra copying on I/O.

The main disadvantage is additional complexity in the implementation of the str type.  I don't think it imposes much runtime overhead, however, since the checks for different representations when operating on str must be done anyway.  Operations involving "ascii-compatible" and other representations at the same time should be rare, except for combinations of "ascii-compatible" and 8-bit representations -- and those just involve copying bytes, as between two 8-bit strings, plus a bit of logic to set the type correctly.

Steve