[Python-ideas] a new bytestring type?
Andrew Barnert
abarnert at yahoo.com
Mon Jan 6 12:16:05 CET 2014
From: Nick Coghlan <ncoghlan at gmail.com>
Sent: Sunday, January 5, 2014 2:57 PM
>I actually expected someone to have experimented with an "encodedstr" type by now. This would be a type that behaved like the Python 2 str type, but had an encoding attribute. On encountering Unicode text strings, it would encode them appropriately.
I did something like this when I was first playing with 3.0, and I managed to find it.
I tried two different implementations, a bytes subclass that fakes being a str as well as possible by decoding on the fly (or, in some cases, by encoding its arguments on the fly), and a str that fakes being a bytes as well as possible by doing the opposite.
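To make the first approach concrete, here's a minimal sketch (the class name EBytes and its details are mine, not the actual ebytes from the repo): a bytes subclass that remembers its encoding and decodes on the fly when asked to behave like text.

```python
class EBytes(bytes):
    """A sketch of a bytes subclass that fakes being a str by
    decoding on the fly (no caching, as discussed below)."""

    def __new__(cls, data, encoding="utf-8"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __str__(self):
        # Decode on the fly rather than storing the text.
        return self.decode(self.encoding)

    def upper(self):
        # A "character" method: round-trip through text so that
        # non-ASCII characters are handled correctly.
        return EBytes(str(self).upper().encode(self.encoding),
                      self.encoding)


eb = EBytes("café".encode("utf-8"), "utf-8")
assert str(eb) == "café"
assert str(eb.upper()) == "CAFÉ"   # plain bytes.upper() would miss the é
```

The estr approach is the mirror image: a str subclass carrying an encoding, encoding its content on the fly when something wants bytes.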
>However, people have generally instead followed the model of decoding to text and operating in that domain, since it avoids a lot of subtle issues (like accidentally embedding byte order marks when concatenating strings).
It's also conceptually cleaner to work with text as text instead of as bytes that you can sort of use as text.
Also, one major reason people resist working with text (or upgrading to 3.x) is the perceived performance costs of dealing with Unicode. But if you want to do any kind of string processing on your text beyond searching for ASCII header names and the like, you pretty much have to do it as Unicode or it's wrong. So, you'd need something that allows you to do those ASCII header searches in 8-bit-land, but either doesn't allow full string processing, or automatically decodes and re-encodes on the fly (which obviously isn't going to be faster).
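A quick illustration of the split described above (plain stdlib, no special type needed): searching for ASCII delimiters is safe in 8-bit-land, but character-level processing done on raw bytes is silently wrong for anything outside ASCII.

```python
raw = "Content-Type: text/plain\r\nX-Note: café\r\n".encode("utf-8")

# Safe in 8-bit-land: the delimiter is pure ASCII.
header_end = raw.find(b"\r\n")
first_header = raw[:header_end]
assert first_header == b"Content-Type: text/plain"

# Not safe: bytes.upper() only uppercases ASCII, so the é's bytes
# pass through untouched; correct case mapping needs text.
assert b"caf\xc3\xa9".upper() == b"CAF\xc3\xa9"
assert "café".upper() == "CAFÉ"
```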
>This is likely encouraged by the fact that str, bytes and bytearray don't currently implement type coercion correctly (which in turn is due to a long standing bug in the way the abstract C API handles sequence types defined in C rather than Python), so an encodedstr type would need to inherit from str or bytes to get interoperability, and then wouldn't interoperate with the other one.
What's the bug? Anyway, I started off with the idea of inheriting from str or bytes in the first place because it seemed more natural than delegating, so I guess I didn't run into it.
In general, it seems like you can interoperate just fine; an ebytes or estr (the names of my two classes) can, e.g., find, format, join, radd, whatever a bytes, str, ebytes, or estr without a problem, returning the appropriate types.
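The mixed-type arithmetic is straightforward to sketch (again with a hypothetical EBytes, not the repo's actual code): the operators just encode any str argument with the instance's own encoding and hand back the enhanced type.

```python
class EBytes(bytes):
    """Sketch: bytes with an encoding, interoperating with both
    bytes and str in + and returning EBytes either way."""

    def __new__(cls, data, encoding="utf-8"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            other = other.encode(self.encoding)
        return EBytes(bytes(self) + other, self.encoding)

    def __radd__(self, other):
        if isinstance(other, str):
            other = other.encode(self.encoding)
        return EBytes(other + bytes(self), self.encoding)


assert EBytes(b"abc") + "def" == b"abcdef"   # str on the right
assert "x" + EBytes(b"y") == b"xy"           # str on the left, via __radd__
```

Note that `str + EBytes` works because str.__add__ returns NotImplemented for a non-str, so Python falls back to EBytes.__radd__.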
The problem is interacting with functions that explicitly want the other type. This includes C functions that, e.g., take a "U" parameter, like TextIOWrapper.write, but it's just as much of a problem with Python functions that check isinstance(str) (either to reject bytes, or to switch and do different things on bytes and str). So, you have to write things like "f.write(str(s))" instead of "f.write(s)" all over the place.
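The friction is easy to reproduce with a text stream from the stdlib (EBytes here is the same hypothetical sketch as above): no matter how str-like the type acts, anything that demands a real str rejects it.

```python
import io


class EBytes(bytes):
    """Sketch: str-faking bytes that a text stream still rejects."""

    def __new__(cls, data, encoding="utf-8"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __str__(self):
        return self.decode(self.encoding)


buf = io.StringIO()
eb = EBytes(b"hello")
try:
    buf.write(eb)          # a text stream wants an actual str
except TypeError:
    buf.write(str(eb))     # so the explicit conversion is unavoidable
assert buf.getvalue() == "hello"
```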
There's also a problem with functions that will take a str and do something useful, or take a bytes and do something stupid, like assume it must be in the appropriate encoding for the filesystem. An ebytes just looks like a bytes to such functions, and therefore does the wrong thing. Again, you have to do things like "open(str(s))"—and, if you don't, instead of an error you get silent mojibake. (Which I guess is a good simulation of the Python 2 str type after all…)
I couldn't find a way around the problem for ebytes. For estr, I fought for a while to make it support the buffer protocol (I wrote a Cython wrapper to let me delegate to another buffer from Python so I wouldn't have to write the whole thing in C), which fixes the problems with most C API functions, but doesn't help at all for Python functions.
Meanwhile, there are some design issues that aren't entirely clear.
The most obvious one is the performance issue I raised above. Should we cache the Unicode? Maybe even pre-compute it? I went with no caching just because it was the simplest implementation.
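The caching option would look something like this sketch (my names, not the repo's): decode lazily on first access and reuse the result, which is safe because the underlying bytes are immutable.

```python
class EBytes(bytes):
    """Sketch: cache the decoded text instead of decoding every time."""

    def __new__(cls, data, encoding="utf-8"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        self._text = None          # filled in on first access
        return self

    @property
    def text(self):
        # Decode once, then reuse; immutability makes this safe.
        if self._text is None:
            self._text = self.decode(self.encoding)
        return self._text


eb = EBytes("café".encode("utf-8"))
assert eb.text == "café"
assert eb.text is eb.text          # second access hits the cache
```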
Exactly which methods should act on bytes and which on characters? My initial cut was that searching-related methods like startswith, index, split, or replace should be byte-based, while things like casefold and zfill should be Unicode-based. The division isn't entirely clear, but it's something to start with. (I also considered switching on the types of the other arguments—e.g., replace would be byte-based when given a bytes or an ebytes of the same encoding, but Unicode-based when given a str or an ebytes of a different encoding—but that seemed overly complicated.)
Should indexing and iteration return numbers, as with bytes?
It's obvious what encode should do (transcode to an ebytes in a different encoding), but what about decode? (I left bytes.decode alone, but I think that was a bad choice; that makes it an inverse to a change_encoding function that reinterprets the bytes as a different encoding, rather than an inverse to encode.)
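The encode-as-transcode behavior, and the change_encoding function it contrasts with, might look like this (a sketch under the same hypothetical EBytes; change_encoding is the name used above, not an API from the repo):

```python
class EBytes(bytes):
    """Sketch: encode() transcodes; change_encoding() reinterprets."""

    def __new__(cls, data, encoding="utf-8"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def encode(self, encoding):
        # Transcode: decode with our own encoding, re-encode to the
        # new one. The raw bytes change; the text they mean doesn't.
        return EBytes(self.decode(self.encoding).encode(encoding),
                      encoding)

    def change_encoding(self, encoding):
        # Reinterpret: keep the raw bytes, relabel the encoding.
        # The bytes don't change; the text they mean does.
        return EBytes(bytes(self), encoding)


latin = EBytes("café".encode("latin-1"), "latin-1")
utf = latin.encode("utf-8")
assert bytes(utf) == "café".encode("utf-8")        # bytes changed
assert bytes(latin.change_encoding("utf-8")) == bytes(latin)  # bytes kept
```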
All that being said, just being able to use format or % with a mix of str and known-encoding-bytes is pretty handy.
Anyway, in case anyone wants to take a look at it, I can't find the Cython wrapper, so I dropped estr, but cleaned up ebytes and made sure it works with 3.3 and 3.4 and uploaded it to https://github.com/abarnert/ebytes. Please forgive the clunky way I wrote all the forwarding methods.