a new bytestring type?
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'. How many would be interested in having a 'bytestring'? What do you see as the distinguishing characteristics? -- ~Ethan~
How would you see this bytestring type as differentiating itself from
bytes? What use cases do you envision?
On Sun Jan 05 2014 at 11:56:46 AM, Ethan Furman
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
What do you see as the distinguishing characteristics?
-- ~Ethan~ _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On 01/05/2014 11:58 AM, Amber Yust wrote:
How would you see this bytestring type as differentiating itself from bytes? What use cases do you envision?
I put the questions there so others could fill in the blanks for themselves. I have responded to the original question with two of the differentiating features (the two that bug me most, of course ;). -- ~Ethan~
On 01/05/2014 11:33 AM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
+1
What do you see as the distinguishing characteristics?
Indexing returns a bytestring of length 1, not an integer
`bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
-- ~Ethan~
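For reference, the two behaviours of the existing bytes type that these points react to can be seen in a few lines. This is only a sketch of what Python 3 does today; the variable names are made up for illustration.

```python
data = b"\x07abc"

# Indexing a bytes object yields an int; a length-1 slice yields bytes.
first_as_int = data[0]
first_as_bytes = data[0:1]

# Passing an int to bytes() means "that many zero bytes" ...
seven_zeros = bytes(7)
# ... while a one-element iterable gives the single byte with that value.
one_byte = bytes([7])

print(first_as_int, first_as_bytes, seven_zeros, one_byte)
```

Slicing rather than indexing is the usual way to get a length-1 bytes object out of the current type.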
On Jan 5, 2014, at 12:04 PM, Ethan Furman
On 01/05/2014 11:33 AM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
+1
"I don't always +1 on python-ideas, but when I do, I do it on my own posts." ;) -- Best regards, Łukasz Langa WWW: http://lukasz.langa.pl/ Twitter: @llanga IRC: ambv on #python-dev
On 01/05/2014 12:30 PM, Łukasz Langa wrote:
On Jan 5, 2014, at 12:04 PM, Ethan Furman
wrote: On 01/05/2014 11:33 AM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
+1
"I don't always +1 on python-ideas, but when I do, I do it on my own posts."
+1 QOTW !
On 05Jan2014 12:51, Ethan Furman
On 01/05/2014 12:30 PM, Łukasz Langa wrote:
On Jan 5, 2014, at 12:04 PM, Ethan Furman
wrote: On 01/05/2014 11:33 AM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
+1
"I don't always +1 on python-ideas, but when I do, I do it on my own posts."
+1 QOTW !
+1 QOTW
... but doesn't your +1 falsify the quote you're +1ing?
--
Cameron Simpson
On 01/05/2014 05:29 PM, Cameron Simpson wrote:
On 05Jan2014 12:51, Ethan Furman
wrote: On 01/05/2014 12:30 PM, Łukasz Langa wrote:
On Jan 5, 2014, at 12:04 PM, Ethan Furman
wrote: On 01/05/2014 11:33 AM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
+1
"I don't always +1 on python-ideas, but when I do, I do it on my own posts."
+1 QOTW !
+1 QOTW
... but doesn't your +1 falsify the quote you're +1ing?
Hrmmm.... well, just in case: +2!
On Sun, 05 Jan 2014 12:04:20 -0800
Ethan Furman
On 01/05/2014 11:33 AM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
+1
What do you see as the distinguishing characteristics?
Indexing returns a bytestring of length 1, not an integer
`bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
I agree with that, but it's much too late, and I'm -10 on adding another, similar but different, bytestring type. Regards Antoine.
On 6 Jan 2014 03:56, "Ethan Furman"
As anyone who has worked with Python 3 and low-level protocols knows,
Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
What do you see as the distinguishing characteristics?
I actually expected someone to have experimented with an "encodedstr" type by now. This would be a type that behaved like the Python 2 str type, but had an encoding attribute. On encountering Unicode text strings, it would encode them appropriately.
However, people have generally instead followed the model of decoding to text and operating in that domain, since it avoids a lot of subtle issues (like accidentally embedding byte order marks when concatenating strings).
This is likely encouraged by the fact that str, bytes and bytearray don't currently implement type coercion correctly (which in turn is due to a long-standing bug in the way the abstract C API handles sequence types defined in C rather than Python), so an encodedstr type would need to inherit from str or bytes to get interoperability, and then wouldn't interoperate with the other one.
Cheers, Nick.
From: Nick Coghlan
I actually expected someone to have experimented with an "encodedstr" type by now. This would be a type that behaved like the Python 2 str type, but had an encoding attribute. On encountering Unicode text strings, it would encode them appropriately.
I did something like this when I was first playing with 3.0, and I managed to find it. I tried two different implementations, a bytes subclass that fakes being a str as well as possible by decoding on the fly (or, in some cases, by encoding its arguments on the fly), and a str that fakes being a bytes as well as possible by doing the opposite.
However, people have generally instead followed the model of decoding to text and operating in that domain, since it avoids a lot of subtle issues (like accidentally embedding byte order marks when concatenating strings).
It's also conceptually cleaner to work with text as text instead of as bytes that you can sort of use as text. Also, one major reason people resist working with text (or upgrading to 3.x) is the perceived performance costs of dealing with Unicode. But if you want to do any kind of string processing on your text beyond searching for ASCII header names and the like, you pretty much have to do it as Unicode or it's wrong. So, you'd need something that allows you to do those ASCII header searches in 8-bit-land, but either doesn't allow full string processing, or automatically decodes and re-encodes on the fly (which obviously isn't going to be faster).
This is likely encouraged by the fact that str, bytes and bytearray don't currently implement type coercion correctly (which in turn is due to a long standing bug in the way the abstract C API handles sequence types defined in C rather than Python), so an encodedstr type would need to inherit from str or bytes to get interoperability, and then wouldn't interoperate with the other one.
What's the bug?
Anyway, I started off with the idea of inheriting from str or bytes in the first place because it seemed more natural than delegating, so I guess I didn't run into it. In general, it seems like you can interoperate just fine; an ebytes or estr (the names of my two classes) can, e.g., find, format, join, radd, whatever a bytes, str, ebytes, or estr without a problem, returning the appropriate types.
The problem is interacting with functions that explicitly want the other type. This includes C functions that, e.g., take a "U" parameter, like TextIOWrapper.write, but it's just as much of a problem with Python functions that check isinstance(str) (either to reject bytes, or to switch and do different things on bytes and str). So, you have to write things like "f.write(str(s))" instead of "f.write(s)" all over the place.
There's also a problem with functions that will take a str and do something useful, or take a bytes and do something stupid, like assume it must be in the appropriate encoding for the filesystem. An ebytes just looks like a bytes to such functions, and therefore does the wrong thing. Again, you have to do things like "open(str(s))", and if you don't, instead of an error you get silent mojibake. (Which I guess is a good simulation of the Python 2 str type after all…)
I couldn't find a way around the problem for ebytes. For estr, I fought for a while to make it support the buffer protocol (I wrote a Cython wrapper to let me delegate to another buffer from Python so I wouldn't have to write the whole thing in C), which fixes the problems with most C API functions, but doesn't help at all for Python functions.
Meanwhile, there are some design issues that aren't entirely clear. The most obvious one is the performance issue I raised above. Should we cache the Unicode? Maybe even pre-compute it? I went with no caching just because it was the simplest implementation.
Exactly which methods should act on bytes and which on characters? My initial cut was that searching-related methods like startswith, index, split, or replace should be bytes, while things like casefold and zfill should be Unicode. The division isn't entirely clear, but it's something to start with. (I also considered switching on the types of the other arguments, e.g., replace would be byte-based when given a bytes or an ebytes of the same encoding, but Unicode-based when given a str or an ebytes of a different encoding, but that seemed overly complicated.)
Should indexing and iteration return numbers, as with bytes?
It's obvious what encode should do (transcode to an ebytes in a different encoding), but what about decode? (I left bytes.decode alone, but I think that was a bad choice; that makes it an inverse to a change_encoding function that reinterprets the bytes as a different encoding, rather than an inverse to encode.)
All that being said, just being able to use format or % with a mix of str and known-encoding-bytes is pretty handy.
Anyway, in case anyone wants to take a look at it, I can't find the Cython wrapper, so I dropped estr, but cleaned up ebytes and made sure it works with 3.3 and 3.4 and uploaded it to https://github.com/abarnert/ebytes. Please forgive the clunky way I wrote all the forwarding methods.
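To make the shape of such a type concrete, here is a very small sketch of a bytes subclass that remembers its encoding. It is not the code from the repository above: the class name EBytes, the _coerce helper, and the choice of which methods to override are all assumptions made for illustration, following the "searching is byte-based, casefold is Unicode-based" split described in the message.

```python
class EBytes(bytes):
    """Hypothetical bytes subclass that carries an encoding attribute."""

    def __new__(cls, data, encoding="utf-8"):
        # Accept either text (encoded on the way in) or raw bytes.
        if isinstance(data, str):
            data = data.encode(encoding)
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def _coerce(self, other):
        # Encode str arguments on the fly; pass everything else through.
        return other.encode(self.encoding) if isinstance(other, str) else other

    def startswith(self, prefix, *args):
        # Searching-style method: operate on the bytes.
        return super().startswith(self._coerce(prefix), *args)

    def casefold(self):
        # Text-style method: decode, operate on Unicode, re-encode.
        text = bytes(self).decode(self.encoding)
        return EBytes(text.casefold(), self.encoding)

    def __str__(self):
        return self.decode(self.encoding)


s = EBytes("Straße", "latin-1")
print(s.startswith("Str"), str(s), s.casefold())
# The interoperability problem described above: s is still not a str,
# so code that checks isinstance(x, str) needs an explicit str(s).
print(isinstance(s, str))
```

Even this toy version shows the trade-off: every text-style method pays for a decode and an encode, and nothing here helps with functions that insist on a real str.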
On 6 Jan 2014 19:16, "Andrew Barnert"
From: Nick Coghlan
Sent: Sunday, January 5, 2014 2:57 PM
I actually expected someone to have experimented with an "encodedstr" type by now. This would be a type that behaved like the Python 2 str type, but had an encoding attribute. On encountering Unicode text strings, it would encode them appropriately.
I did something like this when I was first playing with 3.0, and I
managed to find it.
I tried two different implementations, a bytes subclass that fakes being
a str as well as possible by decoding on the fly (or, in some cases, by encoding its arguments on the fly), and a str that fakes being a bytes as well as possible by doing the opposite.
However, people have generally instead followed the model of decoding to
text and operating in that domain, since it avoids a lot of subtle issues (like accidentally embedding byte order marks when concatenating strings).
It's also conceptually cleaner to work with text as text instead of as
bytes that you can sort of use as text.
Also, one major reason people resist working with text (or upgrading to
3.x) is the perceived performance costs of dealing with Unicode. But if you want to do any kind of string processing on your text beyond searching for ASCII header names and the like, you pretty much have to do it as Unicode or it's wrong. So, you'd need something that allows you to do those ASCII header searches in 8-bit-land, but either doesn't allow full string processing, or automatically decodes and re-encodes on the fly (which obviously isn't going to be faster).
This is likely encouraged by the fact that str, bytes and bytearray
don't currently implement type coercion correctly (which in turn is due to a long standing bug in the way the abstract C API handles sequence types defined in C rather than Python), so an encodedstr type would need to inherit from str or bytes to get interoperability, and then wouldn't interoperate with the other one.
What's the bug?
http://bugs.python.org/issue11477 CPython doesn't check for NotImplemented results from sq_concat or sq_repeat, so the sequence implementations raise TypeError directly and the RHS doesn't get consulted to see if it can handle the operation. Subclassing works anyway because subclasses are always checked first even when they're the RHS. Thanks for the info on your experiences with attempting to implement an encodedstr type. I still feel there is potential merit to the concept, but it's certainly going to take some thought. Cheers, Nick.
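The protocol being referred to can be illustrated at the Python level. The sketch below does not reproduce the C-level behaviour tracked in the issue; it only shows, with made-up class names, what "returning NotImplemented so the right-hand side gets consulted" and "subclasses are checked first" mean for ordinary Python classes.

```python
class Left:
    def __add__(self, other):
        # Decline the operation so Python tries the right operand.
        return NotImplemented


class Right:
    def __radd__(self, other):
        return "Right.__radd__"


class Base:
    def __add__(self, other):
        return "Base.__add__"


class Sub(Base):
    def __radd__(self, other):
        return "Sub.__radd__"


# Left declines, so Right.__radd__ handles the addition.
fallback_result = Left() + Right()

# Sub is a subclass of Base and provides its own reflected method,
# so it is tried before Base.__add__ even though it is on the right.
subclass_result = Base() + Sub()

print(fallback_result, subclass_result)
```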
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
"arrays of integers"? You mean, unsigned short ints? There's an important difference. One references an abstraction, and one references a concrete machine type. The other consideration is knowing what you mean by "string", if you mean something to be interpreted textually, then the convention is to use unsigned chars to document your intentions, which "technically" is the same (as far as memory layout is concerned). (I say "technically" because there is some space reserved for endian-ness which can change the bit ordering.)
How many would be interested in having a 'bytestring'?
What do you see as the distinguishing characteristics?
What it *should* have is a bytes type: a raw, 8-bit type which may or may not be printable on the screen with quotation marks. Different subtypes, such as class Text(bytes), can interpret those bytes as they want (as a text string, for example, with or without formatting awareness for control codes). Otherwise, File(bytes) can interpret those bytes as binary data, so as to write to the file system without any transformation of the codes (i.e. raw). I'm afraid this reply may not be up to the standards of the list, but hopefully it has some useful data that has gone without good understanding. MarkJ
"arrays of integers"? You mean, unsigned short ints? There's an important difference. One references an abstraction, and one references a concrete machine type.
The other consideration is knowing what you mean by "string", if you mean something to be interpreted textually, then the convention is to use unsigned chars to document your intentions, which "technically" is the same (as far as memory layout is concerned). (I say "technically" because there is some space reserved for endian-ness which can change the bit ordering.)
One mistake I already wish to correct is in the last sentence: "endian-ness" *always* changes or refers to the bit ordering. Secondly, the term only applies to numerical (always integer, AFAIK) representation -- not for chars. Trying to be complete... MarkJ
"arrays of integers"? You mean, unsigned short ints? There's an important difference. One references an abstraction, and one references a concrete machine type.
The other consideration is knowing what you mean by "string", if you mean something to be interpreted textually, then the convention is to use unsigned chars to document your intentions, which "technically" is the same (as far as memory layout is concerned). (I say "technically" because there is some space reserved for endian-ness which can change the bit ordering.)
One mistake I already wish to correct ... Trying to be complete...
Come to think of it, this issue (the relationship between bytes, text, and char/ints) may be the entire reason Python3 "uptake" hasn't happened. It gets back to the same old argument I've been trying to make about "models of computation". Python3 apparently did not respect the machine and went the way of the "dark side", hence scientific computing hasn't been as quick to convert to Python 3. Specifically, the final issue with regard to bytes (and its consequent model of computation) is thus: 1) how they maintain representation on the file system (the "disk") vs. 2) how they are represented and managed in memory. This is the primary articulation point regarding how the *abstraction of computing* relates to its *implementation*. This also relates to the Turing Machine and its articulation with the underlying von Neumann architecture (implementation). Ned, I hope you're finally understanding this. MarkJ
On 1/5/14 9:00 PM, Mark Janssen wrote:
"arrays of integers"? You mean, unsigned short ints? There's an important difference. One references an abstraction, and one references a concrete machine type.
The other consideration is knowing what you mean by "string", if you mean something to be interpreted textually, then the convention is to use unsigned chars to document your intentions, which "technically" is the same (as far as memory layout is concerned). (I say "technically" because there is some space reserved for endian-ness which can change the bit ordering.) One mistake I already wish to correct ... Trying to be complete... Come to think of it, this issue (the relationship between bytes, text, and char/ints) may be the entire reason Python3 "uptake" hasn't happened. It gets back to the same old argument I've been trying to make about "models of computation". Python3 apparently did not respect the machine and went the way of the "dark side", hence scientific computing hasn't been as quick to convert to Python 3.
Specifically, the final issue with regard to bytes (and its consequent model of computation) is thus: 1) how they maintain representation on the file system (the "disk") vs. 2) how they are represented and managed in memory. This is the primary articulation point regarding how the *abstraction of computing* relates to its *implementation*. This also relates to the Turing Machine and its articulation with the underlying von Neumann architecture (implementation).
Ned, I hope you're finally understanding this.
Mark, I think you are confusing my posts in Python-List with this thread. I would rather you didn't address me: my interactions with you in the past have been unpleasant, especially where we've tried to get to the bottom of one of your typically obscure references to the theory of computation. You've mocked and ignored me when I've tried to treat your ideas with respect, so I'm not going to make that mistake again. --Ned.
MarkJ
Ethan Furman writes:
How many would be interested in having a 'bytestring'?
-1. It's an attractive nuisance.
What do you see as the distinguishing characteristics?
Its main attraction is that it allows people who in practice only ever deal with one non-Unicode encoding to ignore the fact that their data is in fact encoded, and that their applications are very likely not robust to data encoded differently. While I sympathize with their problem to some extent (especially people who are writing low-level web services), I don't think you'd ever again be able to trust a third-party module in a web context without doing a thorough audit to ensure that all uses of 'bytestrings' are appropriate in themselves and appropriately guarded against leaking garbage into other contexts. Thus, "attractive nuisance."
On Sun, Jan 5, 2014 at 8:33 PM, Ethan Furman
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
I'm not missing a new type, but I am missing the format method on the binary types. Regards, Geert
Geert Jansen writes:
I'm not missing a new type, but I am missing the format method on the binary types.
I'm curious about precisely what your use cases are, and just what formatting they need. The problem that Python 2 code has over and over imposed on me is that the temptation to avoid the overhead of conversion to and then from unicode when processing text by just using str results in the equivalent of
    bs1 = returns_a_bytestring_encoded_in_utf8()
    bs2 = returns_a_bytestring_encoded_in_koi8()
    bs3 = b'{0} {1}'.format(bs1, bs2)
    # and lose big when something expects valid UTF-8 in bs3
In low-level code, the assignments to bs1, bs2, and bs3 are likely to be in three separate contexts, even three separate modules. I understand about consenting adults, but it's just too hard to enforce good practice here if you make it easy to pass around and operate on encoded bytestrings. I don't see how you avoid this pitfall, except by making it easier to pass around Unicode than encoded strings. And given that encoding and decoding are unavoidable, that means making use of bytestrings with text semantics painful.
So to answer my question from my own point of view, for example, I would have no problem at all with
    b'{0:c}'.format(27) == b'\x1b'    # insert an ASCII ESC character
I would be leery of
    b'{0:s}'.format(b'\x1b[M') == b'\x1b[M'    # insert an ANSI control sequence
for the reason given above (for this use case, I would prefer
    blue_code = ord('M')    # Or b'M', doesn't matter!
    b'\x1b[{0:c}'.format(blue_code) == b'\x1b[M'
-- and forgive me for not looking up my ANSI color sequences, it's only luck if that's close) and I would consider
    b'{0:d}'.format(27) == b'27'    # insert the ASCII representation
to be an abomination since there's no reason to suppose that any given bytestring is encoded in an ASCII-compatible way, or bigendian for that matter. Ditto everything else that involves representing a number as a string of numeric characters.
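The pitfall with bs1, bs2, and bs3 can be reproduced with the bytes type as it exists, using join in place of the hypothetical bytes.format. The two helper functions and their sample strings below are stand-ins invented for this sketch.

```python
def returns_a_bytestring_encoded_in_utf8():
    # Stand-in for data arriving from one module, already UTF-8 encoded.
    return "привет".encode("utf-8")


def returns_a_bytestring_encoded_in_koi8():
    # Stand-in for data arriving from another module, KOI8-R encoded.
    return "мир".encode("koi8-r")


bs1 = returns_a_bytestring_encoded_in_utf8()
bs2 = returns_a_bytestring_encoded_in_koi8()
bs3 = b" ".join([bs1, bs2])  # nothing stops the two encodings from mixing

try:
    bs3.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # A consumer that expects valid UTF-8 only finds out here.
    decoded_ok = False

# Inserting a single byte by value is already possible without formatting.
esc = bytes([27])
blue_sequence = b"\x1b[" + bytes([ord("M")])

print(decoded_ok, esc, blue_sequence)
```

The concatenation itself succeeds silently; the failure only surfaces later, possibly in a different module, which is the point being made above.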
On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull
I'm not missing a new type, but I am missing the format method on the binary types.
I'm curious about precisely what your use cases are, and just what formatting they need.
One use case I came across was when creating chunks for the HTTP chunked encoding. Chunks contain an ASCII header, a raw/encoded chunk body, and an ASCII trailer. Using a bytes.format, it would look like this:
    chunk = b'{0:X}\r\n{1}\r\n'.format(len(buf), buf)
This is what I am using now:
    chunk = bytearray()
    chunk.extend('{0:X}\r\n'.format(len(buf)).encode('ascii'))
    chunk.extend(buf)
    chunk.extend('\r\n'.encode('ascii'))
Regards, Geert
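As a runnable variant of the chunk construction above, the sketch below wraps it in two small functions. The function names are invented here. The second function assumes Python 3.5 or later, where bytes gained %-style formatting; that feature did not exist when this thread was written, and bytes still has no format method.

```python
def make_chunk(buf):
    """Build one HTTP chunk: hex length, CRLF, body, CRLF."""
    header = "{0:X}\r\n".format(len(buf)).encode("ascii")
    return b"".join([header, buf, b"\r\n"])


def make_chunk_percent(buf):
    """Same result using bytes %-formatting (assumes Python 3.5+)."""
    return b"%X\r\n%b\r\n" % (len(buf), buf)


chunk = make_chunk(b"hello")
print(chunk)
print(make_chunk_percent(b"hello") == chunk)
```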
The problem that Python 2 code has over and over imposed on me is that the temptation to avoid the overhead of conversion to and then from unicode when processing text by just using str results in the equivalent of
bs1 = returns_a_bytestring_encoded_in_utf8() bs2 = returns_a_bytestring_encoded_in_koi8()
bs3 = b'{0} {1}'.format(bs1, bs2) # and lose big when something expects valid UTF-8 in bs3
In low-level code, the assignments to bs1, bs2, and bs3 are likely to be in three separate contexts, even three separate modules. I understand about consenting adults, but it's just too hard to enforce good practice here if you make it easy to pass around and operate on encoded bytestrings. I don't see how you avoid this pitfall, except by making it easier to pass around Unicode than encoded strings. And given that encoding and decoding are unavoidable, that means making use of bytestrings with text semantics painful.
So to answer my question from my own point of view, for example, I would have no problem at all with
b'{0:c}'.format(27) == b'\x1b' # insert an ASCII ESC character
I would be leery of
b'{0:s}'.format(b'\x1b[M') == b'\x1b[M' # insert a ANSI control sequence
for the reason given above (for this use case, I would prefer
blue_code = ord('M') # Or b'M', doesn't matter! b'\x1b[{0:c}'.format(blue_code) == b'\x1b[M'
-- and forgive me for not looking up my ANSI color sequences, it's only luck if that's close) and I would consider
b'{0:d}'.format(27) == b'27' # insert the ASCII representation
to be an abomination since there's no reason to suppose that any given bytestring is encoded in an ASCII-compatible way, or bigendian for that matter. Ditto everything else that involves representing a number as a string of numeric characters.
I didn't receive Stephen's email, so forgive me for replying through a reply…
From: Geert Jansen
On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull
wrote:
I'm not missing a new type, but I am missing the format method on the binary types.
I'm curious about precisely what your use cases are, and just what formatting they need.
Besides Geert's chunked HTTP example, there are tons of internet protocols and file formats (including Python source code!) that have ASCII headers (that in some way define an encoding for the actual payload). So things like b'Content-Length: {}'.format(len(payload)) or even b'Content-Type: text/html; charset={}'.format(encoding) are useful.
… I would consider
b'{0:d}'.format(27) == b'27' # insert the ASCII representation
to be an abomination since there's no reason to suppose that any given bytestring is encoded in an ASCII-compatible way, or bigendian for that matter. Ditto everything else that involves representing a number as a string of numeric characters.
Endianness isn't relevant here; b'{}'.format(32768) is b'32768', not b'\x80\x00' or b'\x00\x80'. That's what the d format means.
As for assuming that it's ASCII-compatible, again, there are all kinds of protocols that work with any ASCII-compatible charset but don't work otherwise. Yeah, this can be a problem if you want to create an HTTP page or a Python source file in EBCDIC or UTF-16-LE, but even then, if the headers are interpreted as pure ASCII and then the payload is extracted and decoded separately, it still works. In fact, it works better than if people try to construct everything as text and then encode, giving you illegal/unreadable EBCDIC headers, and this is a common incorrect workaround that Python 2-familiar people do when forced to deal with Python 3.
Obviously you could solve most of the same problems by formatting the headers as text, encoding them to ASCII, then concatenating the payload. And I'm not really worried about performance issues with that. But I am worried about convenience and readability: compare the desired and actual versions of Geert's code.
As I said in my other email, I might be happy assuming ASCII-strict for everything that isn't a buffer, and copying bytes as-is for everything that is. That _might_ be more of an attractive nuisance than a useful feature, but… it definitely is attractive, and I'm not sure it's a nuisance.
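The "format the headers as text, encode them to ASCII, then concatenate the payload" approach can be sketched as follows. The function name build_message and the particular header set are assumptions for illustration, not a real HTTP implementation.

```python
def build_message(text, charset):
    """ASCII headers followed by a payload in an arbitrary charset."""
    payload = text.encode(charset)
    headers = (
        "Content-Type: text/plain; charset={}\r\n"
        "Content-Length: {}\r\n"
        "\r\n"
    ).format(charset, len(payload)).encode("ascii")
    return headers + payload


msg = build_message("héllo", "utf-16-le")

# A reader can parse the headers as pure ASCII, then decode the payload
# separately with the charset the headers announce.
head, _, body = msg.partition(b"\r\n\r\n")
print(head.decode("ascii"))
print(body.decode("utf-16-le"))
```

Encoding the whole message as UTF-16-LE instead would turn the header lines into two-byte characters that an ASCII header parser could not read, which is the incorrect workaround described above.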
Aside: I just read Victor's PEP 460, and apparently a lot of the assumptions I'm making are true!
Andrew Barnert writes:
From: Geert Jansen
On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull
wrote:
I'm not missing a new type, but I am missing the format method on the binary types.
I'm curious about precisely what your use cases are, and just what formatting they need.
Besides Geert's chunked HTTP example, there are tons of internet protocols and file formats (including Python source code!),
Python source code must use an ASCII-compatible encoding to use PEP 263. No widechars, no EBCDIC. But yes, I know about ASCII header formats -- I'm a Mailman developer.
that have ASCII headers (that in some way define an encoding for the actual payload). So things like b'Content-Length: {}'.format(len(payload)) or even b'Content-Type: text/html; charset={}'.format(encoding) are useful.
Useful, sure. But that much more useful than the alternative? What's wrong with
    def itob(n):    # besides efficiency :-)
        return "{0:d}".format(n).encode('ascii')
    b'Content-Length: ' + itob(len(payload))
    b'Content-Type: text/html; charset=' + encoding
for such cases? Not to forget that for cases with multiple parts to combine, bytes.join() is way fast -- which matters to most people who want these operations. So I just don't see a real need for generic formatting operations here. (regex is another matter, but that's already implemented.)
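Put together as something runnable, the itob-plus-join approach might look like the sketch below. The build_headers wrapper and the CRLF line endings are additions made for this example; the charset is passed as bytes, since it is concatenated with bytes.

```python
def itob(n):
    """Integer to its ASCII decimal representation, as bytes."""
    return "{0:d}".format(n).encode("ascii")


def build_headers(payload, encoding):
    # bytes.join() combines all the parts in a single pass.
    return b"".join([
        b"Content-Length: ", itob(len(payload)), b"\r\n",
        b"Content-Type: text/html; charset=", encoding, b"\r\n",
    ])


headers = build_headers(b"abc", b"utf-8")
print(headers)
```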
As for assuming that it's ASCII-compatible, again, there are all kinds of protocols that work with any ASCII-compatible charset but don't work otherwise.
If you *can* assume it's ASCII-compatible bytes, what's wrong with str in Python 3? The basic idea is to use
    inbytes.decode('ascii', errors='surrogateescape')
which will DTRT if you try to encode it without the surrogateescape handler: it raises an exception unless the bytes is pure ASCII. It's memory-efficient for pure ASCII, and has all the string facilities we love. But of course it would be too painful for sending JPEGs by chunked HTTP a la Geert.
So ... now that we have the flexible string representation (PEP 393), let's add a 7-bit representation! (Don't take that too seriously, there are interesting more general variants I'm not going to talk about tonight.) The 7-bit representation satisfies the following requirements:
1. It is only produced on input by a new 'ascii-compatible' codec, which sets the "7-bit representation" flag in the str object on input if it encounters any non-ASCII bytes (if pure ASCII, it produces an 8-bit str object). This will be slower than just reading in the bytes in many cases, but I hope not unacceptably so.
2. When sliced, the result needs to be checked for non-ASCII bytes. If none, the result is promoted to 8-bit.
3. When combined with a str in 8-bit representation:
   a. If the 8-bit str contains any Latin-1 or C1 characters, both strs are promoted to 16-bit, and non-ASCII characters in the 7-bit string are converted by the surrogateescape handler.
   b. Otherwise they're combined into a 7-bit str.
4. When combined with a str in 16-bit or 32-bit representation, the 7-bit string is "decoded" to the same representation, as if using the 'ascii' codec with the 'surrogateescape' handler.
5. String methods that would raise or produce undefined results if used on str containing surrogate-encoded bytes need to be taught to do the same on non-ASCII bytes in 7-bit str objects.
6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str and pure ASCII 8-bit str, and raises on anything else. (Sorry, no, ISO 8859-1 does *not* get passed through without exception.)
7. On output other codecs raise on a 7-bit str, unless the surrogateescape handler is in use.
IOW, it's almost as fast as bytes if you restrict yourself to ASCII-compatible behavior, and you pay the price if you try to mix it with "real" Unicode str objects. Otherwise you can do anything with it you could do with a str.
I don't think this actually has serious efficiency implications for Unicode handling, since the relevant compatibility tests need to be done anyway when combining strs. All the expensive operations occur when mixing 7-bit str and "real" non-ASCII Unicode, but we really don't want to do that if we can avoid it, any more than we want to use surrogate encoding if we can avoid it.
Efficiency for low-level protocols could be improved by having the 'ascii-compatible' codec always produce 7-bit. I haven't thought carefully about this yet.
For the same reasons, there should be few surprises where people inadvertently mix 7-bit str with "real" Unicode, since creating 7-bit is only done by the 'ascii-compatible' codec. People who are doing that will be using ASCII-compatible protocols and should be used to being careful with non-ASCII bytes.
Finally, none of the natural idioms require a b prefix on their literals. :-)
N.B. Much of the above assumes that working with Unicode in 8-bit representation is basically as efficient as working with bytes. That is an assumption on my part, I hope it's verified.
Comments?
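The existing building block that this proposal starts from, decoding with the 'ascii' codec and the 'surrogateescape' error handler, already behaves as described. The sketch below shows only that existing behaviour; the 7-bit representation and the 'ascii-compatible' codec are proposals and are not modelled here. The sample bytes are made up.

```python
raw = b"caf\xe9"  # ASCII-compatible data with one non-ASCII byte

# Non-ASCII bytes are smuggled through as lone surrogates (U+DC80..U+DCFF).
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")

# Encoding without the handler raises unless the data was pure ASCII.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

pure = b"Content-Length: 3".decode("ascii", errors="surrogateescape")

print(repr(text), round_tripped, strict_ok, pure.encode("ascii"))
```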
On 01/06/2014 10:37 AM, Stephen J. Turnbull wrote:
Comments?
Having a 7-bit str variant is definitely an interesting idea, but it wouldn't help me and is probably insufficient for network protocols as well. The binary data I deal with occupies the full 0-255 range, some of which is actually encoded text (and I decode it before passing it back to the user), some of which is simple binary data, and some of which is simple ASCII (metadata about fields and whatnot). -- ~Ethan~
Ethan Furman writes:
Having a 7-bit str variant is definitely an interesting idea, but it wouldn't help me and is probably insufficient for network protocols as well.
I'd like evidence for that latter.
The binary data I deal with occupies the full 0-255 range,
My proposal deals with such data. It simply prevents the program from interpreting the 128-255 range as Unicode characters. You can still use regexps etc on the full range 0-255.
some of which is actually encoded text (and I decode it before passing it back to the user), some of which is simple binary data, and some of which is simple ASCII (metadata about fields and whatnot).
You're wrong, it would help you. Encoded text must be decoded, and in that case it doesn't help you. Unless you can treat it as a single ASCII-compatible encoding (eg, this works for ISO-8859 or KOI8), when the proposal wins for you. Binary data and pure ASCII, the proposal wins for you, unless you're worried about spurious recognition of the binary data as ASCII metadata. In that last case, again, nothing is going to help you as it's a domain problem. My proposal is undefeated in your use case.
On 01/06/2014 09:05 PM, Stephen J. Turnbull wrote:
Ethan Furman writes:
The binary data I deal with occupies the full 0-255 range,
My proposal deals with such data. It simply prevents the program from interpreting the 128-255 range as Unicode characters. You can still use regexps etc on the full range 0-255.
some of which is actually encoded text (and I decode it before passing it back to the user), some of which is simple binary data, and some of which is simple ASCII (metadata about fields and whatnot).
You're wrong, it would help you. Encoded text must be decoded, and in that case it doesn't help you. Unless you can treat it as a single ASCII-compatible encoding (eg, this works for ISO-8859 or KOI8), when the proposal wins for you. Binary data and pure ASCII, the proposal wins for you, unless you're worried about spurious recognition of the binary data as ASCII metadata. In that last case, again, nothing is going to help you as it's a domain problem. My proposal is undefeated in your use case.
I just read your proposal again, and must admit I don't understand how it would help me, but I look forward to testing an implementation! One wrinkle, though -- the data is binary, and if read would have to be read using the latin1 codec... although, I suppose I could open it, read the first 32 bytes, close it, figure out the encoding, reopen with the encoding.... hmmmm -- yup, still not sure how it would all work, but looking forward to testing it. -- ~Ethan~
Ethan Furman writes:
I just read your proposal again, and must admit I don't understand how it would help me, but I look forward to testing an implementation!
One wrinkle, though -- the data is binary, and if read would have to be read using the latin1 codec...
That depends on what you mean by "binary". If the binary payload is just a blob that gets passed on (eg, as in an HTTP client receiving and storing a JPEG file), you read the stream as 'ascii-compatible', parse the headers using regexps or whatever, print any relevant parsed data to logs using 'ascii-compatible', slice off the blob, and write the blob to disk as 'ascii-compatible'. This has the advantage over latin1 that the bytes are marked as "uninterpreted text". It doesn't mean you can't create mojibake; you still can. But Python will complain if you try to output it as text in an encoding (unless you use the 'surrogateescape' handler, in which case you're explicitly accepting responsibility for any mess you create). If you mean to process the binary, it would depend on what you want to do whether it would help or not. struct- and ctypes-style processing, no, it won't help because you need to convert to bytes to use those. (It might make sense to read the headers into a buffer this way, parse them as ASCII-compatible text, and then read the rest as bytes.) Pure byte code, doesn't help, although it probably doesn't hurt.
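The header-then-blob workflow described above can be approximated today with the 'ascii' codec plus 'surrogateescape', standing in for the hypothetical 'ascii-compatible' codec. This is only a sketch; the sample response and the variable names are assumptions made for illustration.

```python
import re

# Illustrative HTTP-like response: ASCII headers, then an opaque binary payload.
raw = b"HTTP/1.1 200 OK\r\nContent-Length: 4\r\n\r\n\xff\xd8\xff\xe0"

# Stand-in for reading the stream as 'ascii-compatible'.
text = raw.decode("ascii", errors="surrogateescape")

# Parse the headers with ordinary str tools and slice off the blob.
head, _, blob = text.partition("\r\n\r\n")
match = re.search(r"Content-Length: (\d+)", head)
declared_length = int(match.group(1))

# Writing the blob out: the handler turns the escaped code points back into bytes.
payload = blob.encode("ascii", errors="surrogateescape")
```

Unlike a latin1 decode, an attempt to encode `blob` as ordinary text without the handler raises instead of silently producing mojibake.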
On 01/07/2014 05:00 AM, Stephen J. Turnbull wrote:
Ethan Furman writes:
I just read your proposal again, and must admit I don't understand how it would help me, but I look forward to testing an implementation!
One wrinkle, though -- the data is binary, and if read would have to be read using the latin1 codec...
If you mean to process the binary, it would depend on what you want to do whether it would help or not. struct- and ctypes-style processing, no, it won't help because you need to convert to bytes to use those. (It might make sense to read the headers into a buffer this way, parse them as ASCII-compatible text, and then read the rest as bytes.) Pure byte code, doesn't help, although it probably doesn't hurt.
Sounds like it doesn't help me then. My binary stream is mixed:
- binary that has to be converted (4-byte ints, for example)
- ascii that has to be converted (ints stored as ascii text)
- encoded text (character and memo fields)
and the precise location of each varies from file to file.
-- ~Ethan~
On Tue, 07 Jan 2014 08:48:05 -0800
Ethan Furman
- ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
What is the difference supposed to be between those two? Regards Antoine.
On 01/07/2014 09:57 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 08:48:05 -0800 Ethan Furman
wrote: - ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
What is the difference supposed to be between those two?
The method used for conversion and the return type:
- ascii-encoded text: b'123' --> int(123)
- encoded text (ascii or russian or asian or ...): b'abc' --> u'abc'
and for completeness:
- binary integer: b'\x00\x01' --> int(1)
-- ~Ethan~
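For reference, the three conversions listed above can be written in Python 3 roughly as follows. This is a sketch: the 'cp1251' codec and the big-endian byte order are assumptions chosen to match the examples, not details given in the thread.

```python
import struct

# ASCII-encoded integer: int() accepts bytes directly in Python 3.
ascii_int = int(b"123")

# Encoded text: decode with whatever codec the file declares (cp1251 assumed here).
text_field = b"abc".decode("cp1251")

# Binary integer: big-endian is assumed so that b'\x00\x01' maps to 1.
binary_int = int.from_bytes(b"\x00\x01", byteorder="big")

# The same binary conversion using struct (unsigned 16-bit, big-endian).
binary_int_struct = struct.unpack(">H", b"\x00\x01")[0]
```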
On Tue, 07 Jan 2014 10:10:19 -0800
Ethan Furman
On 01/07/2014 09:57 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 08:48:05 -0800 Ethan Furman
wrote: - ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
What is the difference supposed to be between those two?
The method used for conversion and the return type:
- ascii-encoded text: b'123' --> int(123) - encoded text (ascii or russian or asian or ...): b'abc' --> u'abc'
I'm sorry, I still don't parse this. What is it in Python 3.3 that prevents you from doing this? Regards Antoine.
On 01/07/2014 10:47 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 10:10:19 -0800 Ethan Furman
wrote: On 01/07/2014 09:57 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 08:48:05 -0800 Ethan Furman
wrote: - ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
What is the difference supposed to be between those two?
The method used for conversion and the return type:
- ascii-encoded text: b'123' --> int(123) - encoded text (ascii or russian or asian or ...): b'abc' --> u'abc'
I'm sorry, I still don't parse this. What is it in Python 3.3 that prevents you from doing this?
Nothing at all, and that part works fine. The trouble (for me) comes in when I try to use single bytes, either when creating or extracting. The above examples were to show that Stephen J Turnbull's idea wouldn't work for me. -- ~Ethan~
On Tue, 07 Jan 2014 10:57:04 -0800
Ethan Furman
Nothing at all, and that part works fine.
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3... Regards Antoine.
On 01/07/2014 11:59 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 10:57:04 -0800 Ethan Furman
wrote: Nothing at all, and that part works fine.
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3...
No, I'm not. I don't think of b'C' as the integer 67 any more than I think of the number 256 as the bytes b'\x01\xFF'. I don't think of a series of bytes as a container any more than I think of a series of characters as a container. -- ~Ethan~
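The single-byte behaviour under discussion, shown with the stock Python 3 bytes type. This is just an illustration of current semantics and the usual workarounds, not a proposal.

```python
data = b"C"

# Indexing yields an int, not a length-1 bytes object.
first = data[0]

# Slicing is the usual workaround to get a length-1 bytes object.
first_as_bytes = data[0:1]

# bytes(n) builds n zero bytes; a one-element iterable builds a single byte.
seven_zeros = bytes(7)
single_seven = bytes([7])

# Iterating also yields ints, so each one is wrapped to recover single bytes.
pieces = [bytes([b]) for b in b"abc"]
```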
On Tue, 07 Jan 2014 12:07:15 -0800
Ethan Furman
On 01/07/2014 11:59 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 10:57:04 -0800 Ethan Furman
wrote: Nothing at all, and that part works fine.
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3...
No, I'm not. I don't think of b'C' as the integer 67 any more than I think of the number 256 as the bytes b'\x01\xFF'.
Ethan, can you please show a practical issue you're having?
On 01/07/2014 12:08 PM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 12:07:15 -0800 Ethan Furman
wrote: On 01/07/2014 11:59 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 10:57:04 -0800 Ethan Furman
wrote: Nothing at all, and that part works fine.
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3...
No, I'm not. I don't think of b'C' as the integer 67 any more than I think of the number 256 as the bytes b'\x01\xFF'.
Ethan, can you please show a practical issue you're having?
Seriously? You've already agreed with me on my first two points at the beginning of this thread. It's safe to assume I was having practical issues with those points. -- ~Ethan~
On Tue, 07 Jan 2014 12:49:11 -0800
Ethan Furman
On 01/07/2014 12:08 PM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 12:07:15 -0800 Ethan Furman
wrote: On 01/07/2014 11:59 AM, Antoine Pitrou wrote:
On Tue, 07 Jan 2014 10:57:04 -0800 Ethan Furman
wrote: Nothing at all, and that part works fine.
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3...
No, I'm not. I don't think of b'C' as the integer 67 any more than I think of the number 256 as the bytes b'\x01\xFF'.
Ethan, can you please show a practical issue you're having?
Seriously? You've already agreed with me on my first two points at the beginning of this thread. It's safe to assume I was having practical issues with those points.
Well, I agree with those points, but I still think they're minor, and not very hard to work around. Hence my comment about "exaggerating the trouble". Regards Antoine.
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3...
No, I'm not. I don't think of b'C' as the integer 67 any more than I think of the number 256 as the bytes b'\x01\xFF'.
There's something fundamentally wrong with these brainfarts coming out on the list. Just how, Ethan, did you think you could represent binary data in a text string, whether preceded by the char 'b' or not? What did you think you would do when you got to character 0, the first (pseudo)-symbol in ASCII? Why don't you jackasses start listening instead of wanking each other with bullshit? markj
I'm the developer of PyMySQL (https://github.com/pymysql/pymysql), a pure Python MySQL driver.
I'd like to share the trouble I've had because bytes doesn't have %-format.
MySQL-python (https://github.com/farcepest/MySQLdb1) is the major DB-API 2.0 driver for MySQL.
Other MySQL drivers, like PyMySQL and MySQL-connector-python, are designed to be as compatible with it as possible.
MySQL-python uses the 'format' paramstyle.
http://www.python.org/dev/peps/pep-0249/#paramstyle
https://github.com/farcepest/MySQLdb1/blob/master/MySQLdb/__init__.py#L27
The MySQL protocol is basically encoded text, but it may contain arbitrary (escaped) binary.
Here is a simplified example that constructs real SQL from an SQL format and arguments. (Works only on Python 2.7)
def escape_string(s):
    return s.replace("'", "''")
def convert(x):
    if isinstance(x, unicode):
        x = x.encode('utf-8')  # Use encoding assigned to connection in real.
    if isinstance(x, str):
        x = "'" + escape_string(x) + "'"  # 'quoted and '' escaped string'
    else:
        x = str(x)  # like 42
    return x
def build_query(query, *args):
    if isinstance(query, unicode):
        query = query.encode('utf-8')
    return query % tuple(map(convert, args))
textdata = b"hello"
bindata = b"abc\xff\x00"
query = "UPDATE table SET textcol=%s bincol=%s"
print build_query(query, textdata, bindata)
I can't port this to Python 3.
Fortunately, MySQL supports hex strings like x'616263ff00'.
So I use them, and PyMySQL supports binary data on Python 3.
But a hex string takes twice the space of normal (escaped) bytes.
This is why I don't use hex strings on Python 2.
https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/converters.py#L303
https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/converters.py#L71
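The hex-string workaround mentioned above can be sketched as follows. The helper name `hex_literal` is hypothetical and is not taken from PyMySQL's source; only the x'...' literal form comes from the message.

```python
import binascii

def hex_literal(data):
    """Render bytes as a MySQL-style hex string literal, e.g. x'616263ff00'."""
    return "x'" + binascii.hexlify(data).decode("ascii") + "'"

bindata = b"abc\xff\x00"
literal = hex_literal(bindata)

# Two hex digits per input byte: this is the doubled size the message refers to.
hex_digits = len(literal) - len("x''")
```

Because the literal is pure ASCII, it can be interpolated into a str query and encoded with any ASCII-compatible connection encoding.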
On Wed, Jan 8, 2014 at 8:20 AM, Mark Janssen
The trouble (for me) comes in when I try to use single bytes, either when creating or extracting.
Hmm... aren't you exaggerating the trouble? It's not very difficult to work with single bytes in Python 3...
No, I'm not. I don't think of b'C' as the integer 67 any more than I think of the number 256 as the bytes b'\x01\xFF'.
There's something fundamentally wrong with these brainfarts coming out on the list. Just how, Ethan, did you think you could represent binary data in a text string, whether preceded by the char 'b' or not? What did you think you would do when you got to character 0, the first (pseudo)-symbol in ASCII?
Why don't you jackasses start listening instead of wanking each other with bullshit?
markj
--
INADA Naoki
On Wed, 8 Jan 2014 09:50:30 +0900
INADA Naoki
textdata = b"hello"
textdata shouldn't be a bytes object! If it's text it's a str.
bindata = b"abc\xff\x00" query = "UPDATE table SET textcol=%s bincol=%s"
print build_query(query, textdata, bindata)
I can't port this to Python 3.
I'm sure you can port it. Just decode your bindata using surrogateescape:
    bindata = bindata.decode('utf8', 'surrogateescape')
and then encode the query at the end:
    query = query.encode('utf8', 'surrogateescape')
It will be a little slower, though.
Regards
Antoine
On Wed, Jan 8, 2014 at 7:34 PM, Antoine Pitrou
On Wed, 8 Jan 2014 09:50:30 +0900 INADA Naoki
wrote: textdata = b"hello"
textdata shouldn't be a bytes object! If it's text it's a str.
PyMySQL and MySQL-python support both unicode text and encoded text. So bytes may be text in MySQL if it is inserted into a TEXT or VARCHAR column.
bindata = b"abc\xff\x00" query = "UPDATE table SET textcol=%s bincol=%s"
print build_query(query, textdata, bindata)
I can't port this to Python 3.
I'm sure you can port it. Just decode your bindata using surrogateescape:
bindata = bindata.decode('utf8', 'surrogateescape')
and then encode the query at the end:
query = query.encode('utf8', 'surrogateescape')
It will be a little slower, though.
You're right. I've not considered using surrogateescape here. But the MySQL connection may not be utf8. Its default is latin1, and you can use many encodings. Some encodings don't ensure roundtrip. With such an encoding,
    bindata = bindata.decode('sjis', 'surrogateescape')
    query = query % bindata
    query.encode('sjis', 'surrogateescape')
may break bindata. I may be able to use ascii for decoding when mysql uses an ASCII-compatible encoding.
    bindata = bindata.decode('ascii', 'surrogateescape')
    query = query % bindata
    query.encode('sjis', 'surrogateescape')
But I think decode/encode with surrogateescape is not only slow, but also dangerous when using an encoding other than ascii or utf8.
Regards
Antoine
--
INADA Naoki
On Wed, 8 Jan 2014 20:31:10 +0900
INADA Naoki
You're right. I've not considered using surrogateescape here.
But the MySQL connection may not be utf8. Its default is latin1, and you can use many encodings. Some encodings don't ensure roundtrip. With such an encoding,
[...]
But I think decode/encode with surrogateescape is not only slow, but also dangerous when using an encoding other than ascii or utf8.
You're right. Thanks for exposing your use case, I think it's a good data point for the bytes formatting PEP. Regards Antoine.
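For readers on Python 3.5 or later, where PEP 461 added %-formatting to bytes, the simplified query builder from earlier in the thread can be ported without decoding the binary data at all. This is a sketch under that assumption; the quote-doubling escape mirrors the simplified example and is not a complete SQL escaper, and `escape_bytes` is a name introduced here.

```python
def escape_bytes(b):
    # Same simplified escaping as the thread's example: double single quotes.
    return b.replace(b"'", b"''")

def convert(x):
    if isinstance(x, str):
        x = x.encode("utf-8")  # a real driver would use the connection encoding
    if isinstance(x, (bytes, bytearray)):
        return b"'" + escape_bytes(bytes(x)) + b"'"
    return str(x).encode("ascii")  # e.g. 42 -> b"42"

def build_query(query, *args):
    if isinstance(query, str):
        query = query.encode("utf-8")
    # bytes % tuple requires Python 3.5+ (PEP 461).
    return query % tuple(map(convert, args))

result = build_query("UPDATE table SET textcol=%s bincol=%s", b"hello", b"abc\xff\x00")
```

The binary column value passes through untouched, so no round-trip property of the connection encoding is needed.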
FYI, I can easily make sample data that does not roundtrip with the iso2022-jp encoding.
In [5]: b'\x1b$B\x1b(B'.decode('iso2022_jp')
Out[5]: ''
In [6]: b'\x1b$B\x1b(B'.decode('iso2022_jp', 'surrogateescape').encode('iso2022_jp', 'surrogateescape')
Out[6]: b''
On Wed, Jan 8, 2014 at 8:38 PM, Antoine Pitrou
On Wed, 8 Jan 2014 20:31:10 +0900 INADA Naoki
wrote: You're right. I've not considered using surrogateescape here.
But the MySQL connection may not be utf8. Its default is latin1, and you can use many encodings. Some encodings don't ensure roundtrip. With such an encoding,
[...]
But I think decode/encode with surrogateescape is not only slow, but also dangerous when using an encoding other than ascii or utf8.
You're right. Thanks for exposing your use case, I think it's a good data point for the bytes formatting PEP.
Regards
Antoine.
--
INADA Naoki
INADA Naoki writes: On Wed, Jan 8, 2014 at 7:34 PM, Antoine Pitrou
wrote: INADA Naoki wrote:
Some encodings don't ensure roundtrip.
In that case, in Python 2 you're depending on all "text" to be encoded in the same encoding. And even so you may be in trouble:
def convert(x):
    if isinstance(x, unicode):
        x = x.encode(round_trip_not_guaranteed)
could cause your query to fail when it should succeed. 'x' is user-supplied data, so you have no control over that.
I may be able to use ascii for decoding when mysql uses an ASCII-compatible encoding.
You can *always* use 'ascii', 'latin1', or 'utf-8' with 'surrogateescape' for decoding, and roundtrip is guaranteed.
But I think decode/encode with surrogateescape is not only slow,
Evidence? Especially as compared with the connection overhead of the DBMS?
but also dangerous when using an encoding other than ascii or utf8.
Or latin1. But here's your code as translated to Python 3.3, assuming a connection encoding of Shift JIS:
# unchanged source, but this is Python 3 str == Unicode
def escape_string(s):
    return s.replace("'", "''")
def convert(x):
    if isinstance(x, str):  # Correct type unicode->str
        x = "'" + escape_string(x) + "'"
    elif isinstance(x, bytes):  # Correct type str->bytes
        # SAFE: ASCII is a Unicode subset, RT guaranteed.
        x = x.decode('ascii', errors='surrogateescape')
        x = "'" + escape_string(x) + "'"
    else:
        x = str(x)
    return x
def build_query(query, *args):
    if isinstance(query, bytes):  # want str for the format operator
        query = query.decode('sjis')
    query = query % tuple(map(convert, args))
    # CORRECT: for ASCII-compatible encodings, including Shift
    # JIS and Big 5, since the binary blob doesn't contain any
    # non-ASCII characters and the non-character bytes 128-255
    # will be restored properly by the error handler.
    return query.encode('sjis', errors='surrogateescape')
textdata = b"hello"  # or "hello"
bindata = b"abc\xff\x00"
query = "UPDATE table SET textcol=%s bincol=%s"
print(build_query(query, textdata, bindata))
The only problem with correctness will occur if the MySQL connection uses a non-ASCII-compatible encoding (UTF-16, fixed-width EUC) in the query string, because the ASCII bytes in the blob will be "widened" by "encode". Widechar encodings could actually be handled with a "binary" codec that recognizes *no* characters and always surrogate-encodes every byte. But that's pretty obviously going to be unacceptable. I guess bytes.format() is pretty well unstoppable at this point.
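The round-trip guarantee claimed above for decoding with 'ascii', 'latin1', or 'utf-8' plus 'surrogateescape' can be checked over every possible byte value. This is a small sketch using only stock codecs; the variable names are illustrative.

```python
# Every byte value once, including sequences that are invalid in UTF-8.
every_byte = bytes(range(256))

round_trips = {}
for codec in ("ascii", "latin1", "utf-8"):
    as_text = every_byte.decode(codec, errors="surrogateescape")
    # Encoding with the same codec and handler must give back the input.
    round_trips[codec] = as_text.encode(codec, errors="surrogateescape") == every_byte
```

The guarantee only covers encoding back with the same codec; encoding the decoded text with a different target encoding is where the widening problem described above appears.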
INADA Naoki writes:
I'd like to share the trouble I've had because bytes doesn't have %-format. MySQL-python is the major DB-API 2.0 driver for MySQL. MySQL-python uses the 'format' paramstyle.
The MySQL protocol is basically encoded text, but it may contain arbitrary (escaped) binary. Here is a simplified example that constructs real SQL from an SQL format and arguments. (Works only on Python 2.7)
'>' quotes are omitted for clarity and comments deleted.
def escape_string(s):
    return s.replace("'", "''")
def convert(x):
    if isinstance(x, unicode):
        x = x.encode('utf-8')
    if isinstance(x, str):
        x = "'" + escape_string(x) + "'"
    else:
        x = str(x)
    return x
def build_query(query, *args):
    if isinstance(query, unicode):
        query = query.encode('utf-8')
    return query % tuple(map(convert, args))
textdata = b"hello"
bindata = b"abc\xff\x00"
query = "UPDATE table SET textcol=%s bincol=%s"
print build_query(query, textdata, bindata)
I can't port this to Python 3.
Why not? The obvious translation is
# This is Python 3!!
def escape_string(s):
    return s.replace("'", "''")
def convert(x):
    if isinstance(x, bytes):
        x = escape_string(x.decode('ascii', errors='surrogateescape'))
        x = "'" + x + "'"
    else:
        x = str(x)
    return x
def build_query(query, *args):
    query = query % tuple(map(convert, args))
    return query.encode('utf-8', errors='surrogateescape')
textdata = "hello"
bindata = b"abc\xff\x00"
query = "UPDATE table SET textcol=%s bincol=%s"
print(build_query(query, textdata, bindata))
The main issue I can think you might have with this is that there will need to be conversions to and from 16-bit representations, which take up unnecessary space for bindata, and are relatively slow for bindata. But it seems to me that these are second-order costs compared to the other work an adapter needs to do. What am I missing?
With the proposed 'ascii-compatible' representation, if you have to handle many MB of binary or textdata with non-ASCII characters,
def convert(x):
    if isinstance(x, str):
        x = x.encode('utf-8').decode('ascii-compatible')
    elif isinstance(x, bytes):
        x = escape_string(x.decode('ascii-compatible'))
        x = "'" + x + "'"
    else:
        x = str(x)  # like 42
    return x
def build_query(query, *args):
    query = convert(query) % tuple(map(convert, args))
    return query.encode('utf-8', errors='surrogateescape')
ensures that the '%' format operator is always dealing with 8-bit representations only. There might be a conversion from 16-bit to 8-bit for str, but there will be no conversions from 8-bit to 16-bit representations. I don't know if that makes '%' itself faster, but it might.
You're right.
As I said in my previous mail, I had not considered using surrogateescape.
But surrogateescape is not a silver bullet.
Decoding with ascii and encoding with the target encoding is not valid unless the target is an ASCII-compatible encoding.
In [29]: bindata = b'abc'
In [30]: bindata = bindata.decode('ascii', 'surrogateescape')
In [31]: text = 'abc'
In [32]: query = 'SET textcolumn=%s bincolumn=%s' % ("'" + text + "'", "'" + bindata + "'")
In [33]: query.encode('utf16', 'surrogateescape')
Out[33]: b"\xff\xfeS\x00E\x00T\x00 \x00t\x00e\x00x\x00t\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00 \x00b\x00i\x00n\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00"
Fortunately, I can't use utf16 as client encoding with MySQL.
mysql> SET NAMES utf16;
ERROR 1231 (42000): Variable 'character_set_client' can't be set to the value of 'utf16'
On Wed, Jan 8, 2014 at 9:11 PM, Stephen J. Turnbull
INADA Naoki writes:
I'd like to share the trouble I've had because bytes doesn't have %-format. MySQL-python is the major DB-API 2.0 driver for MySQL. MySQL-python uses the 'format' paramstyle.
The MySQL protocol is basically encoded text, but it may contain arbitrary (escaped) binary. Here is a simplified example that constructs real SQL from an SQL format and arguments. (Works only on Python 2.7)
'>' quotes are omitted for clarity and comments deleted.
def escape_string(s): return s.replace("'", "''")
def convert(x): if isinstance(x, unicode): x = x.encode('utf-8') if isinstance(x, str): x = "'" + escape_string(x) + "'" else: x = str(x) return x
def build_query(query, *args): if isinstance(query, unicode): query = query.encode('utf-8') return query % tuple(map(convert, args))
textdata = b"hello" bindata = b"abc\xff\x00" query = "UPDATE table SET textcol=%s bincol=%s"
print build_query(query, textdata, bindata)
I can't port this to Python 3.
Why not? The obvious translation is
# This is Python 3!! def escape_string(s): return s.replace("'", "''")
def convert(x): if isinstance(x, bytes): x = escape_string(x.decode('ascii', errors='surrogateescape')) x = "'" + x + "'" else: x = str(x) return x
def build_query(query, *args): query = query % tuple(map(convert, args)) return query.encode('utf-8', errors='surrogateescape')
textdata = "hello" bindata = b"abc\xff\x00" query = "UPDATE table SET textcol=%s bincol=%s"
print build_query(query, textdata, bindata)
The main issue I can think you might have with this is that there will need to be conversions to and from 16-bit representations, which take up unnecessary space for bindata, and are relatively slow for bindata. But it seems to me that these are second-order costs compared to the other work an adapter needs to do. What am I missing?
With the proposed 'ascii-compatible' representation, if you have to handle many MB of binary or textdata with non-ASCII characters,
def convert(x): if isinstance(x, str): x = x.encode('utf-8').decode('ascii-compatible') elif isinstance(x, bytes): x = escape_string(x.decode('ascii-compatible')) x = "'" + x + "'" else: x = str(x) # like 42 return x
def build_query(query, *args): query = convert(query) % tuple(map(convert, args)) return query.encode('utf-8', errors='surrogateescape')
ensures that the '%' format operator is always dealing with 8-bit representations only. There might be a conversion from 16-bit to 8-bit for str, but there will be no conversions from 8-bit to 16-bit representations. I don't know if that makes '%' itself faster, but it might.
--
INADA Naoki
On Tue, Jan 07, 2014 at 08:48:05AM -0800, Ethan Furman wrote:
[...] My binary stream is mixed:
- binary that has to be converted (4-byte ints, for example) - ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
Ethan, you keep referring to ascii text and encoded text as if they are different things. They're not. You have a binary file containing bytes. Some of those bytes represent data of one kind (say, 4-byte ints). Some of those bytes represent data of a different kind (Latin-1 encoded text representing character and memo fields) and other bytes represent data of a third kind (ASCII encoded text representing ints, but you don't mention what the meaning of those ints is). ASCII or Latin-1, the text is still encoded into bytes, and still needs to be decoded back to text. Since Latin-1 is a superset of ASCII, you could use Latin-1 for them all, and still get the same result. Of course you can't just decode the entire file into Latin-1, since parts of it represent non-text data, but you could decode all the text parts individually using Latin-1 and/or ASCII. (To those reading and wondering how I know the character and memo fields use Latin-1, Ethan has discussed this case on comp.lang.python.) -- Steven
On 01/07/2014 04:39 PM, Steven D'Aprano wrote:
On Tue, Jan 07, 2014 at 08:48:05AM -0800, Ethan Furman wrote:
[...] My binary stream is mixed:
- binary that has to be converted (4-byte ints, for example) - ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
Ethan, you keep referring to ascii text and encoded text as if they are different things. They're not.
Would you feel better if I called them ASCII-encoded text, and other-encoded text? And they are different, if for no other reason than they are using different encodings. Further, the ASCII-encoded text can be directly compared with byte sequences because . . . they're bytes! ;)
You have a binary file containing bytes. Some of those bytes represent data of one kind (say, 4-byte ints). Some of those bytes represent data of a different kind (Latin-1 encoded text representing character and memo fields) and other bytes represent data of a third kind (ASCII encoded text representing ints, but you don't mention what the meaning of those ints is).
ASCII-encoded text representing ints are ints. I don't know what they mean, but presumably they have something to do with whatever the user named the field. For example, I would imagine that b'35' in an AGE field meant 35 years; luckily I only have to give the user back the integer 35, not figure out what it's supposed to mean.
ASCII or Latin-1, the text is still encoded into bytes, and still needs to be decoded back to text.
No, it doesn't. I don't need to convert b'35' into u'35' to convert to 35. I don't need to convert b'N' to u'N' to know I have a Numeric field, nor b'T' to u'T' to get True. -- ~Ethan~
Ethan Furman writes:
Sounds like it doesn't help me then. My binary stream is mixed:
- binary that has to be converted (4-byte ints, for example) - ascii that has to be converted (ints stored as ascii text) - encoded text (character and memo fields)
and the precise location of each varies from file to file.
Yes, I understand all that, but without code examples (or rather precise specification of the semantics you're implementing) I can't discuss whether my 'ascii-compatible' (the Artist Formerly Known as "7-bit representation") would help you write efficient and readable code. Cf. INADA-san's post for what would help me.
On Tue, Jan 07, 2014 at 03:37:36AM +0900, Stephen J. Turnbull wrote:
So ... now that we have the flexible string representation (PEP 393), let's add a 7-bit representation! (Don't take that too seriously, there are interesting more general variants I'm not going to talk about tonight.)
The 7-bit representation satisfies the following requirements:
1. It is only produced on input by a new 'ascii-compatible' codec, which sets the "7-bit representation" flag in the str object on input if it encounters any non-ASCII bytes (if pure ASCII, it produces an 8-bit str object). This will be slower than just reading in the bytes in many cases, but I hope not unacceptably so.
I'm confused by your suggestion here. It seems to me that you've got the conditions backwards. (Or I don't understand them.) Perhaps a couple of examples will make it clear. Suppose we take a pure-ASCII byte-string and decode it:
    b'abcd'.decode('ascii-compatible')
According to the above, this will produce a regular str object, 'abcd', using the regular 8-bit internal representation, and the "7-bit repr" flag cleared. Correct? (So the flag is *cleared* when all the chars in the string are 7-bit, and *set* when at least one is not. Yes?) Suppose we take a byte-string with a non-ASCII byte:
    b'abc\xFF'.decode('ascii-compatible')
This will return... what? I think it returns a so-called 7-bit representation, but I'm not sure what it is a representation of. I presume the internals will actually contain the four bytes 61 62 63 FF and the "7-bit repr" flag will be set. Is that flag the only difference between these two strings?
    b'abc\xFF'.decode('ascii-compatible')
    'abc\xFF'
Presumably they will compare equal, yes?
2. When sliced, the result needs to be checked for non-ASCII bytes. If none, the result is promoted to 8-bit.
3. When combined with a str in 8-bit representation:
a. If the 8-bit str contains any Latin-1 or C1 characters, both strs are promoted to 16-bit, and non-ASCII characters in the 7-bit string are converted by the surrogateescape handler.
b. Otherwise they're combined into a 7-bit str.
A concrete example:
s = b'abcd'.decode('ascii-compatible')
t = 'x' # ASCII-compatible
s + t => returns 'abcdx', with the "7-bit repr" flag cleared.
s = b'abcd'.decode('ascii-compatible')
t = 'ÿ' # U+00FF, non-ASCII.
s + t => returns 'abcd\uDCFF', with the "7-bit repr" flag set
The \uDCFF at the end is the ÿ encoded with the surrogateescape error handler.
There's a problem with this: two strings, visually indistinguishable, but differing only in the internal representation, give completely different results:
b'abcd'.decode('ascii') + 'ÿ' => 'abcd\u00FF'
b'abcd'.decode('ascii-compatible') + 'ÿ' => 'abcd\uDCFF'
4. When combined with a str in 16-bit or 32-bit representation, the 7-bit string is "decoded" to the same representation, as if using the 'ascii' codec with the 'surrogateescape' handler.
Another example:
s = b'abcd'.decode('ascii-compatible')
assert s == 'abcd'
s + 'π' => returns what?
Your description confuses me. The "7-bit string" is already text, how do you decode it to the 16-bit internal representation?
5. String methods that would raise or produce undefined results if used on str containing surrogate-encoded bytes need to be taught to do the same on non-ASCII bytes in 7-bit str objects.
Do you have an example of such string methods?
6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str and pure ASCII 8-bit str, and raises on anything else. (Sorry, no, ISO 8859-1 does *not* get passed through without exception.)
7. On output other codecs raise on a 7-bit str, unless the surrogateescape handler is in use.
What do you mean by "on output"? Do you mean when encoding?
This concerns me:
b'abcd'.decode('ascii').encode('latin-1') => returns b'abcd'
b'abcd'.decode('ascii-compatible').encode('latin-1') => raises
And yet, the two 'abcd' strings you get are visually indistinguishable, and only differ by a hidden, internal flag.
I've probably misunderstood something about your proposal, so please explain where I've gone wrong. Please give examples!
-- Steven
On 7 Jan 2014 23:45, "Steven D'Aprano"
I haven't been following the discussion in detail (linux.conf.au and the Py3 discussions have most of my attention this week), but I'm definitely not clear on how this 7-bit proposal differs meaningfully from just using ascii with the surrogateescape error handler. Cheers, Nick.
Nick Coghlan writes:
I haven't been following the discussion in detail (linux.conf.au and the Py3 discussions have most of my attention this week), but I'm definitely not clear on how this 7-bit proposal differs meaningfully from just using ascii with the surrogateescape error handler. Cheers, Nick.
It doesn't differ meaningfully to me. I doubt I'll be writing any programs in the near future that aren't just as well and efficiently done by decoding as ascii with surrogateescape. It does give you an 8-bit representation, with the benefits that gives you (very fast encode and fast decode), whereas the ascii + surrogateescape approach gives you a 16-bit representation sometimes. Some people seem to care about that, eg, it seems to fit the chunked HTTP use-case perfectly. It gives you an 8-bit almost-bytes type without the b prefix on literals. I don't know if that would actually be useful to anybody. Finally (and again, I haven't thought this through) you have a halfway house that can in principle be mixed more or less freely with either bytes (and bytearray and memoryview) or Unicode, but not with both. (There is intentionally no way to get back to "ascii-compatible" representation from one of the other str representations, and in the same way combining with one of the bytes types would give a bytes type.) I realize this probably doesn't work without modification because as designed it *is* str and the type system wouldn't be able to distinguish between the ascii-compatible representation and a str in another representation. So maybe this would bring us back to the idea of a new bytestring type. I'll get back to Steven's post later, but it and others seem to be stuck in the greylist. (Hate spam, hate spam, hate what spam does to us....)
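The space argument above can be observed on current CPython, as a rough sketch. It assumes CPython's PEP 393 layout, where a str holding a lone surrogate needs two bytes per character while a Latin-1 str needs one; the `payload` value is made up for illustration, and the exact sizes are an implementation detail.

```python
import sys

# Hypothetical wire data: mostly ASCII with one non-ASCII byte per line.
payload = b'GET /caf\xe9 HTTP/1.1\r\n' * 100

# What exists today: non-ASCII bytes become lone surrogates (U+DC80..U+DCFF).
escaped = payload.decode('ascii', errors='surrogateescape')
# For size comparison only: Latin-1 keeps one byte per character,
# but it reinterprets 0xE9 as the character U+00E9.
latin1 = payload.decode('latin-1')

# Same number of code points either way.
assert len(escaped) == len(latin1) == len(payload)

# CPython detail: the surrogate-escaped str is stored in the wider form.
print(sys.getsizeof(escaped), sys.getsizeof(latin1))
```

The proposed representation would, in effect, give the first string the footprint of the second without changing its meaning.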
I think Stephen's name "7-bit" is confusing people. If you try to interpret the name sensibly, you get Steven's broken interpretation. But if you read it as a nonsense word and work through the logic, it all makes sense.
On Jan 7, 2014, at 7:44, Steven D'Aprano
On Tue, Jan 07, 2014 at 03:37:36AM +0900, Stephen J. Turnbull wrote:
So ... now that we have the flexible string representation (PEP 393), let's add a 7-bit representation! (Don't take that too seriously, there are interesting more general variants I'm not going to talk about tonight.)
The 7-bit representation satisfies the following requirements:
1. It is only produced on input by a new 'ascii-compatible' codec, which sets the "7-bit representation" flag in the str object on input if it encounters any non-ASCII bytes (if pure ASCII, it produces an 8-bit str object). This will be slower than just reading in the bytes in many cases, but I hope not unacceptably so.
I'm confused by your suggestion here. It seems to me that you've got the conditions backwards. (Or I don't understand them.) Perhaps a couple of examples will make it clear.
Suppose we take a pure-ASCII byte-string and decode it:
b'abcd'.decode('ascii-compatible')
According to the above, this will produce a regular str object, 'abcd', using the regular 8-bit internal representation, and the "7-bit repr" flag cleared. Correct? (So the flag is *cleared* when all the chars in the string are 7-bit, and *set* when at least one is not. Yes?)
Correct. The floobl representation is not used because there are no non-ASCII bytes.
Suppose we take a byte-string with a non-ASCII byte:
b'abc\xFF'.decode('ascii-compatible')
This will return... what? I think it returns a so-called 7-bit representation, but I'm not sure what it is a representation of.
The representation is the bytes 61 62 63 FF with the floobl flag set. It's a representation of an 'a' char, a 'b' char, a 'c' char, and a smuggled FF byte--identical to 'abc\uDCFF'. (This last bit is the part I'm a bit wary of, as it promotes surrogate-escape to being an inherent part of the meaning of Unicode strings in Python. But maybe Stephen has an answer for that. And anyway, it's a much smaller problem than the one you think is there.)
I presume the internals will actually contain the four bytes
61 62 63 FF
and the "7-bit repr" flag will be set. Is that flag the only difference between these two strings?
b'abc\xFF'.decode('ascii-compatible') 'abc\xFF'
The floobl flag is the only difference between the two internal representations, but there's a big difference in the meaning.
Presumably they will compare equal, yes?
I would hope not. One of them has the Unicode character U+FF, the other has smuggled byte 0xFF, so they'd better not compare equal. However, the latter should compare equal to 'abc\uDCFF'. That's the entire key here: the new representation is nothing but a more compact way to represent strings that contain nothing but ASCII and surrogate escapes.
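The comparison described here can be checked against what Python already does, as a small sketch. It uses the existing 'ascii' codec with the surrogateescape handler as a stand-in for the proposed 'ascii-compatible' codec, which does not exist; only the internal storage would differ under the proposal.

```python
# Stand-in for the proposed codec: ASCII plus surrogate-escaped bytes.
smuggled = b'abc\xff'.decode('ascii', errors='surrogateescape')
# A different string: 0xFF interpreted as the character U+00FF.
latin1 = b'abc\xff'.decode('latin-1')

assert smuggled == 'abc\udcff'   # smuggled byte 0xFF
assert latin1 == 'abc\xff'       # real character U+00FF
assert smuggled != latin1        # so they must not compare equal

# The smuggled byte survives a round trip back to bytes.
assert smuggled.encode('ascii', errors='surrogateescape') == b'abc\xff'
```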
2. When sliced, the result needs to be checked for non-ASCII bytes. If none, the result is promoted to 8-bit.
3. When combined with a str in 8-bit representation:
a. If the 8-bit str contains any Latin-1 or C1 characters, both strs are promoted to 16-bit, and non-ASCII characters in the 7-bit string are converted by the surrogateescape handler.
b. Otherwise they're combined into a 7-bit str.
A concrete example:
s = b'abcd'.decode('ascii-compatible')
t = 'x' # ASCII-compatible
s + t => returns 'abcdx', with the "7-bit repr" flag cleared.
Right. Here both s and t are normal 8-bit strings reprs in the first place, so the new logic doesn't even get invoked. So yes, that's what it returns.
s = b'abcd'.decode('ascii-compatible')
t = 'ÿ' # U+00FF, non-ASCII.
s + t => returns 'abcd\uDCFF', with the "7-bit repr" flag set
No, you've missed two key bits here. First, you're again adding two regular 8-bit-repr strings, not a non-ASCII-smuggling string plus an 8-bit, so the new logic doesn't get invoked at all. Plus, even if s were a 7-bit-flagged string like b'ab\xfe'.decode('ascii-compatible'), that wouldn't turn t into \uDCFF. Only bytes in the floobl-flagged string are surrogate-escaped; characters in the normal string are handled normally. So you'd have 'ab\uDCFE\xFF'. Also, both strings are promoted to 16-bit, and the floobl flag is never set with 16-bit or 32-bit representations.
The \uDCFF at the end is the ÿ encoded with the surrogateescape error handler.
There's a problem with this: two strings, visually indistinguishable, but differing only in the internal representation, give completely different results:
b'abcd'.decode('ascii') + 'ÿ' => 'abcd\u00FF'
b'abcd'.decode('ascii-compatible') + 'ÿ' => 'abcd\uDCFF'
Nope, again, these both give the first result.
4. When combined with a str in 16-bit or 32-bit representation, the 7-bit string is "decoded" to the same representation, as if using the 'ascii' codec with the 'surrogateescape' handler.
Another example:
s = b'abcd'.decode('ascii-compatible')
assert s == 'abcd'
s + 'π' => returns what?
'abcdπ'. Since the first one is a plain 8-bit string, and the second a plain 16-bit string, the new logic never even gets involved. And again, if you change this so s is b'abc\xFE'.decode('ascii-compatible'), then you're adding a floobl string and a 16-bit string, so the FE byte gets encoded as DCFE, while the pi character is left unchanged, so you get 'abc\uDCFEπ'.
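The concatenation results given in this reply match what today's surrogateescape handler produces, which is the claimed semantics of the new representation. A minimal sketch, where `smuggle` is a hypothetical helper standing in for decoding with the proposed codec:

```python
def smuggle(data):
    """Hypothetical stand-in for data.decode('ascii-compatible')."""
    return data.decode('ascii', errors='surrogateescape')

plain = smuggle(b'abcd')       # pure ASCII: an ordinary str
flagged = smuggle(b'abc\xfe')  # contains one smuggled byte

# ASCII-only input combines like any other str.
assert plain + 'π' == 'abcdπ'
# The smuggled byte becomes U+DCFE; the pi character is untouched.
assert flagged + 'π' == 'abc\udcfeπ'
# A real U+00FF in the other operand stays U+00FF, not U+DCFF.
assert smuggle(b'ab\xfe') + 'ÿ' == 'ab\udcfe\xff'
```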
Your description confuses me. The "7-bit string" is already text, how do you decode it to the 16-bit internal representation?
By decoding its representation as if it were bytes, using surrogate-escape.
5. String methods that would raise or produce undefined results if used on str containing surrogate-encoded bytes need to be taught to do the same on non-ASCII bytes in 7-bit str objects.
Do you have an example of such string methods?
6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str and pure ASCII 8-bit str, and raises on anything else. (Sorry, no, ISO 8859-1 does *not* get passed through without exception.)
7. On output other codecs raise on a 7-bit str, unless the surrogateescape handler is in use.
What do you mean by "on output"? Do you mean when encoding?
Presumably "output" means something like writing to a TextIOWrapper whose codec is ascii-compatible. In which case you're right, it would be clearer to just say "when encoding". However, I think there's a mistake in the design of 6 here. Surely encoding 'abc\uDCFF' should give you the bytes 61 62 63 FF, not an exception, right? (Unless the idea is that such a string is guaranteed to have a floobl-flagged 8-bit representation, not a 16-bit one, no matter how you try to create it in Python or in C, and I don't think the other rules make that guarantee.)
This concerns me:
b'abcd'.decode('ascii').encode('latin-1') => returns b'abcd'
b'abcd'.decode('ascii-compatible').encode('latin-1') => raises
Nope. The decoding returns the string 'abcd', in normal 8-bit representation, in both cases. There are no non-ASCII bytes, so the floobl flag isn't set. So you get the same result either way.
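Rule 7 and this answer can be illustrated with existing behaviour, again using 'ascii' plus surrogateescape as an assumed stand-in for the proposed codec. Pure-ASCII text encodes under any codec; only a string that actually carries a smuggled byte makes other codecs raise, unless the handler is given.

```python
# Pure ASCII: no smuggled bytes, so latin-1 encoding just works.
text = b'abcd'.decode('ascii', errors='surrogateescape')
assert text.encode('latin-1') == b'abcd'

# With a smuggled byte, a strict encode by another codec raises...
smuggled = b'abc\xff'.decode('ascii', errors='surrogateescape')
try:
    smuggled.encode('latin-1')
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# ...unless the surrogateescape handler is in use.
unescaped = smuggled.encode('latin-1', errors='surrogateescape')
assert unescaped == b'abc\xff'
```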
And yet, the two 'abcd' strings you get are visually indistinguishable, and only differ by a hidden, internal flag.
I've probably misunderstood something about your proposal, so please explain where I've gone wrong. Please give examples!
-- Steven
I'm responding here rather than directly to Steven because Andrew explains it as well as I could. In all cases where I don't comment, Andrew is 100% correct as to my intended semantics. The critical point is just that in cases where "the ASCII characters are themselves" and an 8-bit representation is theoretically possible, an 8-bit representation is used. More precisely, if the identities of 128-255 as characters are not important to the programmer, these bytes are not interpreted as characters, in the same way that surrogate-escaped bytes are uninterpreted in the current representation.
Andrew Barnert writes:
I think Stephen's name "7-bit" is confusing people.
Indeed, and I apologize for confusing Steven in particular, which is entirely due to that poor choice.
If you try to interpret the name sensibly, you get Steven's broken interpretation. But if you read it as a nonsense word and work through the logic, it all makes sense.
Maybe "ascii-compatible" is better. It's a union type, including all encodings where octets 0-127 receive the standard mapping to the ASCII characters, but octets 128-255 are ambiguous.
Suppose we take a byte-string with a non-ASCII byte:
b'abc\xFF'.decode('ascii-compatible')
This will return... what? I think it returns a so-called 7-bit representation, but I'm not sure what it is a representation of.
The representation is the bytes 61 62 63 FF with the floobl flag set. It's a representation of an 'a' char, a 'b' char, a 'c' char, and a smuggled FF byte--identical to 'abc\uDCFF'.
Except that it's an 8-bit representation invisible to Python except for maybe the timeit package, yes.
(This last bit is the part I'm a bit wary of, as it promotes surrogate-escape to being an inherent part of the meaning of Unicode strings in Python.
They're already part of the inherent meaning of Unicode strings. The alternative is to read ASCII-compatible streams as latin1, which *changes their meaning*.
Your description confuses me. The "7-bit string" is already text, how do you decode it to the 16-bit internal representation?
By decoding its representation as if it were bytes, using surrogate-escape.
Strictly speaking, it's not a "decoding", it's a change of internal representation.
5. String methods that would raise or produce undefined results if used on str containing surrogate-encoded bytes need to be taught to do the same on non-ASCII bytes in 7-bit str objects.
Do you have an example of such string methods?
No, I don't, but I imagined there might be some. (My original example was case conversion, but that doesn't work because Python doesn't check for whether something is actually a code point that can be a character, even -- it just notices that surrogate-encoded bytes don't have alternative cases in the database and passes them through.)
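The case-conversion observation is easy to confirm on a surrogate-escaped string, as a short sketch using the existing handler rather than the proposed codec:

```python
smuggled = b'abc\xff'.decode('ascii', errors='surrogateescape')

# The lone surrogate has no case mapping, so upper() passes it through.
upper = smuggled.upper()
assert upper == 'ABC\udcff'

# The smuggled byte is still recoverable after the string method ran.
assert upper.encode('ascii', errors='surrogateescape') == b'ABC\xff'
```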
7. On output other codecs raise on a 7-bit str, unless the surrogateescape handler is in use.
What do you mean by "on output"? Do you mean when encoding?
Yes. You (all, but Steven in particular) have my apology for the imprecision.
However, I think there's a mistake in the design of 6 here. Surely encoding 'abc\uDCFF' should give you the bytes 61 62 63 FF, not an exception, right? (Unless the idea is that such a string is guaranteed to have a floobl-flagged 8-bit representation, not a 16-bit one, no matter how you try to create it in Python or in C, and I don't think the other rules make that guarantee.)
Andrew is correct, that is a mistake in design. I thought an 8-bit representation was guaranteed in that case, with the "floobl" flag set. I think that Andrew's idea is correct, but this miss makes me nervous about the coherence of the concept.
I think there are three problems with your proposal--all of which I mentioned in the long reply to Steven, but I suspect many people tl;dr'd over that, and I like your proposal enough that I want to make sure either I'm wrong, or you fix them. So:
On Jan 6, 2014, at 10:37, "Stephen J. Turnbull"
So ... now that we have the flexible string representation (PEP 393), let's add a 7-bit representation!
The name has confused both Steven and Nick into misinterpreting the idea, and it confused me until I read over the details twice and it finally clicked, and it still doesn't make sense after I understand what you mean. This is an 8-bit representation where non-ASCII bytes are used to smuggle non-ASCII bytes. Just like the existing 16-bit representation where surrogate escapes are used to smuggle non-ASCII bytes. It's not a 7-bit representation unless there's nothing but ASCII in it--and it's never used in the case where there's nothing but ASCII. I'm not sure what the right word is, but this isn't it.
1. It is only produced on input by a new 'ascii-compatible' codec,
This name might also be confusing people.
3. When combined with a str in 8-bit representation:
a. If the 8-bit str contains any Latin-1 or C1 characters, both strs are promoted to 16-bit, and non-ASCII characters in the 7-bit string are converted by the surrogateescape handler.
This part worries me a bit. The bytes 61 62 63 FF in this new representation actually _mean_ 'abc' followed by a smuggled FF byte. But the words 0061 0062 0063 DCFF in a 16-bit representation just mean 'abc\uDCFF', which _can be interpreted_, via the surrogate-escape mechanism, as 'abc' and a smuggled byte, but don't actually _mean_ that. It seems like your proposal only works if we change it so that they really _do_ mean that.
6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str and pure ASCII 8-bit str, and raises on anything else.
So if a 7-bit string gets converted to a surrogate-escaped 16-bit string, it can never be written out again? For a contrived example:
(b'abc\xff'.decode('ascii-compatible') + '\u1234')[:4].encode('ascii-compatible')
I'd expect to get back my b'abc\xff'. But your rules give me an exception. Maybe you were expecting this to be taken care of in the slicing, but rule 1 makes that impossible; you can never get a 7-bit string by doing anything but decoding ascii-compatible (or combining two 7-bit strings).
I think ascii-compatible has to accept non-8-bit-repr strings (by encoding ASCII as ASCII and surrogate escapes as bytes and everything else is an exception). This is necessary because 61 62 63 FF (7-bit) and 0061 0062 0063 DCFF (16-bit) are the same string anyway. But it's especially necessary because the former can be silently converted into the latter (and there's no way to even test whether that's happened).
Of course that means biting the bullet and saying that \uDCFF in python really means a smuggled FF byte, rather than just being a way to smuggle an FF byte through Unicode if you want to do so explicitly. But as I said above, I think you've already bitten that bullet.
Andrew Barnert writes:
a. If the 8-bit str contains any Latin-1 or C1 characters, both strs are promoted to 16-bit, and non-ASCII characters in the 7-bit string are converted by the surrogateescape handler.
This part worries me a bit. The bytes 61 62 63 FF in this new representation actually _mean_ 'abc' followed by a smuggled FF byte.
No, it doesn't. It means 'abc' followed by something that cannot be encoded by any codec without the surrogateescape handler. 'ascii-compatible' merely defaults to that handler. I wouldn't actually be too upset if I were told, no, you have to specify explicitly.
6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str and pure ASCII 8-bit str, and raises on anything else.
So if a 7-bit string gets converted to a surrogate-escaped 16-bit string, it can never be written out again?
Of course it can. Use .encode('ascii', errors='surrogateescape')
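Applied to the contrived example, that suggestion looks like the following sketch. It swaps the proposed 'ascii-compatible' codec for the existing 'ascii' codec with surrogateescape on both the decode and the encode side.

```python
# Decode with a smuggled byte, then mix in a 16-bit character.
mixed = b'abc\xff'.decode('ascii', errors='surrogateescape') + '\u1234'

# Slicing drops U+1234, leaving only ASCII and the escaped byte.
head = mixed[:4]
assert head == 'abc\udcff'
assert head.encode('ascii', errors='surrogateescape') == b'abc\xff'

# The unsliced string still cannot be written out: U+1234 is neither
# ASCII nor an escaped byte, so the handler refuses it.
try:
    mixed.encode('ascii', errors='surrogateescape')
except UnicodeEncodeError:
    whole_failed = True
else:
    whole_failed = False
```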
(b'abc\xff'.decode('ascii-compatible') + '\u1234')[:4].encode('ascii-compatible')
I'd expect to get back my b'abc\xff'. But your rules give me an exception.
Yes. This whole proposal was aimed at wire protocols. It's very bad if something intended to be ready to be squirted into the wire needs (expensive) encoding.
I think ascii-compatible has to accept non-8-bit-repr strings (by encoding ASCII as ASCII and surrogate escapes as bytes and everything else is an exception). This is necessary because 61 62 63 FF (7-bit) and 0061 0062 0063 DCFF (16-bit) are the same string anyway. But it's especially necessary because the former can be silently converted into the latter (and there's no way to even test whether that's happened).
Well, one way around that would be to require that the latter not exist (convert it to "7-bit" during construction). But I've come to the conclusion that this is all too irregular and confusing. I'm pretty sure that I can come up with a set of rules that are not inherently self-contradictory, but I'm also pretty sure that the resulting type will behave unintuitively for almost everybody. Also, despite my original thought, it's really hard to see how unnecessary encode/decode cycles can be eliminated. So I think I need to go back to the drawing board. So I hope I haven't wasted too much of your time; it's been very educational for me.
Stephen J. Turnbull wrote:
No, it doesn't. It means 'abc' followed by something that cannot be encoded by any codec without the surrogateescape handler. 'ascii-compatible' merely defaults to that handler. I wouldn't actually be too upset if I were told, no, you have to specify explicitly.
If I understand correctly, your intention is that 61 62 63 FF in this representation would simply be a more compact version of 0061 0062 0063 DCFF, with exactly the same semantics. If that's right, then maybe something like "compressed surrogateescape" or "8-bit surrogateescape" would be a better name for it? Also, it could be produced automatically where possible by any decoding operation that specified surrogateescape -- there wouldn't have to be a dedicated encoding name for it (although there could be for convenience). It could also potentially be produced by any slicing or other string operations that resulted in characters within the appropriate ranges, just like any of the other internal representations. -- Greg
Greg Ewing writes:
If I understand correctly, your intention is that 61 62 63 FF in this representation would simply be a more compact version of 0061 0062 0063 DCFF, with exactly the same semantics.
Pretty much so. There remain some ambiguities and questions about efficient implementability in my mind.
If that's right, then maybe something like "compressed surrogateescape" or "8-bit surrogateescape" would be a better name for it?
Maybe. Thanks for the suggestion! However, as I mentioned already I'm going to back off on this for a while, because in the process of analyzing Inada-san's use case I realized that by itself it doesn't save much besides space, and isn't pretty to boot.
Geert Jansen writes:
One use case I came across was when creating chunks for the HTTP chunked encoding. Chunks contain an ASCII header, a raw/encoded chunk body, and an ASCII trailer. Using a bytes.format, it would look like this:
chunk = '{0:X}\r\n{1}\r\n'.format(len(buf), buf)
You forgot the b prefix.
This is what I am using now:
chunk = bytearray()
chunk.extend('{0:X}\r\n'.format(len(buf)).encode('ascii'))
chunk.extend(buf)
chunk.extend('\r\n'.encode('ascii'))
Either of those is a big win compared to this?
# OK, we'd want efficient definition of a bunch of these,
# which is a cost.
def itox(n):
    return '{0:X}'.format(n).encode('ascii')
chunk = b'\r\n'.join([itox(len(buf)), buf, b''])
But see my response to Andrew, also.
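For comparison, the two working approaches from this exchange can be run side by side. This is only a sketch: the function names `chunk_bytearray` and `chunk_join` and the sample `buf` are made up here, and the code covers a single chunk, not the full chunked encoding.

```python
def chunk_bytearray(buf):
    """Geert's current approach: build the chunk piece by piece."""
    chunk = bytearray()
    chunk.extend('{0:X}\r\n'.format(len(buf)).encode('ascii'))
    chunk.extend(buf)
    chunk.extend(b'\r\n')
    return bytes(chunk)

def chunk_join(buf):
    """Stephen's alternative: encode the size, then join with CRLF."""
    size = '{0:X}'.format(len(buf)).encode('ascii')
    return b'\r\n'.join([size, buf, b''])

# A 26-byte body of arbitrary non-ASCII bytes; 26 is 1A in hex.
buf = b'\xff' * 26
assert chunk_bytearray(buf) == chunk_join(buf) == b'1A\r\n' + buf + b'\r\n'
```

Both only ever encode the short ASCII size field; the body is copied through untouched.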
On 2014-01-06, at 11:57 , Stephen J. Turnbull
Geert Jansen writes:
I'm not missing a new type, but I am missing the format method on the binary types.
I'm curious about precisely what your use cases are, and just what formatting they need.
Building up protocol output, especially (but not solely) ascii-based ones, from existing or computed parts. Basically the same reasons behind Erlang's bit syntax (on the building side thereof): http://www.erlang.org/doc/programming_examples/bit_syntax.html Essentially a partial and more readable (especially more readable) version of what `struct` provides, and one in which the "pattern" can contain literal constant content. `struct` is nice, but it doesn't scale very well to big binary creation, and it's fairly horrible when part of the output is constant as constant parts *still* have to be patterned and injected as parameters. Also, no support for keyword arguments.
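The complaint about constant parts can be seen in a small `struct` sketch. The record layout here (a two-byte magic, a 16-bit version, a 32-bit length) is invented purely for illustration.

```python
import struct

# The constant b'OK' cannot appear literally in the format string;
# it has to be declared as '2s' and passed in as an argument.
MAGIC = b'OK'
header = struct.pack('!2sHI', MAGIC, 7, 1024)
assert header == b'OK\x00\x07\x00\x00\x04\x00'

# Unpacking hands the constant back as just another field.
magic, version, length = struct.unpack('!2sHI', header)
assert (magic, version, length) == (b'OK', 7, 1024)
```

Arguments are positional only, which is the "no support for keyword arguments" point above.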
From: Geert Jansen
I'm not missing a new type, but I am missing the format method on the binary types.
I miss that too, but it's a bit tricky. '{}'.format(x) calls str(x). b'{}'.format(x) can't call bytes(x). At least not unless you want b'#{}'.format(6) to give you b'#\0\0\0\0\0\0'. Besides, most types don't provide a __bytes__, so even if it weren't for this problem, it wouldn't really be useful for anything except inserting bytes into other bytes. So, what _should_ it call? You could add encoding and errors keyword parameters (defaulting to 'ascii' and 'strict'), so b'{}'.format(x, encoding='utf-8') calls str(x).encode('utf-8'), which solves all of those problems… except that now it means you can't stick bytes objects into bytes formats, which is even worse. You could solve that by making objects that support the buffer protocol (like bytes) copy as-is instead of going through str and encode. That would mean you can't use bytes with a placeholder with any format flags, but maybe that's a good thing anyway (e.g., do you really want b'{:3}'.format(b'\xc3\xa9') to only pad to 2 characters instead of 3 because it's a 2-byte character?). That would be enough to let you cram pre-encoded/formatted bytes, and things like numbers, into bytes formats made up of ASCII headers, which I think is 90% of what people want here. Does that seem worth pursuing?
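The semantics floated in this message can be sketched as a plain function. Everything here is hypothetical: `format_bytes` is not a real API, it handles only bare `{}` placeholders with no format flags, and it checks for a few concrete bytes-like types instead of the buffer protocol in general.

```python
def format_bytes(template, *args, encoding='ascii', errors='strict'):
    """Hypothetical sketch of b'...'.format(); bare b'{}' placeholders only."""
    literals = template.split(b'{}')
    if len(literals) - 1 != len(args):
        raise ValueError('placeholder count does not match argument count')
    out = bytearray(literals[0])
    for arg, literal in zip(args, literals[1:]):
        if isinstance(arg, (bytes, bytearray, memoryview)):
            out += arg  # bytes-like objects are copied as-is
        else:
            out += str(arg).encode(encoding, errors)  # everything else via str
        out += literal
    return bytes(out)

# Why bytes(x) is the wrong hook: bytes(6) is six NUL bytes.
assert bytes(6) == b'\x00' * 6
# Going through str() and encode() gives the expected digit instead.
assert format_bytes(b'#{}', 6) == b'#6'

# An ASCII header around a pre-encoded body, as in the chunked case.
body = b'\xff' * 26
chunk = format_bytes(b'{}\r\n{}\r\n', format(len(body), 'X'), body)
assert chunk == b'1A\r\n' + body + b'\r\n'
```

With strict ASCII as the default, a non-ASCII str argument raises unless the caller names an encoding.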
On Mon, Jan 6, 2014 at 12:34 PM, Andrew Barnert
b'{}'.format(x) can't call bytes(x). At least not unless you want b'#{}'.format(6) to give you b'#\0\0\0\0\0\0'. Besides, most types don't provide a __bytes__, so even if it weren't for this problem, it wouldn't really be useful for anything except inserting bytes into other bytes. So, what _should_ it call?
You could add encoding and errors keyword parameters (defaulting to 'ascii' and 'strict'), so b'{}'.format(x, encoding='utf-8') calls str(x).encode('utf-8'), which solves all of those problems… except that now it means you can't stick bytes objects into bytes formats, which is even worse.
You could solve that by making objects that support the buffer protocol (like bytes) copy as-is instead of going through str and encode. That would mean you can't use bytes with a placeholder with any format flags, but maybe that's a good thing anyway (e.g., do you really want b'{:3}'.format(b'\xc3\xa9') to only pad to 2 characters instead of 3 because it's a 2-byte character?).
That would be enough to let you cram pre-encoded/formatted bytes, and things like numbers, into bytes formats made up of ASCII headers, which I think is 90% of what people want here. Does that seem worth pursuing?
Agreed that probably the main case is inserting bytes objects verbatim in a message with a small ASCII header and possibly trailer. Format flags are useful, e.g. with chunked HTTP encoding you need to insert the length in hex. But if those are only available for non-bytes objects that'd probably be fine. I'm not too familiar with the implementation of format() so I can't say much about it. Regards, Geert
On 06/01/2014 08:28, Geert Jansen wrote:
On Sun, Jan 5, 2014 at 8:33 PM, Ethan Furman wrote:
As anyone who has worked with Python 3 and low-level protocols knows, Python 3 has no 'bytestring' type. It has immutable and mutable versions of arrays of integers, otherwise known as 'bytes' and 'bytearray'.
How many would be interested in having a 'bytestring'?
I'm not missing a new type, but I am missing the format method on the binary types.
Regards, Geert
Is this what the new PEP 460 is aimed at or am I again barking in the wrong forest? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On 6 Jan 2014 21:58, "Mark Lawrence"
Is this what the new PEP 460 is aimed at or am I again barking in the wrong forest?
Yep, parallel discussions. Cheers, Nick.
On Mon, Jan 6, 2014 at 7:53 PM, Mark Janssen
How many would be interested in having a 'bytestring'?
I'm not missing a new type, but I am missing the format method on the binary types.
Wouldn't a type "cast" like TextFile(bytestring) be sufficient?
Unless I'm missing something, no. For the use case described the result needs to be a bytes object. Regards, Geert
participants (16)
- Amber Yust
- Andrew Barnert
- Antoine Pitrou
- Cameron Simpson
- Ethan Furman
- Geert Jansen
- Greg Ewing
- INADA Naoki
- Mark Janssen
- Mark Lawrence
- Masklinn
- Ned Batchelder
- Nick Coghlan
- Stephen J. Turnbull
- Steven D'Aprano
- Łukasz Langa