Adding 'bytes' as alias for 'latin_1' codec.
Hi, all. There are situations where I want to use bytes as a string in the real world. (I use 'bstr' for bytes-as-a-string below.) Sadly, Python 3's bytes is not a bytestring. For example, when I want to make a 'cat -n' that is transparent to encoding, Python 3 doesn't permit b'{0:6d}'.format(n), and '{0:6d}'.format(n).encode('ascii') is a circuitous way to meet a simple requirement. I think the best way to handle such situations in Python 3 is using the 'latin1' codec. For example, an encoding-transparent 'cat -n' is:

    import sys

    fin = open(sys.stdin.fileno(), 'r', encoding='latin1')
    fout = open(sys.stdout.fileno(), 'w', encoding='latin1')
    for n, L in enumerate(fin):
        fout.write('{0:5d}\t{1}'.format(n, L))

If using 'latin1' is the Pythonic way to handle encoding-transparent strings, I think Python should provide another alias like 'bytes'. Any thoughts?

-- INADA Naoki <songofacandy@gmail.com>
On 5/25/2011 1:29 PM, INADA Naoki wrote:
Sadly, Python 3's bytes is not bytestring.
By intention.
    import sys

    fin = open(sys.stdin.fileno(), 'r', encoding='latin1')
    fout = open(sys.stdout.fileno(), 'w', encoding='latin1')
    for n, L in enumerate(fin):
        fout.write('{0:5d}\t{1}'.format(n, L))
If using 'latin1' is Pythonic way to handle encoding transparent string, I think Python should provide another alias like 'bytes'.
I presume that you mean you would like to write

    fin = open(sys.stdin.fileno(), 'r', encoding='bytes')
    fout = open(sys.stdout.fileno(), 'w', encoding='bytes')

If such a thing were added, the 256 bytes should directly map to the first 256 codepoints. I don't know if 'latin1' does that or not. In any case, one can rewrite the above without decoding input lines.

    with open('tem.py', 'rb') as fin, open('tem2.txt', 'wb') as fout:
        for n, L in enumerate(fin):
            fout.write('{0:5d}\t'.format(n).encode('ascii'))
            fout.write(L)

(sys.x.fileno raises AttributeError in IDLE.)

-- Terry Jan Reedy
On Thu, May 26, 2011 at 10:58 AM, Terry Reedy <tjreedy@udel.edu> wrote:
On 5/25/2011 1:29 PM, INADA Naoki wrote:
Sadly, Python 3's bytes is not bytestring.
By intention.
Yes, I know. But I feel sad because it causes much confusion. Bytes supports some string methods.
b"foo".capitalize() # Oh, b'Foo' b"foo".isalpha() # alphabets in not-string? True b"foo%d" % 3 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unsupported operand type(s) for %: 'bytes' and 'int'
    import sys

    fin = open(sys.stdin.fileno(), 'r', encoding='latin1')
    fout = open(sys.stdout.fileno(), 'w', encoding='latin1')
    for n, L in enumerate(fin):
        fout.write('{0:5d}\t{1}'.format(n, L))
If using 'latin1' is Pythonic way to handle encoding transparent string, I think Python should provide another alias like 'bytes'.
I presume that you mean you would like to write

    fin = open(sys.stdin.fileno(), 'r', encoding='bytes')
    fout = open(sys.stdout.fileno(), 'w', encoding='bytes')
If such a thing were added, the 256 bytes should directly map to the first 256 codepoints. I don't know if 'latin1' does that or not. In any case,
Yes, 'latin1' directly maps 256 bytes to 256 codepoints.
one can rewrite the above without decoding input lines.
    with open('tem.py', 'rb') as fin, open('tem2.txt', 'wb') as fout:
        for n, L in enumerate(fin):
            fout.write('{0:5d}\t'.format(n).encode('ascii'))
            fout.write(L)
(sys.x.fileno raises AttributeError in IDLE.)
There are two problems.

1) Binary mode doesn't support line buffering, so I would have to disable buffering, which may cause a performance regression.
2) Requiring .encode('ascii') is less attractive when using Python as a scripting language on Unix.

But the latin1 approach has performance and memory-usage disadvantages. I think Python 3 doesn't provide an easy and efficient way to implement an encoding-transparent command like 'cat -n'. It's very sad.

-- INADA Naoki <songofacandy@gmail.com>
On 5/25/2011 10:57 PM, INADA Naoki wrote:
Bytes supports some string methods.

As exactly specified in 4.6.5, Bytes and Byte Array Methods. There is really no need to repeat what everyone reading this knows.
I wrote
    with open('tem.py', 'rb') as fin, open('tem2.txt', 'wb') as fout:
        for n, L in enumerate(fin):
            fout.write('{0:5d}\t'.format(n).encode('ascii'))
            fout.write(L)
(sys.x.fileno raises AttributeError in IDLE.)
There are 2 problems.
1) Binary mode doesn't support line buffering, so I would have to disable buffering, which may cause a performance regression.
*nix already has a C-coded cat command; Windows has copy commands. So there is no need to design Python for this. Cat is usually used with files rather than terminals and screens. When it is used with terminals and screens, the extra encode/decode does not matter. Realistic Python programs that actually do something with the text need to decode with the actual encoding, regardless of the byte source. So I do not think we need a bytes alias for latin_1. The docs might mention that it is essentially a do-nothing codec.

-- Terry Jan Reedy
On Thu, May 26, 2011 at 3:29 AM, INADA Naoki <songofacandy@gmail.com> wrote:
There are situations where I want to use bytes as a string in the real world.
Breaking the bytes-are-text mental model is something we deliberately set out to do with Python 3 (because it is wrong). In today's global environment, programmers *need* to learn about text encoding issues, as treating bytes as text without finding out the encoding first is a surefire way to get unintelligible mojibake. If "What does 'latin-1' mean?" is a question that gets them there, then that's fine.

You *cannot* transparently handle data in arbitrary encodings, as the meanings of the bytes change based on the encoding (this is especially true when dealing with non-ASCII-compatible encodings).

That said, decoding and reencoding via 'ascii' (strict 7-bit) or 'latin-1' (full 8-bit) is the easiest way to handle both strings and bytes input reasonably efficiently. See urllib.parse for examples of how to do that.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
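[Editor's note: a minimal sketch of the decode/reencode round-trip Nick describes. The 'latin-1' codec is lossless for arbitrary bytes, while strict 'ascii' only round-trips 7-bit data.]

```python
# latin-1 maps each of the 256 byte values to one code point, so a
# decode/encode cycle reproduces arbitrary bytes exactly.
data = bytes(range(256))            # every possible byte value
text = data.decode('latin-1')       # lossless "fake" decode
assert text.encode('latin-1') == data

# strict 'ascii' only round-trips 7-bit data:
assert b'hello'.decode('ascii').encode('ascii') == b'hello'
try:
    b'\xff'.decode('ascii')
except UnicodeDecodeError:
    pass                            # non-ASCII bytes are rejected
```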
Terry Reedy, 26.05.2011 03:58:
If such a thing were added, the 256 bytes should directly map to the first 256 codepoints. I don't know if 'latin1' does that or not.
Yes, Unicode was specifically designed to support that. The first 128 code points are identical with the ASCII encoding, the first 256 code points are identical with the Latin-1 encoding. See also PEP 393, which exploits this feature. http://www.python.org/dev/peps/pep-0393/ That being said, I don't see the point of aliasing "latin-1" to "bytes" in the codecs. That sounds confusing to me. Stefan
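[Editor's note: Stefan's point about the first 256 code points can be checked directly in the interpreter.]

```python
# The first 256 Unicode code points coincide with Latin-1, so the
# byte value and the code point number are identical.
for i in range(256):
    assert chr(i).encode('latin-1') == bytes([i])
    assert bytes([i]).decode('latin-1') == chr(i)
```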
On Wed, May 25, 2011 at 11:15 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Yes, Unicode was specifically designed to support that. The first 128 code points are identical with the ASCII encoding, the first 256 code points are identical with the Latin-1 encoding.
See also PEP 393, which exploits this feature.
http://www.python.org/dev/peps/pep-0393/
That being said, I don't see the point of aliasing "latin-1" to "bytes" in the codecs. That sounds confusing to me.
"bytes" is probably the wrong name for it, but I think using some name to signal "I'm not really using this encoding, I just need to be able to pass these bytes into and out of a string without losing any bits" might be better than using "latin-1" if we're forced to take up this hack. (My gut feeling is that it would be better if we could avoid using the "latin-1" hack all together, but apparently wiser minds than me have decided we have no other choice.) Maybe we could call it "passthrough"? And we could add a documentation note that if you use "passthrough" to decode some bytes you must, must, must use it to encode them later, since the string you manipulate won't really contain unicode codepoints, just a transparent byte encoding… -- Carl
Carl M. Johnson wrote:
On Wed, May 25, 2011 at 11:15 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Yes, Unicode was specifically designed to support that. The first 128 code points are identical with the ASCII encoding, the first 256 code points are identical with the Latin-1 encoding.
See also PEP 393, which exploits this feature.
http://www.python.org/dev/peps/pep-0393/
That being said, I don't see the point of aliasing "latin-1" to "bytes" in the codecs. That sounds confusing to me.
"bytes" is probably the wrong name for it, but I think using some name to signal "I'm not really using this encoding, I just need to be able to pass these bytes into and out of a string without losing any bits" might be better than using "latin-1" if we're forced to take up this hack. (My gut feeling is that it would be better if we could avoid using the "latin-1" hack all together, but apparently wiser minds than me have decided we have no other choice.) Maybe we could call it "passthrough"? And we could add a documentation note that if you use "passthrough" to decode some bytes you must, must, must use it to encode them later, since the string you manipulate won't really contain unicode codepoints, just a transparent byte encoding…
If you really wish to carry around binary data in a Unicode object, then you should use a codec that maps the 256 code points in a byte to either a private code point area, or use a hack like the surrogateescape approach defined in PEP 383: http://www.python.org/dev/peps/pep-0383/

By using 'latin-1' you can potentially have the binary data leak into other text data of your application, or worse, have it converted to a different encoding on output, e.g. when sending the data to a UTF-8 pipe. In any case, this is bound to create hard-to-detect problems. Better to use bytes to begin with.

-- Marc-Andre Lemburg

eGenix.com Professional Python Services directly from the Source (#1, May 26 2011)
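[Editor's note: a sketch of the surrogateescape approach MAL mentions (PEP 383). Undecodable bytes are smuggled through as lone surrogates and restored exactly on re-encoding with the same error handler.]

```python
# The byte string below is not valid UTF-8 (0xE9, 0xFF, 0xFE cannot
# be decoded), so strict decoding would fail. With surrogateescape,
# each undecodable byte becomes a lone surrogate code point.
raw = b'caf\xe9 \xff\xfe'
text = raw.decode('utf-8', errors='surrogateescape')
assert raw == text.encode('utf-8', errors='surrogateescape')
# Unlike latin-1, the smuggled bytes are visibly "wrong" (surrogates),
# so they are less likely to leak silently into other text.
```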
On 2011-05-26, at 12:59 , Carl M. Johnson wrote:
On Wed, May 25, 2011 at 11:15 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Yes, Unicode was specifically designed to support that. The first 128 code points are identical with the ASCII encoding, the first 256 code points are identical with the Latin-1 encoding.
See also PEP 393, which exploits this feature.
http://www.python.org/dev/peps/pep-0393/
That being said, I don't see the point of aliasing "latin-1" to "bytes" in the codecs. That sounds confusing to me.
"bytes" is probably the wrong name for it, but I think using some name to signal "I'm not really using this encoding, I just need to be able to pass these bytes into and out of a string without losing any bits" might be better than using "latin-1" if we're forced to take up this hack. (My gut feeling is that it would be better if we could avoid using the "latin-1" hack all together, but apparently wiser minds than me have decided we have no other choice.) Maybe we could call it "passthrough"? And we could add a documentation note that if you use "passthrough" to decode some bytes you must, must, must use it to encode them later, since the string you manipulate won't really contain unicode codepoints, just a transparent byte encoding…
Considering the original use case, which seems to be mostly about being able to use .format, would it make more sense to be able to create "byte patterns", with formats similar to those of str.format but not identical (e.g. better control over layout would be nice; something similar to Erlang's bit syntax for putting binaries together)? This would be useful for putting together byte sequences from existing values, e.g. to output binary formats.
On Thu, May 26, 2011 at 9:17 PM, Masklinn <masklinn@masklinn.net> wrote:
Considering the original use case, which seems to be mostly about being able to use .format, would it make more sense to be able to create "byte patterns", with formats similar to those of str.format but not identical (e.g. better control on layout would be nice, something similar to Erlang's bit syntax for putting binaries together).
This would be useful to put together byte sequences from existing values to e.g. output binary formats.
We already have an entire module dedicated to the task of handling binary formats: http://docs.python.org/py3k/library/struct "format(n, '6d').encode('ascii')" is the right way to get the string representation of a number as ASCII bytes. However, the programmer needs to be aware that concatenating those bytes with an encoding that is not ASCII compatible (such as UTF-16, UTF-32, or many of the Asian encodings) will result in a sequence of unusable garbage. It is far, far safer to transform everything into the text domain, work with it there, then encode back when the manipulation is complete. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
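[Editor's note: runnable illustrations of both of Nick's points: the struct module for fixed binary layouts, and the format()-then-encode idiom for ASCII digits.]

```python
import struct

# struct handles fixed binary layouts: here a big-endian unsigned
# 16-bit integer followed by an unsigned byte.
packed = struct.pack('>HB', 513, 7)
assert packed == b'\x02\x01\x07'
assert struct.unpack('>HB', packed) == (513, 7)

# The idiom Nick names for getting a number as ASCII bytes:
n = 42
assert format(n, '6d').encode('ascii') == b'    42'
```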
On 2011-05-26, at 16:55 , Nick Coghlan wrote:
On Thu, May 26, 2011 at 9:17 PM, Masklinn <masklinn@masklinn.net> wrote:
Considering the original use case, which seems to be mostly about being able to use .format, would it make more sense to be able to create "byte patterns", with formats similar to those of str.format but not identical (e.g. better control on layout would be nice, something similar to Erlang's bit syntax for putting binaries together).
This would be useful to put together byte sequences from existing values to e.g. output binary formats.
We already have an entire module dedicated to the task of handling binary formats: http://docs.python.org/py3k/library/struct

Sure, but:
1. It does not matter overly much; there are many cases where this did not stop the core team from agreeing the problem was insufficiently well solved (latest instance: string formatting, the current builtin solution being predated by another builtin and at least one previous stdlib solution).
2. struct suffers from a bunch of issues:
   - it ranks low in discoverability; people who have not bit-twiddled much in C may not realize that a struct (in C) is just an interpretation pattern on a byte string, and it's advertised as an interaction between Python and C structs, not arbitrary byte patterns/building
   - struct format strings are "wonky" (in that they're nothing like those of str.format)
   - struct format strings simply can't deal with mixing literal "character bytes" and format specs, making formats with fixed ASCII structures significantly less readable
"format(n, '6d').encode('ascii')" is the right way to get the string representation of a number as ASCII bytes. However, the programmer needs to be aware that concatenating those bytes with an encoding that is not ASCII compatible (such as UTF-16, UTF-32, or many of the Asian encodings) will result in a sequence of unusable garbage. It is far, far safer to transform everything into the text domain, work with it there, then encode back when the manipulation is complete. Sure, but as you noted this is not even always done in the stdlib, why third-party developers would be expected to be in a better situation?
And between jumping through a semi-arbitrary decode/encode cycle whose semantics are completely ignored and being able to just specify a bytes pattern, which seems stranger? And I'm probably overstating its importance, but erlang seems to do rather well with its bit syntax. Which is much closer to str.format than to struct.pack (in API, in looks, in complexity, …)
On 5/26/2011 7:17 AM, Masklinn wrote:
Considering the original use case,
to prefix ASCII-encoded numbers to lines in an unknown but ASCII-compatible encoding*, and considering the responses since my last post, I have changed from -0 to -1 on the alias proposal.

1. The use case does not need the fake decoding and is better off without it.
2. I suspect the use cases where fake decoding is both needed and sufficient are relatively rare.
3. Fake decoding is dangerous (Lemburg).
4. People who know enough to use it safely should already know how latin-1 relates to unicode, and therefore do not need an alias.
5. Other people should not be encouraged to use it as a fake.

*I meant to ask earlier whether there are ASCII-incompatible encodings for which the original code and my revision would not work. I gather from the responses that yes, there are some.

-- Terry Jan Reedy
Masklinn wrote:
would it make more sense to be able to create "byte patterns", with formats similar to those of str.format but not identical (e.g. better control on layout would be nice, something similar to Erlang's bit syntax for putting binaries together).
Sounds a lot like struct.pack. Maybe struct.pack and struct.unpack could be made available as methods of bytes? I don't think this would address the OP's use case, though, because he seems to actually want a textual format whose output is encoded in ascii. -- Greg
On Fri, May 27, 2011 at 7:26 AM, Terry Reedy <tjreedy@udel.edu> wrote:
On 5/26/2011 7:17 AM, Masklinn wrote:
Considering the original use case,
to prefix ascii-encoded numbers to lines in an unknown but ascii-compatible encoding*, and considering the responses since my last post, I have changed from -0 to -1 to the alias proposal.
1. The use case does not need the fake decoding and is better off without it.
2. I suspect the use cases where fake decoding is both needed and sufficient are relatively rare.
3. Fake decoding is dangerous (Lemburg).
4. People who know enough to use it safely should already know about how latin-1 relates to unicode, and therefore do not need an alias.
5. Other people should not be encouraged to use it as a fake.
OK, I understand that using 'latin1' is just a hack and not a Pythonic way. Then, I hope bytes gets a fast and efficient "format" method like:
    b'{0} {1}'.format(23, b'foo')  # accepts int, float, bytes, bool, None
    23 foo
    b'{0}'.format('foo')  # raises TypeError for other types.
    TypeError
And line buffering in binary mode is also nice.
*I meant to ask earlier whether there are ascii-incompatible encodings for which the original code and my revision would not work. I gather from the responses that yes, there are some.
-- Terry Jan Reedy
_______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
-- INADA Naoki <songofacandy@gmail.com>
INADA Naoki writes:
Any thoughts?
-1 TOOWTDI. No alias, please. It's just an idiom people who need the functionality will need to learn (but see comment on urllib.parse below). As Terry says, it's hard to believe that use of the latin1 codec and str for internal processing is going to be a bottleneck in practical applications. I wonder if it would be possible to generalize Nick's work on urllib.parse to a more general class.
On Fri, May 27, 2011 at 12:59 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I wonder if it would be possible to generalize Nick's work on urllib.parse to a more general class.
I thought about that when I was implementing it, and I don't really think so. The decode/encode cycle in urllib.parse is based on a few key elements:

1. The URL standard itself mandates a 7-bit ASCII bytestream. The implicit conversion accordingly uses the ascii codec with strict error handling, so if you want to handle malformed URLs, you still have to do your own decoding and pass in already decoded text strings rather than the raw bytes (as there is no way for the library to guess an appropriate encoding for any non-ASCII bytes it encounters).

2. The affected urllib.parse APIs are all stateless - the output is determined by the inputs. Accordingly, it was fairly straightforward to coerce all of the arguments to strings and also create a "coerce result" callable that is either a no-op that just returns its argument (string inputs) or calls .encode() on its input and returns that (bytes/bytearray inputs).

3. All of the operations that returned tuples were updated to return namedtuple subclasses with an encode() method that passed the encoding command down to the individual tuple elements. These subclasses all came in matched pairs (one that held only strings, another that held only bytes).

The argument coercion function could probably be extracted and placed in the string module, but it isn't all that useful on its own - it's adequate if you're only returning single strings, but needs to be matched with an appropriately designed class hierarchy if you're returning anything more complicated. I believe RDM used a similar design pattern of parallel bytes and string based return types to get the email package into a more usable state for 3.2.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
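[Editor's note: a simplified sketch of the coercion pattern Nick describes, not the actual urllib.parse implementation. The function names `coerce_args` and `upper_fragment` are invented for illustration.]

```python
# Coerce a bytes argument to str via strict ASCII, and return a
# callable that re-encodes the result only when the caller passed
# bytes. String inputs pass through untouched.
def coerce_args(arg):
    if isinstance(arg, (bytes, bytearray)):
        return arg.decode('ascii'), (lambda s: s.encode('ascii'))
    return arg, (lambda s: s)

# A toy stateless API using the pattern: output type mirrors input type.
def upper_fragment(url):
    url, coerce_result = coerce_args(url)
    return coerce_result(url.upper())

assert upper_fragment('path') == 'PATH'
assert upper_fragment(b'path') == b'PATH'
```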
On Fri, May 27, 2011 at 12:02 PM, INADA Naoki <songofacandy@gmail.com> wrote:
Then, I hope bytes has a fast and efficient "format" method like:
    b'{0} {1}'.format(23, b'foo')  # accepts int, float, bytes, bool, None
    23 foo
    b'{0}'.format('foo')  # raises TypeError for other types.
    TypeError
What method is invoked to convert the numbers to text? What encoding is used to convert those numbers to text? How does this operation avoid also converting the *bytes* object to text and then reencoding it?

Bytes are not text. Struggling against that is a recipe for making life hard for yourself in Python 3.

That said, there *may* still be a place for bytes.format(). However, proper attention needs to be paid to the encoding issues, and the question of how arbitrary types can be supported (including how to handle the fast path for existing bytes() and bytearray() objects). The pedagogic cost of making it even harder than it already is to convince people that bytes are not text would also need to be considered.
And line buffering in binary mode is also nice.
The Python 3 IO stack already provides b'\n' based line buffering for binary files. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
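[Editor's note: a small demonstration of Nick's point that binary files still iterate by b'\n' lines, using io.BytesIO as a stand-in for a real binary file.]

```python
import io

# Binary streams split on b'\n' through readline() and iteration,
# with no text decoding involved.
buf = io.BytesIO(b'one\ntwo\nthree')
assert buf.readline() == b'one\n'
assert list(buf) == [b'two\n', b'three']
```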
On Fri, May 27, 2011 at 2:24 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Fri, May 27, 2011 at 12:02 PM, INADA Naoki <songofacandy@gmail.com> wrote:
Then, I hope bytes has a fast and efficient "format" method like:
    b'{0} {1}'.format(23, b'foo')  # accepts int, float, bytes, bool, None
    23 foo
    b'{0}'.format('foo')  # raises TypeError for other types.
    TypeError
What method is invoked to convert the numbers to text?
Doesn't invoke any methods. Please imagine stdio's printf.
What encoding is used to convert those numbers to text? How does this operation avoid also converting the *bytes* object to text and then reencoding it?
I wrote a wrong example.
    b'{0} {1}'.format(23, b'foo')  # accepts int, float, bytes, bool, None
    23 foo
This should be b'23 foo'. Numbers are encoded as ASCII.
Bytes are not text. Struggling against that is a recipe for making life hard for yourself in Python 3.
I love unicode and use unicode when I can use it. But this is a problem in the real world. For example, Python 2 is convenient for analyzing line based logs containing some different encodings. Python 3
That said, there *may* still be a place for bytes.format(). However, proper attention needs to be paid to the encoding issues, and the question of how arbitrary types can be supported (including how to handle the fast path for existing bytes() and bytearray() objects). The pedagogic cost of making it even harder than it already is to convince people that bytes are not text would also need to be considered.
And line buffering in binary mode is also nice.
The Python 3 IO stack already provides b'\n' based line buffering for binary files.
But the doc says that "1 to select line buffering (only usable in text mode)," http://docs.python.org/dev/library/functions.html#open
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
-- INADA Naoki <songofacandy@gmail.com>
On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy@gmail.com> wrote:
But the doc says that "1 to select line buffering (only usable in text mode)," http://docs.python.org/dev/library/functions.html#open
True, I was thinking about the public API (readline/readlines) rather than the underlying buffering. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy@gmail.com> wrote:
I love unicode and use unicode when I can use it. But this is a problem in the real world. For example, Python 2 is convenient for analyzing line based logs containing some different encodings. Python 3
...deliberately makes that difficult because it is *wrong*. Binary files containing a mixture of encodings cannot be safely treated as text. The closest it is possible to get is to support only ASCII-compatible encodings by decoding as ASCII with the "surrogateescape" error handler, so that bytes with the high order bit set can be faithfully reproduced on reencoding. However, such code will potentially fail once it encounters a non-ASCII-compatible encoding, such as UTF-16 or -32.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
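[Editor's note: a sketch of the approach Nick outlines, applied to the thread's 'cat -n' example. The input lines are assumed to be in some unknown ASCII-compatible encoding (latin-1 here); high bytes survive the decode/encode cycle unchanged.]

```python
# Decode with 'ascii' plus surrogateescape so undecodable bytes are
# preserved, format in the text domain, then encode back the same way.
lines = [b'hello\n', b'caf\xe9\n']   # unknown ASCII-compatible encoding
out = []
for n, raw in enumerate(lines):
    line = raw.decode('ascii', errors='surrogateescape')
    out.append('{0:5d}\t{1}'.format(n, line)
               .encode('ascii', errors='surrogateescape'))
assert out[1] == b'    1\tcaf\xe9\n'  # the 0xE9 byte is reproduced
```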
Nick Coghlan writes:
On Fri, May 27, 2011 at 12:02 PM, INADA Naoki <songofacandy@gmail.com> wrote:
Then, I hope bytes has a fast and efficient "format" method like:
I still don't see a use case for a fast and efficient bytes.format() method. The latin-1 codec is O(n) with a very small coefficient. It seems to me this is "really" all about TOOWTDI: we'd like to be able to interpolate data received as arguments into a data stream using the same idiom everywhere, whether the stream consists of text, bytes, or class Froooble instances. (I admit I don't offhand know how you'd spell "{0}" in a Froooble stream.) OK, so at present only bytes is a plausible application, but I'm willing to go there. Then, if it turns out that the latin-1 codec imposes too high overhead on .format() in some application, the concerned parties can optimize it.
b'{0} {1}'.format(23, b'foo') # accepts int, float, bytes, bool, None
I don't see a use case for accepting bool or None. I hadn't thought about float, but are you really gonna need it? On-the-fly generation of CSS "'{0}em'.format(0.5)" or something like that, I guess?
23 foo
b'{0}'.format('foo') # raises TypeError for other types.
Philip Eby has a use case for accepting str as long as the ascii codec in strict error mode works on the particular instances of str. Although I'm not sure he would consider a .format() method efficient enough, ISTR he wanted the compiler to convert literals.
TypeError
What method is invoked to convert the numbers to text? What encoding is used to convert those numbers to text? How does this operation avoid also converting the *bytes* object to text and then reencoding it?
OTOH, Nick, aren't you making this harder than it needs to be? After all,
Bytes are not text.
Precisely. So bytes.format() need not handle *all* text-like manipulations, just protocol magic that puns ASCII-encoded text. If a bytes object is displayed sorta like text, then it *is* *all* bytes in the ASCII repertoire (not even the right half of Latin-1 is allowed). In bytes.format(), bytes are bytes, they don't get encoded, they just get interpolated into the bytes object being created. For other stuff, especially integers, if there is a conventional represention for it in ASCII, it *might* be an appropriate conversion for bytes.format() (but see above for my reservations about several common Python types). str (Unicode) might be converted via the ascii codec in strict errors mode, although the purist in me really would rather not go there. AFAICS, this handles all use cases presented so far.
The pedagogic cost of making it even harder than it already is to convince people that bytes are not text would also need to be considered.
This bothers me quite a bit, but my sense is that practicality is going to beat purity (into a bloody pulp :-P) once again.
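[Editor's note: a hypothetical helper matching the semantics Stephen proposes, not a real or proposed API. The name `bytes_interpolate` and the `{}` placeholder convention are invented for illustration: bytes are interpolated verbatim, ints get their conventional ASCII representation, everything else is rejected.]

```python
# Interpolate values into a bytes template without any "fake" text
# decode: bytes pass through untouched, ints are rendered as ASCII.
def bytes_interpolate(template, *args):
    parts = []
    for a in args:
        if isinstance(a, (bytes, bytearray)):
            parts.append(bytes(a))          # bytes are bytes: verbatim
        elif isinstance(a, int):
            parts.append(str(a).encode('ascii'))  # conventional ASCII form
        else:
            raise TypeError('unsupported type: %r' % type(a))
    pieces = template.split(b'{}')          # naive placeholder splitting
    out = pieces[0]
    for piece, part in zip(pieces[1:], parts):
        out += part + piece
    return out

assert bytes_interpolate(b'{} {}', 23, b'foo') == b'23 foo'
```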
On Fri, May 27, 2011 at 6:46 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
> What method is invoked to convert the numbers to text? What encoding is used to convert those numbers to text? How does this operation avoid also converting the *bytes* object to text and then reencoding it?
OTOH, Nick, aren't you making this harder than it needs to be? After all,
To me, the defining feature of str.format() over str.__mod__() is the ability for types to provide their own __format__ methods, rather than being limited to a predefined set of types known to the interpreter. If bytes were to reuse the same name, then I'd want to see similar flexibility.

Now, a *different* bytes method (bytes.interpolate, perhaps?), limited to specific types, may make sense, but such an alternative *shouldn't* be conflated with the text formatting API. However, proponents of such an addition need to clearly articulate their use cases and proposed solution in a PEP to make it clear that they aren't merely trying to perpetuate the bytes/text confusion that plagues 2.x 8-bit strings.

We can almost certainly do better when it comes to constructing byte sequences from component parts, but simply saying "oh, just add a format() method to bytes objects" doesn't cut it, since the associated magic methods for str.format are all string based, and bytes interpolation also needs to address encoding issues for anything that isn't already a byte sequence.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 2011-05-27, at 11:27 , Nick Coghlan wrote:
On Fri, May 27, 2011 at 6:46 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
What method is invoked to convert the numbers to text? What encoding is used to convert those numbers to text? How does this operation avoid also converting the *bytes* object to text and then reencoding it?
OTOH, Nick, aren't you making this harder than it needs to be? After all,
To me, the defining feature of str.format() over str.__mod__() is the ability for types to provide their own __format__ methods, rather than being limited to a predefined set of types known to the interpreter. If bytes were to reuse the same name, then I'd want to see similar flexibility.
Now, a *different* bytes method (bytes.interpolate, perhaps?), limited to specific types may make sense, but such an alternative *shouldn't* be conflated with the text formatting API.
However, proponents of such an addition need to clearly articulate their use cases and proposed solution in a PEP to make it clear that they aren't merely trying to perpetuate the bytes/text confusion that plagues 2.x 8-bit strings.
We can almost certainly do better when it comes to constructing byte sequences from component parts, but simply saying "oh, just add a format() method to bytes objects" doesn't cut it, since the associated magic methods for str.format are all string based, and bytes interpolation also needs to address encoding issues for anything that isn't already a byte sequence.
I don't see anything I could disagree with. Especially not in the last paragraph.
Nick Coghlan writes:
On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy@gmail.com> wrote:
I love unicode and use unicode when I can use it. But this is a problem in the real world. For example, Python 2 is convenient for analyzing line based logs containing some different encodings.
Where's the use case for bytes here?
Python 3
...deliberately makes that difficult because it is *wrong*.
Nick, you should have stopped there. :-)

I can see very little difference between Python 2 and Python 3 in this use case, except that Python 2 makes it much easier to write easily crashable programs. In both versions, the safe thing to do for such a program is either to slurp the whole log with open(log, encoding=<whatever>, errors=<something nonfatal>) (that's Python 3 code; Python 2 makes this more tedious, in fact). But no need for reading as bytes in Python 3 visible here, move along, people!

Alternatively, one could write a function that reads lines from the log as bytes, and tries different encodings for each line (perhaps interacting with the user) and eventually uses some default encoding and a nonfatal error handler to get *something*. This requires reading as bytes, but it's no easier to write in Python 2 AFAICS. Granted, such a function will not easily be portable between Python 2 and 3, but that's a different problem.
Binary files containing a mixture of encodings cannot be safely treated as text.
"Safety" is use-case-dependent. I suppose Inada-san considers using Python 2 strs to receive file input safe enough for his log analyzer. While we shouldn't encourage that (and either errors='ignore' or errors='surrogateescape' should be easy enough for him in the log analysis case[1]), I don't think we should demand GIGO with 100% fidelity in all use cases, either. Footnotes: [1] In new code. Again, a port of existing Python 2 code to Python 3 might not be trivial, depending on how he handles unexpected encodings and how pervasively they are manipulated in his program.
Nick Coghlan writes:
To me, the defining feature of str.format() over str.__mod__() is the ability for types to provide their own __format__ methods,
Ah, so you object to the _spelling_, not the requested functionality. (At least, not all of it.) All is clear now! OK, I retract my suggestion, but I'll let you beat up on anybody who dredges it up in the future. Specifically, I think that calling it "bytes.format" (a) is discoverable and (b) it is not obvious to me that __format_bytes__ functionality for arbitrary types is a bad thing, although I personally have no use case and am unlikely to catch one for a while (thus at most I'm now -0, and could easily be persuaded to lower that).
bytes interpolation also needs to address encoding issues for anything that isn't already a byte sequence.
Sure, but my proposal here still stands: whatever the API is, and whatever types it supports, the assumption is that interpolation uses the conventional ASCII representation for the given type (and for interpolations implemented in stdlib there had better be universal agreement on what that convention is).
On Fri, May 27, 2011 at 9:07 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Nick Coghlan writes:
> To me, the defining feature of str.format() over str.__mod__() is the > ability for types to provide their own __format__ methods,
Ah, so you object to the _spelling_, not the requested functionality. (At least, not all of it.) All is clear now!
OK, I retract my suggestion, but I'll let you beat up on anybody who dredges it up in the future. Specifically, I think that calling it "bytes.format" (a) is discoverable and (b) it is not obvious to me that __format_bytes__ functionality for arbitrary types is a bad thing, although I personally have no use case and am unlikely to catch one for a while (thus at most I'm now -0, and could easily be persuaded to lower that).
In the specific case of adding bytes.format(), it's the weight of the backing machinery that bothers me - the PEP 3101 implementation isn't small, and providing a parallel API for bytes without slowing down the existing string implementation would be problematic (code re-use would likely slow down the common case even further, while avoiding re-use would likely end up duplicating a lot of code). However, *if* a solid set of use cases for direct bytes interpolation can be identified (and that's a big if), then it may be possible to devise a narrower, more focused API that doesn't require such a heavy back end to support it. But the use cases have to come first, and ones that are better expressed via techniques such as ASCII decoding with the surrogateescape error handler to support round-tripping don't count. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
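The round-tripping technique mentioned above (ASCII decoding with the surrogateescape error handler) looks like this in practice, using illustrative data:

```python
# Mostly-ASCII protocol text with one stray non-ASCII byte.
raw = b'Subject: re\xa9port\r\n'
# surrogateescape smuggles the undecodable byte through as a
# lone surrogate (U+DCA9) so the data can be handled as str...
text = raw.decode('ascii', errors='surrogateescape')
text = text.replace('Subject', 'X-Subject')  # manipulate as text
# ...and encoding restores the original byte exactly.
out = text.encode('ascii', errors='surrogateescape')
assert out == b'X-Subject: re\xa9port\r\n'
```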
Nick Coghlan wrote:
The pedagogic cost of making it even harder than it already is to convince people that bytes are not text would also need to be considered.
I think that boat was missed some time ago. If there were ever a serious intention to teach people that bytes are not text by limiting the feature set of bytes, it would have been better served by not giving bytes *any* features that assumed a particular encoding. As it is, bytes has quite a lot of features that implicitly treat it as ascii-encoded text: the literal and repr() forms, capitalize(), expandtabs(), lower(), splitlines(), swapcase(), title(), upper(), and all the is*() methods. Accepting all of that, and then saying "Oh, no, we couldn't possibly provide a format() method, because bytes are not text" seems a tad inconsistent. -- Greg
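For instance, each of the following methods only makes sense if the bytes are assumed to be ASCII-encoded text (a quick illustration):

```python
# Case-mapping and tab expansion all assume ASCII text.
assert b'hello world'.title() == b'Hello World'
assert b'MixedCase'.swapcase() == b'mIXEDcASE'
assert b'a\tb'.expandtabs(4) == b'a   b'
assert b'123'.isdigit()
# Non-ASCII bytes are simply passed through untouched:
assert b'\xc3\xa9'.upper() == b'\xc3\xa9'
```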
On Sat, May 28, 2011 at 10:55 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Nick Coghlan wrote:
The pedagogic cost of making it even harder than it already is to convince people that bytes are not text would also need to be considered.
I think that boat was missed some time ago. If there were ever a serious intention to teach people that bytes are not text by limiting the feature set of bytes, it would have been better served by not giving bytes *any* features that assumed a particular encoding.
As it is, bytes has quite a lot of features that implicitly treat it as ascii-encoded text: the literal and repr() forms, capitalize(), expandtabs(), lower(), splitlines(), swapcase(), title(), upper(), and all the is*() methods.
Accepting all of that, and then saying "Oh, no, we couldn't possibly provide a format() method, because bytes are not text" seems a tad inconsistent.
Originally we didn't have all of that - more and more of it crept back in at the behest of several binary protocol folks (including me, if I recall correctly). The urllib.parse experience has convinced me that giving in to that pressure was a mistake. We went for a premature optimisation, and screwed up the bytes API as a result. Yes, there is a potential performance issue with the decode/process/encode model, but simply keeping a bunch of string methods in the bytes API was the wrong answer (and something that isn't actually all that useful in practice, for the reasons brought up in this and other recent threads). Perhaps it is time to resurrect the idea of an explicit 'ascii' type? Add a'' literals, support the full string API as well as the bytes API, deprecate all string APIs on bytes and bytearray objects. The other thing I have learned in trying to deal with some of these issues is that ASCII-encoded text really *is* special, compared to all other encodings, due to its widespread use in a multitude of networking protocols and other formats. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan wrote:
Perhaps it is time to resurrect the idea of an explicit 'ascii' type? Add a'' literals, support the full string API as well as the bytes API, deprecate all string APIs on bytes and bytearray objects.
That sounds like an idea worth pursuing. Maybe also introduce an x'...' literal for bytes at the same time, with a view to eventually deprecating and removing the b'...' syntax. I don't think I would remove *all* the string methods from bytes, only the ones that assume ascii encoding. Searching and replacing substrings etc. still makes sense on arbitrary bytes. How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode, or should an explicit decode() be required? -- Greg
Greg Ewing wrote:
Nick Coghlan wrote:
Perhaps it is time to resurrect the idea of an explicit 'ascii' type? Add a'' literals, support the full string API as well as the bytes API, deprecate all string APIs on bytes and bytearray objects.
That sounds like an idea worth pursuing. Maybe also introduce an x'...' literal for bytes at the same time, with a view to eventually deprecating and removing the b'...' syntax.
I don't think I would remove *all* the string methods from bytes, only the ones that assume ascii encoding. Searching and replacing substrings etc. still makes sense on arbitrary bytes.
How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode, or should an explicit decode() be required?
And what happens when a char > 127 hits the ascii stream? As for unicode interoperation, I'm inclined to let it be implicit, since ascii directly overlaps unicode. Depending, of course, on the answer to the above question. ~Ethan~
On 5/27/2011 7:51 AM, Nick Coghlan wrote:
In the specific case of adding bytes.format(), it's the weight of the backing machinery that bothers me - the PEP 3101 implementation isn't small, and providing a parallel API for bytes without slowing down the existing string implementation would be problematic (code re-use would likely slow down the common case even further, while avoiding re-use would likely end up duplicating a lot of code). However, *if* a solid set of use cases for direct bytes interpolation can be identified (and that's a big if), then it may be possible to devise a narrower, more focused API that doesn't require such a heavy back end to support it.
In Python 2.x str.format() and unicode.format() share the same implementation, using the Objects/stringlib mechanism of #defines and multiple includes. So while you do get the compiled code included twice, there's only one source file that implements them both. I don't think there's any concern about performance issues. And Python 3.x has the exact same implementation, although it's only included for unicode strings. It would not be difficult to add .format() for bytes. There have been various discussions over the years of how to actually do that. I think the most recent one was to add an __bformat__ method. I'm not saying any of this is a good idea or desirable. I'm just saying it would be easy to do and wouldn't hurt the performance of unicode.format(). Eric.
On Sat, May 28, 2011 at 7:43 PM, Eric Smith <eric@trueblade.com> wrote:
There have been various discussions over the years of how to actually do that. I think the most recent one was to add an __bformat__ method.
Python 2.x was different, as the automatic unicode coercion meant class developers still only needed to provide __str__ (or __unicode__ if they wanted to return non-ASCII data). __bformat__ (and similar ideas) are somewhat different beasts due to the encoding issues involved. Those aren't insurmountable, but they're things that don't come up with pure unicode handling (2.x unicode, 3.x str) or data that is essentially assumed to be latin-1 encoded in many cases (2.x str)
I'm not saying any of this is a good idea or desirable. I'm just saying it would be easy to do and wouldn't hurt the performance of unicode.format().
I'm still not sure about that, since the 2.x str.format() pretty much ignores the associated encoding problems, and I don't believe perpetuating that behaviour would be appropriate for 3.x bytes. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sat, May 28, 2011 at 12:23 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Greg Ewing wrote:
How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode, or should an explicit decode() be required?
And what happens when a char > 127 hits the ascii stream?
These are the kinds of questions that make it clear that the answer here is far from being as simple as merely adding more string methods to the existing bytes type. The underlying data model is simply *wrong* for working with bytes as if they were text. For a previous, more flexible, incarnation of this idea, Barry's post is the earliest record I found of the idea of a byte sequence oriented type that carried its encoding metadata along with it: http://mail.python.org/pipermail/python-dev/2010-June/100777.html However, supporting multi-byte codes (and other stateful codecs like ShiftJIS) poses problems for slicing operations (just as it does for us already in Unicode slicing). Hence the possibility of strictly limiting this to 7-bit ASCII - the main problem with most bytes-as-text suggestions is that they don't work for arbitrary subsets of the codecs available in the standard library and it generally isn't entirely clear which codecs will work and which ones won't. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sat, May 28, 2011 at 12:43 PM, Eric Smith <eric@trueblade.com> wrote:
And Python 3.x has the exact same implementation, although it's only included for unicode strings. It would not be difficult to add .format() for bytes.
There have been various discussions over the years of how to actually do that. I think the most recent one was to add an __bformat__ method.
Well, that's actually a great idea I think. A format method on bytes could produce some data which is not ascii, and eventually become struct.pack on steroids. struct.pack has plenty of problems:

* unable to use named fields, which is useful for describing big structures
* all fields are fixed-length, which is unfortunate for today's trend of variable-length integers
* can't specify separators between fields

I also use the str(intvalue).encode('ascii') idiom a lot. So probably I'd suggest having something like __bformat__ with format values somewhat similar to the ones struct.pack has, along with str-like ones for integers. Also it might be useful to have a `!len` conversion for bytes fields, for easier encoding of length-prefixed strings. To show an example, here is how a two-chunk png file could be encoded:

    (b"\x89PNG\r\n\x1A\n"
     b"{s1!len:>L}IHDR{s1}{crc1:>L}"
     b"{s2!len:>L}IDAT{s2}{crc2:>L}\0\0\0\0IEND".format(
         s1=section1, crc1=crc(section1),
         s2=section2, crc2=crc(section2)))

-- Paul
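For contrast, here is a hedged sketch of how such chunk framing can be assembled today with struct.pack and zlib.crc32 (png_chunk is a hypothetical helper name, not an existing API):

```python
import struct
import zlib

def png_chunk(chunk_type, data):
    # PNG chunk layout: 4-byte big-endian length, then the chunk
    # type, then the data, then a CRC computed over type + data.
    return (struct.pack('>L', len(data)) + chunk_type + data
            + struct.pack('>L', zlib.crc32(chunk_type + data)))

# The empty IEND chunk has a well-known 12-byte encoding.
assert png_chunk(b'IEND', b'') == b'\x00\x00\x00\x00IEND\xaeB\x60\x82'
```

This works, but every numeric field requires a separate struct.pack call, which is exactly the awkwardness the __bformat__ idea above is trying to address.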
Greg Ewing writes:
How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode,
Definitely not! Bytes are not text, and the programmer must say when they want those bytes decoded. The Python translator must not be asked to guess.
or should an explicit decode() be required?
Simplest. But IMHO worth considering is an implicit coercion of Unicode to ascii via decode() with strict errors. Remember, Unicode is an invertible mapping of characters to abstract integers, which may be represented in various different ways, such as bytes, 32-bit words, or UTF-8. So in some sense there is no violation of the Unicode type here. Sorry, I can't explain more clearly at the moment, but I have a strong sense that coercion (ASCII) bytes -> Unicode *changes* or maybe even "destroys" the type of the byte, while the coercion (ASCII) Unicode -> bytes takes an abstract type "Unicode" and refines to a concrete type "bytes". Among other things, this is always reversible. This takes into account the common usage of punning natural language encoded in ASCII on binary protocol magic numbers. Then one could write stuff like my_pipe.write('HELO ' + my_fqdn) while true pedants would of course write my_pipe.write(b'HELO ' + my_fqdn) This doesn't explain how to make it easy to ensure that my_fqdn is bytes, of course, and that makes me uneasy about whether this would actually be useful, or merely confusing. (However, there are use cases where it is claimed that 'HELO ' is needed both as str and as bytes.)
On Mon, May 30, 2011 at 12:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
(However, there are use cases where it is claimed that 'HELO ' is needed both as str and as bytes.)
My current opinion is that all of this still needs more experimentation outside the core before we start fiddling any further with the builtins (we blinked once in the lead-up to 3.0 by allowing bytes and bytearray to retain a lot of string methods that assume an ASCII compatible encoding, and I now have my doubts about the wisdom of even that step). I don't have a good answer on how to deal with the real world situations where the *use case* blurs the bytes/text distinction (typically by embedding ASCII text inside an otherwise binary protocol), and given the potential to backslide into the bad old days of 8-bit strings, I'm not prepared to guess, either. 3.x has largely cleared the decks to allow a better solution to evolve in this space by making it harder to blur the line accidentally, and decode()/manipulate/encode() already nicely covers many stateless use cases. If it turns out we need another type, or some other API, to deal gracefully with any use cases where that isn't enough, then so be it. However, I think we need to let the status quo run for a while longer and see what people actually using the current types in production come up with. The bytes/text division in Python 3 is by far the biggest conceptual change between the two languages, so it's going to take some time before we can figure out how many of the problems encountered are real issues with the split model not covering some use cases and how many are just people (including us) taking time to get used to the sharp division between the two worlds. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On May 29, 2011, at 9:45 PM, Nick Coghlan wrote:
On Mon, May 30, 2011 at 12:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
(However, there are use cases where it is claimed that 'HELO ' is needed both as str and as bytes.)
My current opinion is that all of this still needs more experimentation outside the core before we start fiddling any further with the builtins (we blinked once in the lead-up to 3.0 by allowing bytes and bytearray to retain a lot of string methods that assume an ASCII compatible encoding, and I now have my doubts about the wisdom of even that step). I don't have a good answer on how to deal with the real world situations where the *use case* blurs the bytes/text distinction (typically by embedding ASCII text inside an otherwise binary protocol), and given the potential to backslide into the bad old days of 8-bit strings, I'm not prepared to guess, either.
+1 Raymond
Changing the subject to what it has actually become. On 5/27/2011 5:27 AM, Nick Coghlan wrote:
We can almost certainly do better when it comes to constructing byte sequences from component parts, but simply saying "oh, just add a format() method to bytes objects" doesn't cut it, since the associated magic methods for str.format are all string based,
STRING FORMATTING

From a modern and Python viewpoint, string formatting is about interpolating text representations of objects into a text template. By default, the text representation is str(object).

Exception 1. str.format has an optional conversion specifier "!s/r/a" to specify repr(object) or ascii(object) instead of str(object). (It can also be used to override exception 2.) This is not relevant to bytes formatting.

Exception 2. str.format, like % formatting, does special processing of numbers. Electronic computing was originally used only to compute numbers, and text formatting was originally about formatting numbers, usually in tables, with optional text decoration. That is why the maximum field size for string interpolation is still called 'precision'. There are numerous variations in number formatting, and most of the complication of format specifications arises therefrom.

BYTES FORMATTING

If the desired result consists entirely of text encoded with one encoding, the current recommended method is to construct the text and encode. I think this is the proper method and do not think that anything we add should be aimed at this use case. There are two other current methods to assemble bytes from pieces. One is concatenation; it has the same advantages and disadvantages as string concatenation. Another, overlooked in the current discussion so far, is in-place editing of a bytearray by index and slice assignment. It has the disadvantage of having to know the correct indexes and slice points. If we add another bytes formatting function or method, I think it should be about interpolating bytes into a bytes template. The use cases would be anything other than mono-encoded text -- text with multiple encodings or non-text bytes possibly intermixed with encoded text.
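In-place editing of a bytearray by slice assignment, for example, looks like this (an illustrative sketch):

```python
buf = bytearray(b'GET /old HTTP/1.1\r\n')
# Slice assignment may change the length of the buffer.
buf[5:8] = b'new-path'
assert bytes(buf) == b'GET /new-path HTTP/1.1\r\n'
# A zero-width slice inserts bytes without replacing anything.
buf[0:0] = b'# '
assert bytes(buf) == b'# GET /new-path HTTP/1.1\r\n'
```

The caveat mentioned above applies: the indexes and slice points (5:8 here) have to be known in advance.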
and bytes interpolation also needs to address encoding issues for anything that isn't already a byte sequence.
As indicated above, I disagree if 'encoding' means 'text encoding'. Let .encode handle encoding issues.

PROPOSAL

A bytes template uses b'{' and b'}' to mark interpolation fields and other ascii bytes within as needed. It uses the ascii equivalent of the string field_name spec. It does not have a conversion spec. The format_spec should have the minimum needed for existing public protocols. How much more is up for discussion. We need use cases. One possibility to keep in mind is that a bytes template could be constructed by an ascii-compatible encoding of formatted text. Specs for bytes fields can be protected in a text template by doubling the braces.
>>> '{} {{byte-field-spec}}'.format(1).encode()
b'1 {byte-field-spec}'
A major issue is what to do with numbers. Sometimes they need to be ascii encoded, sometimes binary encoded. The baseline is to do nothing extra and require all args to be bytes. I think this may be appropriate for floats, as they are seldom specifically used in protocols. I think the same may be true for ints with signs. So I think we mainly need to consider counts (unsigned ints) for possible exceptional processing.

Option 0. As stated, no special number specs.

Option 1. Use a subset of the current int spec to produce ascii encodings; use struct.pack for binary encodings. (How many of the current integer presentation types would be needed?)

Option 2. Use an adaptation of the struct.pack mini-language to produce binary encodings; use encoded str.format for ascii encodings. (The latter might be done as part of a text-to-bytes-template process as indicated above.)

Option 3. Combine options 1 and 2. This might best be done by replacing the omitted 'conversion' field with a 'number-encoding' field, b'!a' or b'!b', to indicate ascii or binary conversion and corresponding interpretation of the format spec. (In other words, do not try to combine the number-to-text and number-to-binary mini-languages, but add a 'prefix' to specify which is being used.)

-- Terry Jan Reedy
On Sun, May 29, 2011 at 9:45 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Mon, May 30, 2011 at 12:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
(However, there are use cases where it is claimed that 'HELO ' is needed both as str and as bytes.)
My current opinion is that all of this still needs more experimentation outside the core before we start fiddling any further with the builtins (we blinked once in the lead-up to 3.0 by allowing bytes and bytearray to retain a lot of string methods that assume an ASCII compatible encoding, and I now have my doubts about the wisdom of even that step). I don't have a good answer on how to deal with the real world situations where the *use case* blurs the bytes/text distinction (typically by embedding ASCII text inside an otherwise binary protocol), and given the potential to backslide into the bad old days of 8-bit strings, I'm not prepared to guess, either.
3.x has largely cleared the decks to allow a better solution to evolve in this space by making it harder to blur the line accidentally, and decode()/manipulate/encode() already nicely covers many stateless use cases. If it turns out we need another type, or some other API, to deal gracefully with any use cases where that isn't enough, then so be it. However, I think we need to let the status quo run for a while longer and see what people actually using the current types in production come up with. The bytes/text division in Python 3 is by far the biggest conceptual change between the two languages, so it's going to take some time before we can figure out how many of the problems encountered are real issues with the split model not covering some use cases and how many are just people (including us) taking time to get used to the sharp division between the two worlds.
Well said, Nick. We ought to attempt to live with the current situation for quite a bit longer before stirring the pot again. My feeling is that one of the main reasons why this topic keeps coming up is simply that it is different from Python 2 -- this is "the year of Python 3" so more people than ever before are discovering the differences between Python 2 and 3. Most people's minds probably haven't switched over, and the solutions and attitudes that worked in Python 2 don't always work so well in Python 3. Let's also remember that while Python is not exactly blazing a new trail here, it is also not following the most conservative course. Most languages of Python's vintage or older are still using a model that blurs the line between text and binary data, representing Unicode text as bytes that happen to be encoded in some encoding. Even if the language assumes a default encoding this doesn't mean that all data manipulated is actually text encoded in that encoding -- it just means that you may get nonsense when you use text operations on data that uses some other encoding, just as you get nonsense when you use text operations on binary data (e.g. using readlines() on a JPEG file). Python lets you do this too, to some extent, with some of the text operations on bytes data, and this is definitely a compromise. I hope that we have built in just enough friction to remind people that this is not the best way to deal with text most of the time, while still allowing advanced users who are writing e.g. parsers for Internet protocols to stay at the bytes layer at a reasonable cost. Personally I think we got this close enough to right that we won't have to rethink the whole thing, even if small tweaks might be possible; but there's no need to rush. -- --Guido van Rossum (python.org/~guido)
Stephen J. Turnbull wrote:
Greg Ewing writes:
How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode,
Definitely not! Bytes are not text, and the programmer must say when they want those bytes decoded.
But the proposed 'ascii' type *is* text, though. Whether it's a good idea to auto-coerce I'm not sure, but it's not obviously wrong to do so. -- Greg
On 30/05/2011 21:04, Terry Reedy wrote:
Changing the subject to what it has actually become. PROPOSAL
A bytes template uses b'{' and b'}' to mark interpolation fields and other ascii bytes within as needed. It uses the ascii equivalent of the string field_name spec. It does not have a conversion spec. The format_spec should have the minimum needed for existing public protocols. How much more is up for discussion. We need use cases.
One possibility to keep in mind is that a bytes template could be constructed by an ascii-compatible encoding of formatted text. Specs for bytes fields can be protected in a text template by doubling the braces.
>>> '{} {{byte-field-spec}}'.format(1).encode()
b'1 {byte-field-spec}'
A major issue is what to do with numbers. Sometimes they need to be ascii encoded, sometimes binary encoded. The baseline is to do nothing extra and require all args to be bytes. I think this may be appropriate for floats, as they are seldom specifically used in protocols. I think the same may be true for ints with signs. So I think we mainly need to consider counts (unsigned ints) for possible exceptional processing.
Option 0. As stated, no special number specs.
Option 1. Use a subset of the current int spec to produce ascii encodings; use struct.pack for binary encodings. (How many of the current integer presentation types would be needed?)
Option 2. Use an adaptation of the struct.pack mini-language to produce binary encodings; use encoded str.format for ascii encodings. (The latter might be done as part of a text-to-bytes-template process as indicated above.)
Option 3. Combine options 1 and 2. This might best be done by replacing the omitted 'conversion' field with a 'number-encoding' field, b'!a' or b'!b', to indicate ascii or binary conversion and corresponding interpretation of the format spec. (In other words, do not try to combine the number to text and number to binary mini-languages, but add a 'prefix' to specify which is being used.)
Perhaps something like this:

    # Format int as byte.
    b"{:b}".format(128) returns b"\x80"

    # Format int as double-byte.
    b"{:2b}".format(0x100) returns b"\x00\x01" or b"\x01\x00"

    # Format int as double-byte, little-endian.
    b"{:<2b}".format(0x100) returns b"\x00\x01"

    # Format int as double-byte, big-endian.
    b"{:>2b}".format(0x100) returns b"\x01\x00"

    # Format list of ints as signed bytes.
    b"{:s}".format([1, -2, 3]) returns b"\x01\xFE\x03"

    # Format list of ints as unsigned bytes.
    b"{:u}".format([1, 254, 3]) returns b"\x01\xFE\x03"

    # Format ASCII-only string as bytes.
    b"{:a}".format("abc") returns b"abc"
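For comparison, most of those results are already reachable with existing tools (struct.pack, bytes(), str.encode), which is what any proposed spec mini-language would compete with:

```python
import struct

assert bytes([128]) == b'\x80'                         # int as byte
assert struct.pack('<H', 0x100) == b'\x00\x01'         # double-byte, little-endian
assert struct.pack('>H', 0x100) == b'\x01\x00'         # double-byte, big-endian
assert struct.pack('3b', 1, -2, 3) == b'\x01\xfe\x03'  # list of ints as signed bytes
assert bytes([1, 254, 3]) == b'\x01\xfe\x03'           # list of ints as unsigned bytes
assert 'abc'.encode('ascii') == b'abc'                 # ASCII-only string as bytes
```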
Greg Ewing writes:
Stephen J. Turnbull wrote:
Greg Ewing writes:
How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode,
Definitely not! Bytes are not text, and the programmer must say when they want those bytes decoded.
But the proposed 'ascii' type *is* text, though.
If it's intended that the 'ascii' type *be* text, I don't see the point. It *is* Unicode (with a restricted range), and no coercion is necessary between str and 'ascii', just a change of representation. This can be done completely transparently[1], no need for a new type, except that some effort on the part of implementer can be saved by imposing ongoing annoyance on the application programmer. But even as a separate type, 'ascii' still can't mix with bytes safely, for the same reason that str can't mix with bytes: 'ascii' and str have a known fixed encoding (Unicode), and bytes have an unknown, variable encoding (possibly the non-encoding 'binary'). YAGNI... Footnotes: [1] For some use cases it might be useful to allow specifying the representation in advance, as a micro-optimization.
Stephen J. Turnbull wrote:
But even as a separate type, 'ascii' still can't mix with bytes safely,
Yes, it can, because it's also bytes. :-) If you're using the special ascii type at all, rather than an ordinary str, it's precisely because you want to mix it with bytes. Making that part hard would defeat the purpose. -- Greg
On Tue, May 31, 2011 at 5:32 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Stephen J. Turnbull wrote:
But even as a separate type, 'ascii' still can't mix with bytes safely,
Yes, it can, because it's also bytes. :-)
If you're using the special ascii type at all, rather than an ordinary str, it's precisely because you want to mix it with bytes. Making that part hard would defeat the purpose,
Indeed, the specific use case here is working with ASCII snippets embedded within ASCII compatible encodings (or otherwise demarcated from the 8-bit data). As I stated elsewhere, we still need more usage of Python 3 in production before we can find out whether or not this is a significant enough use case to require builtin support, or if third party libraries will be up to the task. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Greg Ewing writes:
Stephen J. Turnbull wrote:
But even as a separate type, 'ascii' still can't mix with bytes safely,
Yes, it can, because it's also bytes. :-)
To the extent that's safe, you may as well just use str and force encoding with the ascii codec and strict errors (as I suggested earlier). AFAICS, the argument that the visual signal of the special literal syntax helps is bogus. It doesn't help with variables; variables aren't typed in Python. It's still just as possible to type a'äëïöü', although it might make the mistake a little more visible. And in most cases, the use case for this feature will be very stylized, with a very small vocabulary of ASCII puns, written as literals at the point of combination with a bytes object. Anything else I can think of should be handled as text, via conversion to str. I just don't see a use case for an 'ascii' type, vs. coercing str to bytes and raising an error if the str is not all-ASCII.
If you're using the special ascii type at all, rather than an ordinary str, it's precisely because you want to mix it with bytes. Making that part hard would defeat the purpose,
Indeed. Most alleged use cases for "mixing" *should* be made hard to do by operating on bytes directly. Cf. the mixed-encoding log file example.
Nick Coghlan <ncoghlan@gmail.com> wrote:
Perhaps it is time to resurrect the idea of an explicit 'ascii' type? Add a'' literals, support the full string API as well as the bytes API, deprecate all string APIs on bytes and bytearray objects. The other thing I have learned in trying to deal with some of these issues is that ASCII-encoded text really *is* special, compared to all other encodings, due to its widespread use in a multitude of networking protocols and other formats.
I like the deprecations you suggest, but I'd prefer to see a more general solution: the 'str' type extended so that it had two possible representations for strings, the current format and an "encoded" format, which would be kept as an array of bytes plus an encoding. It would transcode only as necessary -- for example, the 're' module might require the current Unicode encoding. An explicit method would be added to allow the user to force transcoding. This would complicate life at the C level, to be sure. Though, perhaps not so much, given the proper macrology. Bill
On 5/31/2011 4:24 AM, Nick Coghlan wrote:
On Tue, May 31, 2011 at 5:32 PM, Greg Ewing
If you're using the special ascii type at all, rather than an ordinary str, it's precisely because you want to mix it with bytes. Making that part hard would defeat the purpose,
Indeed, the specific use case here is working with ASCII snippets embedded within ASCII compatible encodings (or otherwise demarcated from the 8-bit data).
My proposal for a function that interpolates bytes into bytes covers this case. There is no need for a new class at all. I agree that experience and experimentation are needed before adding anything to the stdlib. But here is a baseline version in Python:

from itertools import zip_longest
import re

field = re.compile(b'{}')

def bformat(template, *inserts):
    temlits = re.split(field, template)  # template literals
    res = bytearray()
    for t, i in zip_longest(temlits, inserts, fillvalue=b''):
        res.extend(t)
        res.extend(i)
    return res

print(bformat(b'xxx{}yyy{}zzz', b'help', b'me'))
# bytearray(b'xxxhelpyyymezzz')

This is, of course, not limited to the ASCII subset of bytes.

print(bformat(b'xx\xaa{}yy\xbb{}zzz', b'h\xeeelp', b'm\xeee'))
# bytearray(b'xx\xaah\xeeelpyy\xbbm\xeeezzz')

The next step would be to change the field re to allow a field spec between {} and add capturing parens so that re.split keeps the field specs. Then use those to format the inserted bytes or, later, ints. -- Terry Jan Reedy
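The "next step" Terry describes (capturing a field spec between the braces and using it to format int inserts) might look roughly like this; `bformat2` and its spec handling are an assumption, not part of the original proposal:

```python
import re
from itertools import zip_longest

# Hypothetical extension of bformat: capture an optional ':spec' between
# the braces and apply it (via the format() builtin) to int inserts,
# while passing bytes inserts through untouched.
FIELD = re.compile(rb'\{([^{}]*)\}')

def bformat2(template, *inserts):
    parts = FIELD.split(template)          # literal, spec, literal, spec, ...
    literals, specs = parts[0::2], parts[1::2]
    res = bytearray(literals[0])
    for spec, insert, lit in zip_longest(specs, inserts, literals[1:],
                                         fillvalue=b''):
        if isinstance(insert, int):
            # Format specs are ASCII by definition, so decoding is safe here.
            res.extend(format(insert,
                              spec.decode('ascii').lstrip(':')).encode('ascii'))
        else:
            res.extend(insert)
        res.extend(lit)
    return res

print(bformat2(b'line {:5d}\t{}', 42, b'payload\xff'))
# bytearray(b'line    42\tpayload\xff')
```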
On 5/30/2011 10:11 PM, MRAB wrote:
On 30/05/2011 21:04, Terry Reedy wrote:
Option 3. Combine options 1 and 2. This might best be done by replacing the omitted 'conversion' field with a 'number-encoding' field, b'!a' or b'!b', to indicate ascii or binary conversion and corresponding interpretation of the format spec. (In other words, do not try to combine the number to text and number to binary mini-languages, but add a 'prefix' to specify which is being used.)
Unless someone has a better idea of how to combine than I do ;-).
Perhaps something like this:
# Format int as byte.
b"{:b}".format(128) returns b"\x80"

# Format int as double-byte.
b"{:2b}".format(0x100) returns b"\x00\x01" or b"\x01\x00"

# Format int as double-byte, little-endian.
b"{:<2b}".format(0x100) returns b"\x00\x01"

# Format int as double-byte, big-endian.
b"{:>2b}".format(0x100) returns b"\x01\x00"

# Format list of ints as signed bytes.
b"{:s}".format([1, -2, 3]) returns b"\x01\xFE\x03"

# Format list of ints as unsigned bytes.
b"{:u}".format([1, 254, 3]) returns b"\x01\xFE\x03"

# Format ASCII-only string as bytes.
b"{:a}".format("abc") returns b"abc"
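For comparison, most of the conversions proposed above can already be expressed today with struct and the bytes constructor; the pairings below assume the obvious reading of each proposed code:

```python
import struct

# Existing equivalents for the proposed format codes (assumed semantics):
assert struct.pack('B', 128) == b'\x80'                # int as byte
assert struct.pack('<H', 0x100) == b'\x00\x01'         # double-byte, little-endian
assert struct.pack('>H', 0x100) == b'\x01\x00'         # double-byte, big-endian
assert struct.pack('3b', 1, -2, 3) == b'\x01\xfe\x03'  # list of ints as signed bytes
assert bytes([1, 254, 3]) == b'\x01\xfe\x03'           # list of ints as unsigned bytes
assert 'abc'.encode('ascii') == b'abc'                 # ASCII-only str as bytes
print('all equivalences hold')
```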
Interesting. The core ideas of my proposal are:

* There are bytes construction cases not sensibly handled by text interpolation followed by encoding. Bytes concatenation and bytearray manipulation may be awkward, or follow patterns that can usefully be captured in a new function.

* Bytes interpolation should deal only with bytes, and maybe ints, and have nothing to do with text encoding.

* Design details should be based on use cases and experimentation with suggestions such as the above by people who would be the users of such a function. Experimental functions should be uploaded to PyPI.

-- Terry Jan Reedy
On Wed, Jun 1, 2011 at 2:16 AM, Bill Janssen <janssen@parc.com> wrote:
I like the deprecations you suggest, but I'd prefer to see a more general solution: the 'str' type extended so that it had two possible representations for strings, the current format and an "encoded" format, which would be kept as an array of bytes plus an encoding. It would transcode only as necessary -- for example, the 're' module might require the current Unicode encoding. An explicit method would be added to allow the user to force transcoding.
This would complicate life at the C level, to be sure. Though, perhaps not so much, given the proper macrology.
See PEP 393 - it is basically this idea (although the encodings are fixed for the various sizes rather than allowing arbitrary encodings in the 8-bit internal format). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan <ncoghlan@gmail.com> wrote:
On Wed, Jun 1, 2011 at 2:16 AM, Bill Janssen <janssen@parc.com> wrote:
I like the deprecations you suggest, but I'd prefer to see a more general solution: the 'str' type extended so that it had two possible representations for strings, the current format and an "encoded" format, which would be kept as an array of bytes plus an encoding. It would transcode only as necessary -- for example, the 're' module might require the current Unicode encoding. An explicit method would be added to allow the user to force transcoding.
This would complicate life at the C level, to be sure. Though, perhaps not so much, given the proper macrology.
See PEP 393 - it is basically this idea
Should have realized Martin would have thought of this :-). I'm not sure how I missed it back in January -- high drama at work distracted me, I guess. I might do it a bit differently, with just one pointer, say, "data", and a field which carries the encoding (possibly as a pointer to the appropriate codec). "data" would point to a buffer of the correct type. New strings would by default still be created as UCS-2 or UCS-4 Unicode, just as per today. I'd also allow any encoding which we have a codec for, so that if you are reading from a file containing encoded text, you can carry the exact bytes around unless you need to do something which isn't supported for that encoding -- in which case things get Unicodified behind the scenes. We'd smarten the various string methods over time so that most of them would work so long as the operands matched. str.index, for instance, wouldn't require decoding unless the two strings were of different encodings. Yes, there'd be some "magic" going on, but it wouldn't be worse than the automatic coercions Python does now -- that's just what a HLL does for you.
(although the encodings are fixed for the various sizes rather than allowing arbitrary encodings in the 8-bit internal format).
IMO, the thing that bit us on the fundament with the 2.x str/unicode divide, and continues to bite us with the 3.x str/bytes divide is that we don't carry the encoding as part of the 2.x 'str' value (or as part of the 3.x 'bytes' value). The key here is to store the encoding internally in the string object, so that it's available to do automatic coercion when necessary, rather than *requiring* all coercions to be done manually by some program code. Bill
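A rough user-level sketch of the lazy-transcoding idea (the `LazyStr` class here is purely illustrative; Bill is describing a change inside the str type itself, and PEP 393 does something related with fixed internal representations):

```python
# A wrapper that carries raw bytes plus their encoding, and only
# transcodes to str on demand.
class LazyStr:
    def __init__(self, data, encoding):
        self._data = data          # raw bytes, kept verbatim
        self._encoding = encoding
        self._text = None          # decoded form, filled in lazily

    def __str__(self):
        if self._text is None:     # decode only when actually needed
            self._text = self._data.decode(self._encoding)
        return self._text

    def raw(self):
        """The exact bytes as read, untouched."""
        return self._data

s = LazyStr(b'caf\xc3\xa9', 'utf-8')
print(s.raw())   # b'caf\xc3\xa9' -- no decoding has happened yet
print(str(s))    # 'café' -- decoded on first use
```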
On 6/1/2011 12:34 PM, Bill Janssen wrote:
IMO, the thing that bit us on the fundament with the 2.x str/unicode divide, and continues to bite us with the 3.x str/bytes divide is that we don't carry the encoding as part of the 2.x 'str' value (or as part of the 3.x 'bytes' value). The key here is to store the encoding internally in the string object, so that it's available to do automatic coercion when necessary, rather than *requiring* all coercions to be done manually by some program code.
Some time ago, I posted here a proposal to do just that -- add an encoding field to byte strings (or, I believe, add a new class). It was horribly shot down. Something like 'conceptually wrong, some bytes have 0 or multiple encodings, can just use an attribute or tuple, don't need it'. -- Terry Jan Reedy
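The "just use an attribute" alternative Terry mentions can be sketched as a bytes subclass that remembers its claimed encoding (`EncodedBytes` is a hypothetical name, not the class that was actually proposed):

```python
# A bytes subclass that simply records which encoding, if any, it claims.
class EncodedBytes(bytes):
    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding   # None means 'unknown / not text'
        return self

    def as_text(self):
        if self.encoding is None:
            raise ValueError('no encoding recorded for these bytes')
        return self.decode(self.encoding)

eb = EncodedBytes(b'hello', 'ascii')
print(eb.as_text())                        # 'hello'
print(EncodedBytes(b'\x00\x01').encoding)  # None -- raw binary data
```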
Terry Reedy wrote:
On 6/1/2011 12:34 PM, Bill Janssen wrote:
IMO, the thing that bit us on the fundament with the 2.x str/unicode divide, and continues to bite us with the 3.x str/bytes divide is that we don't carry the encoding as part of the 2.x 'str' value (or as part of the 3.x 'bytes' value). The key here is to store the encoding internally in the string object, so that it's available to do automatic coercion when necessary, rather than *requiring* all coercions to be done manually by some program code.
Some time ago, I posted here a proposal to do just that -- add an encoding field to byte strings (or, I believe, add a new class). It was horribly shot down. Something like 'conceptually wrong, some bytes have 0 or multiple encodings, can just use an attribute or tuple, don't need it'.
A byte stream with multiple encodings? Now *that* seems wrong! It could also be handled by having the encoding field set to some special value indicating Unknown. ~Ethan~
On 6/1/2011 1:58 PM, Ethan Furman wrote:
A byte stream with multiple encodings? Now *that* seems wrong!
No, it is standard in many protocols. ASCII-coded characters and numbers are mixed with binary-coded numbers and binary blobs with their own codings. Bytes are not text, so don't think in terms of just text encodings. -- Terry Jan Reedy
On Thu, Jun 2, 2011 at 3:58 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
A byte stream with multiple encodings? Now *that* seems wrong!
Unicode encodings are just one serialisation format specific to text data. bytes objects may contain *any* serialisation format (e.g. zip archives, Python pickles, Python marshal files, packed binary data, innumerable wire protocols both standard and proprietary). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 6/2/2011 1:37 AM, Nick Coghlan wrote:
On Thu, Jun 2, 2011 at 3:58 AM, Ethan Furman<ethan@stoneleaf.us> wrote:
A byte stream with multiple encodings? Now *that* seems wrong!
Unicode encodings are just one serialisation format specific to text data. bytes objects may contain *any* serialisation format (e.g. zip archives, Python pickles, Python marshal files, packed binary data, innumerable wire protocols both standard and proprietary).
One result of this thread is that I see much better the value of separating the ancient human-level concepts of character and text from the three-decades-old computer concept of the byte. Numbers, lists, and dicts are other old human concepts. As Nick implies above, bytes (or bits within them) are used to encode all data for computer processing. The confusion of character with byte in the original design of Python both privileged and burdened text processing. -- Terry Jan Reedy
On Wed, Jun 1, 2011 at 11:30 PM, Terry Reedy <tjreedy@udel.edu> wrote:
The confusion of character with byte in the original design of Python both privileged and burdened text processing.
Right. And it wasn't only Python: most languages created around or before that time had the same issues (perhaps starting with C's use of "char" meaning byte). Even most IP protocols developed in the 1990s confuse character set and encoding (witness HTTP's "Content-type: text/plain; charset=utf-8"). I'm glad in Python 3 we undertook to improve the distinction. -- --Guido van Rossum (python.org/~guido)
On 6/2/2011 1:58 PM, Guido van Rossum wrote:
On Wed, Jun 1, 2011 at 11:30 PM, Terry Reedy<tjreedy@udel.edu> wrote:
The confusion of character with byte in the original design of Python both privileged and burdened text processing.
Right. And it wasn't only Python: most languages created around or before that time had the same issues (perhaps starting with C's use of "char" meaning byte). Even most IP protocols developed in the 1990s confuse character set and encoding (witness HTTP's "Content-type: text/plain; charset=utf-8").
I hold Python to a higher standard. But yes, that is badly confused.
I'm glad in Python 3 we undertook to improve the distinction.
I am a bit embarrassed that I did not see sooner that characters are for people and bytes are for computers. Thus Python produces both character and byte serializations for objects. On the coding front: when I first did statistics on computers (1970s), all data were coded with numbers. For instance, Sex: male = 1; female = 2; unknown = 9. In the 1980s, we could use letters (which became ASCII codes): male = 'm'; female = 'f'; unknown = ' '. For a US-only project, this seemed like an advance. So I thought then. For a global project, it would have been the opposite. For a Spanish speaker, 'm' might seem to mean 'mujer' (woman). For many others around the world, euro-indic digits are more familiar and easier to read than Latin letters. I am less ethnocentric now. I'm glad Python has become more of a global language, even if English-based. -- Terry Jan Reedy
On Fri, Jun 3, 2011 at 6:14 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I am a bit embarassed that I did not see sooner that characters are for people and bytes for computers. Thus Python produces both character and byte serializations for objects.
FWIW, even after being involved in the assorted bytes/str design discussions for Py3k, I didn't really "get it" myself until I made the changes to urllib.parse in Python 3.2 to get most of the APIs to accept both str objects and byte sequences. The contrast between my first attempt (which tried to provide a common code path that handled both strings and byte sequences without trashing the encoding of the latter) and my second (which just decodes and reencodes byte sequences using strict ASCII and punts on malformed URLs containing non-ASCII values) was amazing. My original plan was to benchmark them before choosing, but the latter approach was so much simpler and cleaner than the former that it wasn't even a contest. Focusing efforts on things like PEP 393, and perhaps even a memoryview based "strview" is likely to be a more fruitful way forward than trying to shoehorn text-specific concerns into the general binary storage types (and, as noted, the long release cycle means the standard library is the wrong place for that kind of experimentation). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
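The decode-and-reencode pattern Nick describes can be sketched like this; `process_url_part` is a hypothetical stand-in for the real urllib.parse functions:

```python
# Accept either str or bytes, coerce bytes to str via strict ASCII
# (punting on malformed non-ASCII input), do all the real work on str,
# and re-encode the result only if the caller passed bytes.
def process_url_part(part):
    """Lowercase the scheme of a URL fragment; str in -> str out,
    bytes in -> bytes out (a toy stand-in for real URL parsing)."""
    is_bytes = isinstance(part, (bytes, bytearray))
    if is_bytes:
        part = part.decode('ascii')    # strict: non-ASCII raises here
    scheme, sep, rest = part.partition(':')
    result = scheme.lower() + sep + rest
    return result.encode('ascii') if is_bytes else result

print(process_url_part('HTTP://example.com'))   # 'http://example.com'
print(process_url_part(b'HTTP://example.com'))  # b'http://example.com'
```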
participants (16)
- Bill Janssen
- Carl M. Johnson
- Eric Smith
- Ethan Furman
- Greg Ewing
- Guido van Rossum
- INADA Naoki
- M.-A. Lemburg
- Masklinn
- MRAB
- Nick Coghlan
- Paul Colomiets
- Raymond Hettinger
- Stefan Behnel
- Stephen J. Turnbull
- Terry Reedy