Parameters of str(), bytes() and bytearray()
Currently str() takes up to 3 arguments. All are optional and positional-or-keyword. All combinations are valid:
    str()
    str(object=object)
    str(object=buffer, encoding=encoding)
    str(object=buffer, errors=errors)
    str(object=buffer, encoding=encoding, errors=errors)
    str(encoding=encoding)
    str(errors=errors)
    str(encoding=encoding, errors=errors)
The last three are especially surprising. If you do not specify an object, str() ignores the values of encoding and errors and returns an empty string.
bytes() and bytearray() are more limited. Valid combinations are:
    bytes()
    bytes(source=object)
    bytes(source=string, encoding=encoding)
    bytes(source=string, encoding=encoding, errors=errors)
I propose several changes:
1. Forbid calling str() without object if encoding or errors are specified. It is very unlikely that this can break real code, so I propose to make it an error without a deprecation period.
2. Make the first parameter of str(), bytes() and bytearray() positional-only. Originally this feature was an implementation artifact: before 3.6, the parameters of a C-implemented function had to be either all positional-only (if it used PyArg_ParseTuple) or all keyword (if it used PyArg_ParseTupleAndKeywords). So str(), bytes() and bytearray() accepted the first parameter by keyword. We already made similar changes for int(), float(), etc: int(x=42) no longer works. It is unlikely that str(object=object) is used in real code, so we can skip a deprecation period for this change too.
3. Make encoding required if errors is specified in str(). This will reduce the number of possible combinations, make str() more similar to bytes() and bytearray(), and simplify the mental model: if encoding is specified, then we decode, and the first argument must be a bytes-like object; otherwise we convert an object to a string using __str__.
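The combinations listed above can be probed directly. What follows is only a minimal sketch, not part of the original message: the helper name `probe` is hypothetical, and the printed results depend on the interpreter version, so no output is claimed here.

```python
def probe(func, *args, **kwargs):
    """Call func and return repr() of the result, or a short error description."""
    try:
        return repr(func(*args, **kwargs))
    except Exception as exc:  # deliberately broad: behavior may differ by version
        return f"{type(exc).__name__}: {exc}"


# Keyword-only calls to str() that the message describes as surprising.
for kwargs in ({}, {"encoding": "ascii"}, {"errors": "ignore"},
               {"encoding": "spam", "errors": "ham"}):
    print(f"str(**{kwargs}) ->", probe(str, **kwargs))

# The same style of call against bytes(), which the message calls more limited.
for kwargs in ({}, {"encoding": "ascii"}, {"errors": "ignore"}):
    print(f"bytes(**{kwargs}) ->", probe(bytes, **kwargs))
```

Running this on the interpreter you care about shows which of the calls are accepted there, without relying on the documentation or the docstring.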
I bet someone in the world has written code like:
    foo = str(**dynamic_args())
And therefore, disabling "silly" combinations of arguments will break their code occasionally.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YMIGWRUE...
Code of Conduct: http://python.org/psf/codeofconduct/
On Mon, Dec 16, 2019 at 4:06 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
15.12.19 16:30, David Mertz wrote:
I bet someone in the world has written code like:
foo = str(**dynamic_args())
And therefore, disabling "silly" combinations of arguments will break their code occasionally.
Do you have real world examples?
I do not! It wasn't me who wrote it :-).
I was really replying to the claim that there was definitely no code in the world the proposed change would break. I think that claim is almost surely false. But maybe it's little enough code that it's worth it (but I think a deprecation period is still needed).
--
Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
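To make the `str(**dynamic_args())` concern concrete, here is a rough sketch of the kind of code that could be affected. Both `dynamic_args` (given a parameter here for illustration) and `to_text` are hypothetical; nobody in the thread reported real code of this shape.

```python
def dynamic_args(config):
    """Hypothetical helper: build keyword arguments for str() from a config dict."""
    kwargs = {}
    if "payload" in config:
        kwargs["object"] = config["payload"]    # first parameter passed by keyword
    if "encoding" in config:
        kwargs["encoding"] = config["encoding"]
    return kwargs


def to_text(config):
    """Relies on str() accepting any combination of keywords, as described above."""
    return str(**dynamic_args(config))


print(repr(to_text({"payload": b"abc", "encoding": "ascii"})))
print(repr(to_text({"encoding": "ascii"})))   # no object at all
```

Under proposal 1 the second call would raise, and under proposal 2 the first would too, because `object` could no longer be passed by keyword.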
Serhiy Storchaka wrote:
Forbids calling str() without object if encoding or errors are specified. It is very unlikely that this can break a real code, so I propose to make it an error without a deprecation period.
+1, I suspect that nobody would intentionally pass an argument to the encoding and/or errors parameter(s) without specifying an object. Returning an empty string from this seems like it would cover up bugs rather than be useful in any capacity.
Serhiy Storchaka wrote:
2. Make the first parameter of str(), bytes() and bytearray() positional-only.
+1, I don't think I've ever seen a single instance of code that passes the first parameter, *object*, as a kwarg: str(object=obj). As long as the other two parameters, *encoding* and *errors*, remain keyword arguments, I think this would make sense.
Serhiy Storchaka wrote:
3. Make encoding required if errors is specified in str(). This will reduce the number of possible combinations, makes str() more similar to bytes() and bytearray() and simplify the mental model: if encoding is specified, then we decode, and the first argument must be a bytes-like object, otherwise we convert an object to a string using __str__.
Hmm, I think this one might require some further consideration. But I will say that the implicit behavior is not very obvious.
This isn't overly clear, since it performs an implicit 'utf-8' conversion:
    >>> str(b'\xc3\xa1', errors='strict')
    'á'
Makes sense, and is highly explicit:
    >>> str(b'\xc3\xa1', encoding='utf-8', errors='strict')
    'á'
This is also fine ('strict' is a very reasonable default for *errors*):
    >>> str(b'\xc3\xa1', encoding='utf-8')
    'á'
On a related note though, I'm not a fan of this behavior:
    >>> str(b'\xc3\xa1')
    "b'\\xc3\\xa1'"
Passing a bytes object to str() without specifying an encoding seems like a mistake; I honestly don't see how this ("b'\\xc3\\xa1'") would even be useful in any capacity. I would expect this to instead raise a TypeError, similar to passing a string to bytes() without specifying an encoding:
    >>> bytes('á')
    ...
    TypeError: string argument without an encoding
I'd much prefer to see something like this:
    >>> str(b'\xc3\xa1')
    ...
    TypeError: bytes argument without an encoding
Is there some use case for returning "b'\\xc3\\xa1'" from this operation that I'm not seeing? To me, it seems equally, if not more confusing and pointless than returning an empty string from str(errors='strict') or some other combination of *errors* and *encoding* kwargs without passing an object.
On Mon, Dec 16, 2019 at 12:00 PM Kyle Stanley <aeros167@gmail.com> wrote:
On a related note though, I'm not a fan of this behavior:
    >>> str(b'\xc3\xa1')
    "b'\\xc3\\xa1'"
Passing a bytes object to str() without specifying an encoding seems like a mistake, I honestly don't see how this ("b'\\xc3\\xa1'") would even be useful in any capacity. I would expect this to instead raise a TypeError, similar to passing a string to bytes() without specifying an encoding:
    >>> bytes('á')
    ...
    TypeError: string argument without an encoding
I'd much prefer to see something like this:
    >>> str(b'\xc3\xa1')
    ...
    TypeError: bytes argument without an encoding
Is there some use case for returning "b'\\xc3\\xa1'" from this operation that I'm not seeing? To me, it seems equally, if not more confusing and pointless than returning an empty string from str(errors='strict') or some other combination of *errors* and *encoding* kwargs without passing an object.
ANY object can be passed to str() in order to get some sort of valid printable form. The awkwardness comes from the fact that str() performs double duty - it's both "give me a printable form of this object" and "decode these bytes into text". With an actual bytes object, I always prefer b.decode(...) to str(b, encoding=...). But the one-arg form of str() needs to be able to represent a bytes object in some way, just as it can represent an int, a Fraction, or a list.
ChrisA
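The "double duty" can be seen side by side in a few lines. This is just an illustrative sketch with made-up variable names, assuming an interpreter started without the -b or -bb options.

```python
data = "á".encode("utf-8")                    # the two bytes b'\xc3\xa1'

printable = str(data)                         # duty 1: a printable form of the object
decoded = data.decode("utf-8")                # duty 2, spelled as a method call
also_decoded = str(data, encoding="utf-8")    # duty 2, spelled through str()

print(printable == repr(data))                # one-argument str() falls back to repr()
print(decoded == also_decoded)                # both spellings decode the same way
print(len(printable), len(decoded))           # a long escaped form vs. a single character
```

The method call makes the intent to decode explicit, which is presumably why it is preferred in the message above.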
Chris Angelico wrote:
ANY object can be passed to str() in order to get some sort of valid printable form. The awkwardness comes from the fact that str() performs double duty - it's both "give me a printable form of this object" and "decode these bytes into text".
While it does make sense for str() to be able to give some form of printable form for any object, I suppose that I just don't consider something like this: "b'\\xc3\\xa1'" to be overly useful, at least for any practical purposes. Can anyone think of a situation where you would want a string representation of a bytes object instead of decoding it?
If not, I think it would be more useful for it to either:
1) Raise a TypeError, assume that the user wanted to decode the string but forgot to specify an encoding
2) Implicitly decode the bytes object as UTF-8, assume the user meant str(bytes_obj, encoding='utf-8')
Personally, I'm more in favor of (1) since it's much more explicit and obvious, but I think (2) would at least be more useful than the current behavior.
On 12/16/2019 12:05 AM, Kyle Stanley wrote:
Chris Angelico wrote:
ANY object can be passed to str() in order to get some sort of valid printable form. The awkwardness comes from the fact that str() performs double duty - it's both "give me a printable form of this object" and "decode these bytes into text".
While it does make sense for str() to be able to give some form of printable form for any object, I suppose that I just don't consider something like this: "b'\\xc3\\xa1'" to be overly useful, at least for any practical purposes. Can anyone think of a situation where you would want a string representation of a bytes object instead of decoding it?
Binary data
On 12/16/2019 3:05 AM, Kyle Stanley wrote:
Chris Angelico wrote:
ANY object can be passed to str() in order to get some sort of valid printable form. The awkwardness comes from the fact that str() performs double duty - it's both "give me a printable form of this object" and "decode these bytes into text".
While it does make sense for str() to be able to give some form of printable form for any object, I suppose that I just don't consider something like this: "b'\\xc3\\xa1'" to be overly useful, at least for any practical purposes. Can anyone think of a situation where you would want a string representation of a bytes object instead of decoding it?
Debugging. I sometimes do things like: print('\n'.join(str(thing) for thing in lst)), or various variations on this. This is especially useful when maybe something in the list is a bytes object where I was expecting a string.
I'm not saying it's the best practice, but calling str() on an object is currently a guaranteed way of making a string out of it, and I don't think we can change it.
Eric
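A small sketch of that debugging pattern, with an invented list standing in for `lst`:

```python
items = ["spam", b"eggs", 42, None]   # a bytes object hiding among the strings

# str() accepts every element, so the join cannot fail on a type mismatch,
# and the b'...' prefix makes the unexpected bytes object stand out.
report = "\n".join(str(thing) for thing in items)
print(report)
```

If str() raised for bytes, this one-liner would fail exactly when it is most useful, namely when the list contains something unexpected.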
Eric V. Smith wrote:
Debugging. I sometimes do things like: print('\n'.join(str(thing) for thing in lst)), or various variations on this. This is especially useful when maybe something in the list is a bytes object where I was expecting a string.
I'm not saying it's the best practice, but calling str() on an object is currently a guaranteed way of making a string out of it, and I don't think we can change it.
I could see that being useful actually. Regardless of "best practices", it's reasonably common to indiscriminately convert a large sequence of objects into strings for basic inspection purposes. There may be better means of debugging, but I wouldn't want to prevent that option entirely for bytes objects.
But, I suspect that backwards compatibility might be too much of a concern here for the change to be worthwhile either way. Adding the TypeError or even gradual deprecation would more than likely lead to a decent amount of code breakage and maintenance; and changing it to implicitly perform a UTF-8 decoding would very likely cause some confusion and debugging difficulties for those who frequently inspect via string conversion.
Thanks for the insight.
16.12.19 10:34, Eric V. Smith wrote:
On 12/16/2019 3:05 AM, Kyle Stanley wrote:
Chris Angelico wrote:
ANY object can be passed to str() in order to get some sort of valid printable form. The awkwardness comes from the fact that str() performs double duty - it's both "give me a printable form of this object" and "decode these bytes into text".
While it does make sense for str() to be able to give some form of printable form for any object, I suppose that I just don't consider something like this: "b'\\xc3\\xa1'" to be overly useful, at least for any practical purposes. Can anyone think of a situation where you would want a string representation of a bytes object instead of decoding it?
Debugging. I sometimes do things like: print('\n'.join(str(thing) for thing in lst)), or various variations on this. This is especially useful when maybe something in the list is a bytes object where I was expecting a string.
I usually create a list:
    print([a, b, c])
It guarantees that repr() is used instead of str(). It also makes the debug output more distinguishable from normal output.
I use %r or !r when including an arbitrary object in logging or error messages. It is safer for several reasons.
But I agree that making str() fail for bytes can break a lot of existing code.
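The difference between the str()-based and repr()-based approaches is easy to show. A minimal sketch with made-up values:

```python
a, b, c = "eggs", b"eggs", 42

as_str = " ".join(str(x) for x in (a, b, c))   # str(): the text loses its quotes
as_list = str([a, b, c])                       # a list formats its items with repr()
with_percent_r = "got %r" % (b,)               # %r in old-style formatting
with_bang_r = f"got {a!r}"                     # !r in an f-string

print(as_str)
print(as_list)
print(with_percent_r)
print(with_bang_r)
```

With repr() the string `'eggs'` and the bytes object `b'eggs'` are both quoted, so they stay distinguishable from each other and from surrounding text.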
16.12.19 02:55, Kyle Stanley wrote:
I'd much prefer to see something like this:
    >>> str(b'\xc3\xa1')
    ...
    TypeError: bytes argument without an encoding
Is there some use case for returning "b'\\xc3\\xa1'" from this operation that I'm not seeing? To me, it seems equally, if not more confusing and pointless than returning an empty string from str(errors='strict') or some other combination of *errors* and *encoding* kwargs without passing an object.
It is no more confusing than returning "<Foo object at 0x1234abcd>". By default str() returns the same as repr(), unless the object defines a different string representation.
You can get an error here if you run Python with -bb. This is a temporary option to catch common errors of porting from Python 2.
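The effect of -bb can only be observed from a fresh interpreter, so the sketch below starts child processes. The helper name `run_snippet` is hypothetical, and the code assumes an interpreter that still supports the -bb option, as the CPython releases contemporary with this thread do.

```python
import subprocess
import sys


def run_snippet(flags):
    """Run str(b'abc') in a child interpreter started with the given flags."""
    cmd = [sys.executable, *flags, "-c", "print(str(b'abc'))"]
    return subprocess.run(cmd, capture_output=True, text=True)


plain = run_snippet([])          # default: str() returns the printable form
strict = run_snippet(["-bb"])    # -bb: the same call becomes an error

print(plain.returncode, plain.stdout.strip())
print(strict.returncode, strict.stderr.strip().splitlines()[-1:])
```

The exception reported under -bb is a BytesWarning promoted to an error, not the TypeError suggested earlier in the thread.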
Serhiy Storchaka wrote:
It is not more confusing that returning "<Foo object at 0x1234abcd>". By default str() returns the same as repr(), unless we made the object having other string representation.
Yeah, I suppose not. But that does raise the question of why bytes objects were made to have a specific form of string representation in the first place, instead of the generic object address repr. I suspect that it might be for historical or arbitrary reasons. But, that's likely an entirely different topic. I'll leave it at that so I don't derail the main topic.
Serhiy Storchaka wrote:
You can get an error here if you run Python with -bb. This is a temporary option to catch common errors of porting from Python 2.
Huh, interesting.
On Sun, Dec 15, 2019 at 11:07 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
I propose several changes:
1. Forbids calling str() without object if encoding or errors are specified. It is very unlikely that this can break a real code, so I propose to make it an error without a deprecation period.
2. Make the first parameter of str(), bytes() and bytearray() positional-only. Originally this feature was an implementation artifact: before 3.6 parameters of a C implemented function should be either all positional-only (if used PyArg_ParseTuple), or all keyword (if used PyArg_ParseTupleAndKeywords). So str(), bytes() and bytearray() accepted the first parameter by keyword. We already made similar changes for int(), float(), etc: int(x=42) no longer works.
Unlikely str(object=object) is used in a real code, so we can skip a deprecation period for this change too.
+1 for 1 and 2.
3. Make encoding required if errors is specified in str(). This will reduce the number of possible combinations, makes str() more similar to bytes() and bytearray() and simplify the mental model: if encoding is specified, then we decode, and the first argument must be a bytes-like object, otherwise we convert an object to a string using __str__.
-0. We can omit `encoding="utf-8"` in bytes.decode() because the default encoding is always UTF-8.
    >>> x = "おはよう".encode()
    >>> x.decode(errors="strict")
    'おはよう'
So allowing `bytes(o, errors="replace")` instead of making encoding mandatory also makes sense to me.
Regards,
-- Inada Naoki <songofacandy@gmail.com>
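The same point as a runnable snippet, with nothing assumed beyond the standard str.encode() and bytes.decode() methods:

```python
text = "おはよう"

raw = text.encode()                         # encoding omitted: UTF-8 is the default
explicit = text.encode("utf-8")

round_trip = raw.decode(errors="strict")    # errors given, encoding still omitted

print(raw == explicit)
print(round_trip == text)
```

Both methods accept `errors` without `encoding`, which is the precedent being pointed to.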
On Mon, Dec 16, 2019 at 6:25 PM Inada Naoki <songofacandy@gmail.com> wrote:
+1 for 1 and 2.
If we find it broke some software, we can step back to regular deprecation workflow. Python 3.9 is still far from beta yet. That's why I'm +1 on these proposals. -- Inada Naoki <songofacandy@gmail.com>
Inada Naoki wrote:
If we find it broke some software, we can step back to regular deprecation workflow. Python 3.9 is still far from beta yet. That's why I'm +1 on these proposals.
IMO, since this would be changing a builtin function, we should at least use a version+2 deprecation cycle (in this case, removal in 3.11) regardless of reported breakages. Especially if there's no _substantial_ security, efficiency, or performance reason for immediate prevention of str() without passing an object (while specifying *encoding* and/or *errors*) or making *object* a positional only argument.
Kyle Stanley wrote:
or making *object* a positional only argument.
Typo: I meant "positional only parameter", not "argument".
On Sun, Dec 15, 2019 at 6:09 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
Currently str() takes up to 3 arguments. All are optional and positional-or-keyword. All combinations are valid:
    str()
    str(object=object)
    str(object=buffer, encoding=encoding)
    str(object=buffer, errors=errors)
    str(object=buffer, encoding=encoding, errors=errors)
    str(encoding=encoding)
    str(errors=errors)
    str(encoding=encoding, errors=errors)
The last three are especially surprising. If you do not specify an object, str() ignores values of encoding and errors and returns an empty string.
bytes() and bytearray() are more limited. Valid combinations are:
    bytes()
    bytes(source=object)
    bytes(source=string, encoding=encoding)
    bytes(source=string, encoding=encoding, errors=errors)
I propose several changes:
1. Forbids calling str() without object if encoding or errors are specified. It is very unlikely that this can break a real code, so I propose to make it an error without a deprecation period.
What problem are you trying to solve with this proposal? I am only -0 on this, but I am wondering why bother with the churn.
2. Make the first parameter of str(), bytes() and bytearray() positional-only. Originally this feature was an implementation artifact: before 3.6 parameters of a C implemented function should be either all positional-only (if used PyArg_ParseTuple), or all keyword (if used PyArg_ParseTupleAndKeywords). So str(), bytes() and bytearray() accepted the first parameter by keyword. We already made similar changes for int(), float(), etc: int(x=42) no longer works.
I am +1 on this. Your reasoning is spot on. (Note that str() must work -- all builtin types can be called without arguments and will return a "zero" element of the right type.)
Unlikely str(object=object) is used in a real code, so we can skip a deprecation period for this change too.
Likely.
3. Make encoding required if errors is specified in str(). This will reduce the number of possible combinations, makes str() more similar to bytes() and bytearray() and simplify the mental model: if encoding is specified, then we decode, and the first argument must be a bytes-like object, otherwise we convert an object to a string using __str__.
I'm -0 on this. It seems that the presence of either errors= or encoding= causes str() to switch to "decode bytes" semantics, and a default decoding of UTF-8. That default makes sense: UTF-8 is our default source encoding, and we are trending to use it as the default in other places. I doubt that such calls would confuse anyone.
--
--Guido van Rossum (python.org/~guido)
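The switch to "decode bytes" semantics can be observed with `errors=` alone. This is a short sketch of the behavior being discussed, and it would stop working if proposal 3 were adopted.

```python
good = b"\xc3\xa1"     # valid UTF-8 for a single character
bad = b"\xff"          # not valid UTF-8

as_printable = str(good)                    # no keywords: printable form
as_text = str(good, errors="strict")        # errors= alone: decoded as UTF-8
replaced = str(bad, errors="replace")       # the undecodable byte is replaced

print(ascii(as_printable))
print(ascii(as_text))
print(ascii(replaced))
```

`ascii()` is used for printing only so that the output does not depend on the terminal encoding.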
16.12.19 18:35, Guido van Rossum wrote:
On Sun, Dec 15, 2019 at 6:09 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
1. Forbids calling str() without object if encoding or errors are specified. It is very unlikely that this can break a real code, so I propose to make it an error without a deprecation period.
What problem are you trying to solve with this proposal? I am only -0 on this, but I am wondering why bother with the churn.
Initially I wanted to check the documentation and the docstrings of str() and fix them if needed. It was inspired by the Discourse topic [1]. I have found that, contrary to the OP's claim, the documentation is correct, but the docstring is not.
The documentation is correct (because Chris Jerdonek accurately documented the actual behavior in 2012 [2]), but ambiguous.
    str(object='')
    str(object=b'', encoding='utf-8', errors='strict')
0- and 1-argument calls match both signatures. It also implies that str(encoding='ascii') and str(errors='ignore') are valid, and this is true! Moreover, str(encoding='spam') and str(errors='ham') are valid too, because the values of encoding and errors are ignored. I cannot imagine a use case for this. It looks like an implementation artifact.
The docstring was left unfixed.
    str(object='') -> str
    str(bytes_or_buffer[, encoding[, errors]]) -> str
It uses different names for the first parameter (which would not matter if it were positional-only), it requires bytes_or_buffer for decoding, and it requires encoding if errors is passed.
So my goal is to remove glitches which are not used in real code in any case, and make the behavior closer to the initial intention. If all three of my propositions are applied, the signatures would look like:
    str(object='', /) -> str
    str(bytes_or_buffer, /, encoding, errors='strict') -> str
Almost the same as for bytes:
    bytes(object=b'', /) -> bytes
    bytes(string, /, encoding, errors='strict') -> bytes
[1] https://discuss.python.org/t/str-mybytes-wrong-docs/2866
[2] https://bugs.python.org/issue13538
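As a way of reading the two proposed str() signatures together, here is a rough pure-Python model. It is only a sketch: the name `proposed_str`, the `_MISSING` sentinel, and the error messages are all hypothetical, and the `/` syntax needs Python 3.8 or later.

```python
_MISSING = object()   # sentinel: distinguishes "no argument" from any real object


def proposed_str(obj=_MISSING, /, encoding=None, errors=None):
    """Hypothetical model of str() with all three proposals applied."""
    if encoding is None:
        if errors is not None:
            # Proposals 1 and 3: errors is only meaningful when decoding.
            raise TypeError("errors without an encoding")
        # Conversion form: str(object='', /)
        return str() if obj is _MISSING else str(obj)
    if obj is _MISSING:
        # Proposal 1: encoding needs something to decode.
        raise TypeError("encoding without an object to decode")
    # Decoding form: str(bytes_or_buffer, /, encoding, errors='strict')
    return str(obj, encoding=encoding,
               errors="strict" if errors is None else errors)


print(repr(proposed_str()))
print(repr(proposed_str(42)))
print(repr(proposed_str(b"abc", "ascii")))
```

Proposal 2 comes for free from the `/` marker: calling `proposed_str(object="x")` raises a TypeError because `object` is not a keyword the function accepts.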
3. Make encoding required if errors is specified in str(). This will reduce the number of possible combinations, makes str() more similar to bytes() and bytearray() and simplify the mental model: if encoding is specified, then we decode, and the first argument must be a bytes-like object, otherwise we convert an object to a string using __str__.
I'm -0 on this. It seems that the presence of either errors= or encoding= causes str() to switch to "decode bytes" semantics, and a default decoding of UTF-8. That default makes sense: UTF-8 is our default source encoding, and we are trending to use it as the default in other places. I doubt that such calls would confuse anyone.
This is the proposal I am least sure about. On the one hand, the bytes() constructor requires encoding when encoding a string. On the other hand, it is optional in str.encode() and bytes.decode(). But str.encode() and bytes.decode() each have only one function, so you can omit both encoding and errors without ambiguity. If we allow str(bytes_or_buffer, errors=errors), shouldn't we also allow bytes(string, errors=errors)?
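The asymmetry in question can be checked directly. The following is only a sketch of current CPython behavior; try_call is a hypothetical helper introduced here so that failures are reported instead of raised.

```python
def try_call(func, *args, **kwargs):
    """Return ("ok", value) on success or ("error", exception type name)."""
    try:
        return ("ok", func(*args, **kwargs))
    except Exception as exc:
        return ("error", type(exc).__name__)


text = "caf\u00e9"
data = b"caf\xc3\xa9"  # the same text encoded as UTF-8

# str() accepts errors without encoding and decodes with the UTF-8 default.
print(try_call(str, data, errors="strict"))
# bytes() rejects errors without encoding.
print(try_call(bytes, text, errors="strict"))
# With an explicit encoding, bytes() accepts errors.
print(try_call(bytes, text, encoding="utf-8", errors="strict"))
# The methods have a single purpose, so encoding may be omitted.
print(try_call(text.encode, errors="strict"))
print(try_call(data.decode, errors="strict"))
```

Only the second call fails, with a TypeError, which is the inconsistency the question above points at.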
On Mon, Dec 16, 2019 at 12:04 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
16.12.19 18:35, Guido van Rossum wrote:
On Sun, Dec 15, 2019 at 6:09 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
1. Forbids calling str() without object if encoding or errors are specified. It is very unlikely that this can break a real code, so I propose to make it an error without a deprecation period.
What problem are you trying to solve with this proposal? I am only -0 on this, but I am wondering why bother with the churn.
Initially I wanted to check the documentation and the docstrings of str() and fix them if needed. This was inspired by the Discourse topic [1]. I found that, contrary to the OP's claim, the documentation is correct, but the docstring is not.
So let's fix the docstring.
The documentation is correct (because Chris Jerdonek accurately documented the actual behavior in 2012 [2]), but ambiguous.
    str(object='')
    str(object=b'', encoding='utf-8', errors='strict')
Honestly this notation leaves a lot unsaid. Apparently the first form allows `object` to have any type, while the second only allows it to be bytes (or bytearray, or memoryview, or presumably anything that supports the buffer protocol?). And it appears unnecessary to specify a default in the first case -- then the 0-args form would only match the second pattern.
0- and 1-argument calls match both signatures. Also it implies that str(encoding='ascii') and str(errors='ignore') are valid, and this is true!
And the docs spell this out clearly enough that I don't see any reason to change it. This is a function that is *so* common that *any* tweak we make to it will break someone's code.
And more, str(encoding='spam') and str(errors='ham') are valid too, because the values of encoding and errors are ignored. I cannot imagine a use case for this. It looks like an implementation artifact.
But again one that we can't change. At least for errors='ham', this seems to be the case for all encoding/decoding functions -- the error handler is looked up lazily, and an empty input string doesn't need it. b''.decode(errors="ham") acts the same way. In fact, it's the same for b.decode(encoding='spam'). So str() is not special here, and I recommend keeping it that way.
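The lazy lookup described here can be probed with a small sketch. decode_outcome is a hypothetical helper, not an existing API. The results for empty input are deliberately not stated as fixed: on a default CPython build they are an empty string, but a debug build or Python Development Mode (-X dev, Python 3.9 and later) validates encoding and errors eagerly and raises LookupError instead.

```python
def decode_outcome(data, **kwargs):
    """Return repr() of the decoded text, or a LookupError description."""
    try:
        return repr(data.decode(**kwargs))
    except LookupError as exc:
        return f"LookupError: {exc}"


# Invalid UTF-8 input forces the error handler to be looked up.
print(decode_outcome(b"\xff", errors="ham"))
# Non-empty input forces the codec to be looked up.
print(decode_outcome(b"abc", encoding="spam"))
# Empty input needs neither codec nor handler; the outcome depends on
# whether the interpreter validates the arguments eagerly.
print(decode_outcome(b"", errors="ham"))
print(decode_outcome(b"", encoding="spam"))
```

The first two calls always report a LookupError; the last two show whether a given interpreter checks the arguments before looking at the data.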
The docstring was left unfixed.
    str(object='') -> str
    str(bytes_or_buffer[, encoding[, errors]]) -> str
It uses different names for the first parameter (this would not matter if it were positional-only), it requires bytes_or_buffer for decoding, and it requires encoding if errors is passed.
So my goal is to remove glitches that are not used in real code anyway, and to bring the behavior closer to the original intention. If all three of my proposals were applied, the signatures would look like:
    str(object='', /) -> str
    str(bytes_or_buffer, /, encoding, errors='strict') -> str
Almost the same as for bytes:
    bytes(object=b'', /) -> bytes
    bytes(string, /, encoding, errors='strict') -> bytes
bytes() and str() just aren't each other's opposite -- bytes() really only takes str input, but str() takes any input. So there's always going to be a discrepancy. I now think the current behavior should not change.
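A short sketch of that discrepancy, assuming current CPython behavior: str() falls back to __str__ for any object, while bytes() with an encoding accepts only a str. Without an encoding, bytes() also accepts integers, iterables of integers and buffer objects, which this sketch leaves out. Point and outcome are hypothetical names used only for illustration.

```python
class Point:
    """Hypothetical class, used only to show that str() accepts any object."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        return f"({self.x}, {self.y})"


def outcome(func, *args):
    """Return the call's result, or the exception type name if it raises."""
    try:
        return func(*args)
    except TypeError as exc:
        return type(exc).__name__


# str() without encoding converts any object through __str__.
print(outcome(str, Point(1, 2)))
print(outcome(str, 3.5))
# str() with an encoding decodes bytes-like input and rejects str input.
print(outcome(str, b"abc", "ascii"))
print(outcome(str, "abc", "ascii"))
# bytes() with an encoding encodes str input and rejects everything else.
print(outcome(bytes, "abc", "ascii"))
print(outcome(bytes, b"abc", "ascii"))
print(outcome(bytes, Point(1, 2), "ascii"))
```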
[1] https://discuss.python.org/t/str-mybytes-wrong-docs/2866 [2] https://bugs.python.org/issue13538
3. Make encoding required if errors is specified in str(). This will reduce the number of possible combinations, makes str() more similar to bytes() and bytearray() and simplify the mental model: if encoding is specified, then we decode, and the first argument must be a bytes-like object, otherwise we convert an object to a string using __str__.
I'm -0 on this. It seems that the presence of either errors= or encoding= causes str() to switch to "decode bytes" semantics, and a default decoding of UTF-8. That default makes sense: UTF-8 is our default source encoding, and we are trending to use it as the default in other places. I doubt that such calls would confuse anyone.
This is the proposal I am least sure about. On the one hand, the bytes() constructor requires encoding when encoding a string. On the other hand, it is optional in str.encode() and bytes.decode(). But str.encode() and bytes.decode() each have only one function, so you can omit both encoding and errors without ambiguity.
If we allow str(bytes_or_buffer, errors=errors), shouldn't we also allow bytes(string, errors=errors)?
Not necessarily. There's an old saying in PEP 8 about foolish consistency...
--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
The docstring was left unfixed.
    str(object='') -> str
    str(bytes_or_buffer[, encoding[, errors]]) -> str
I noticed this too; the doc and docstring should be made to agree with each other and the code. While exploring the actual behavior, I discovered that while the presence of encoding triggers decoding of bytes, the encoding is not needed, and hence not checked, for empty bytes. Hence an invalid encoding is OK in this edge case.
>>> b''.decode('0')
''
>>> str(b'','0')
''
>>> str(b'')
"b''"
Should this be at least tested if not documented? (So that other implementations know to check the bytes value before the encoding value?)
--
Terry Jan Reedy