While I was implementing JSON-JWS (JSON web signatures), a format which in Python 3 has to go from bytes > unicode > bytes > unicode several times in its construction, I notice I wrote a lot of bugs:
"sha256=b'abcdef1234'"
When I meant to say:
"sha256=abcdef1234"
Everything worked perfectly on Python 3 because the verifying code also generated the sha256=b'abcdef1234' as a comparison. I would have never noticed at all unless I had tried to verify the Python 3 output with Python 2.
I know I'm a bad person for not having unit tests capable enough to catch this bug, a bug I wrote repeatedly in each layer of the bytes > unicode > bytes > unicode dance, and that there is no excuse for being confused at any time about the type of a variable, but I'm not willing to reform.
Instead, I would like a new string formatting operator tentatively called 'notbytes': "sha256=%notbytes" % (b'abcdef1234'). It gives the same error as 'sha256='+b'abc1234' would: TypeError: Can't convert 'bytes' object to str implicitly
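The failure mode described in this message is easy to reproduce. The following is a minimal sketch, assuming a hypothetical `digest` value (the actual JWS code is not shown in the thread); it only illustrates how `%s` silently turns a bytes object into its repr on Python 3.

```python
# Hypothetical digest value; in real code this would come from an
# encoding step that returns bytes rather than str.
digest = b"abcdef1234"

# "%s" calls str() on its argument, and str() of a bytes object is its repr,
# so the b prefix and the quotes end up in the output.
buggy = "sha256=%s" % (digest,)

# Decoding explicitly first gives the text that was intended.
intended = "sha256=%s" % (digest.decode("ascii"),)

print(buggy)
print(intended)
```

Both lines run without error under a default interpreter, which is why the bug goes unnoticed when the verifier makes the same mistake.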
On Fri, 24 Aug 2012 13:26:49 -0400 Daniel Holth <dholth@gmail.com> wrote:
While I was implementing JSON-JWS (JSON web signatures), a format which in Python 3 has to go from bytes > unicode > bytes > unicode several times in its construction, I notice I wrote a lot of bugs:
"sha256=b'abcdef1234'"
When I meant to say:
"sha256=abcdef1234"
Everything worked perfectly on Python 3 because the verifying code also generated the sha256=b'abcdef1234' as a comparison. I would have never noticed at all unless I had tried to verify the Python 3 output with Python 2.
You can use the -bb flag to raise BytesWarnings in such cases:
    $ python3 -bb
    Python 3.2.2+ (3.2:9ef20fbd340f, Oct 15 2011, 21:22:07)
    [GCC 4.5.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> str(b'foo')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BytesWarning: str() on a bytes instance
    >>> "%s" % (b'foo',)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BytesWarning: str() on a bytes instance
    >>> "{}".format(b'foo')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BytesWarning: str() on a bytes instance
Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
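The -bb behaviour can also be checked from a script rather than an interactive session. This is a rough sketch, assuming the helper name `run_with_bb` (invented here for illustration); it starts a child interpreter with -bb and reports whether the snippet failed.

```python
import subprocess
import sys


def run_with_bb(snippet):
    """Run a one-line snippet in a child interpreter started with -bb."""
    return subprocess.run(
        [sys.executable, "-bb", "-c", snippet],
        capture_output=True,
        text=True,
    )


# str() on a bytes instance is turned into an error by -bb.
result = run_with_bb("str(b'foo')")
print(result.returncode)
print(result.stderr)
```

A test suite could use the same idea by running its own test command under `python -bb`, so that any accidental str(bytes) fails loudly.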
On 24/08/2012 18:26, Daniel Holth wrote:
While I was implementing JSON-JWS (JSON web signatures), a format which in Python 3 has to go from bytes > unicode > bytes > unicode several times in its construction, I notice I wrote a lot of bugs:
"sha256=b'abcdef1234'"
When I meant to say:
"sha256=abcdef1234"
Everything worked perfectly on Python 3 because the verifying code also generated the sha256=b'abcdef1234' as a comparison. I would have never noticed at all unless I had tried to verify the Python 3 output with Python 2.
I know I'm a bad person for not having unit tests capable enough to catch this bug, a bug I wrote repeatedly in each layer of the bytes > unicode > bytes > unicode dance, and that there is no excuse for being confused at any time about the type of a variable, but I'm not willing to reform.
Instead, I would like a new string formatting operator tentatively called 'notbytes': "sha256=%notbytes" % (b'abcdef1234'). It gives the same error as 'sha256='+b'abc1234' would: TypeError: Can't convert 'bytes' object to str implicitly
Why are you singling out 'bytes'? The "%s" format specifier (or "{:s}" with the .format method) will accept a whole range of values, including ints and lists, which, when concatenated, will raise a TypeError. Why should 'bytes' be different? There _are_ certain number-only formats, so perhaps what you should be asking for is a string-only format.
String only would be perfect. I only single out bytes because they are more like strings than any other type.
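The contrast drawn here between the coercive "%s" and the number-only formats can be seen directly. A small sketch, assuming a helper named `accepts` that exists only for this illustration:

```python
def accepts(fmt, value):
    """Return True if the %-style format string fmt accepts value."""
    try:
        fmt % (value,)
    except TypeError:
        return False
    return True


# "%s" coerces every argument through str(), bytes included.
coerced = ["%s" % (v,) for v in (42, [1, 2], b"abc")]
print(coerced)

# "%d" is number-only: it accepts an int but rejects a str.
print(accepts("%d", 42))
print(accepts("%d", "abc"))
```

There is no built-in counterpart of "%d" that accepts only str, which is the gap the thread is discussing.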
On Fri, 24 Aug 2012 14:33:48 -0400 Daniel Holth <dholth@gmail.com> wrote:
String only would be perfect. I only single out bytes because they are more like strings than any other type.
You can use concatenation instead of (or in addition to) formatting:
    >>> "" + "foo"
    'foo'
    >>> "" + b"foo"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'bytes' object to str implicitly
    >>> "" + 42
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'int' object to str implicitly
Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
Yes, if I wanted to pretend I was using JavaScript.
A string-only formatter might cause problems with translation string / gettext type objects?
On Fri, Aug 24, 2012 at 2:40 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 24 Aug 2012 14:33:48 -0400 Daniel Holth <dholth@gmail.com> wrote:
String only would be perfect. I only single out bytes because they are more like strings than any other type.
You can use concatenation instead of (or in addition to) formatting:
    >>> "" + "foo"
    'foo'
    >>> "" + b"foo"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'bytes' object to str implicitly
    >>> "" + 42
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'int' object to str implicitly
Regards
Antoine.
-- Software development and contracting: http://pro.pitrou.net
_______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
On Fri, 24 Aug 2012 14:57:08 -0400 Daniel Holth <dholth@gmail.com> wrote:
Yes, if I wanted to pretend I was using JavaScript.
???
A string-only formatter might cause problems with translation string / gettext type objects?
The question is rather: is it worth it? We certainly don't want to create formatters for every existing use case. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
On 25/08/12 04:57, Daniel Holth wrote:
Yes, if I wanted to pretend I was using JavaScript.
I'm not entirely sure what you are responding to here -- the context is lost when you top post like that. I'm *guessing* that you are responding to Antoine's advice to use concatenation. If so, why do you think that concatenation is "pretending" to be using Javascript? It is a perfectly valid operation in Python, and many languages which predate Javascript include concatenation.
A string-only formatter might cause problems with translation string / gettext type objects?
On Fri, Aug 24, 2012 at 2:40 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 24 Aug 2012 14:33:48 -0400 Daniel Holth <dholth@gmail.com> wrote:
String only would be perfect. I only single out bytes because they are more like strings than any other type.
You can use concatenation instead of (or in addition to) formatting:
    >>> "" + "foo"
    'foo'
    >>> "" + b"foo"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'bytes' object to str implicitly
    >>> "" + 42
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'int' object to str implicitly
Regards
Antoine.
-- Steven
On Fri, Aug 24, 2012 at 3:07 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On 25/08/12 04:57, Daniel Holth wrote:
Yes, if I wanted to pretend I was using JavaScript.
I'm not entirely sure what you are responding to here -- the context is lost when you top post like that. I'm *guessing* that you are responding to Antoine's advice to use concatenation.
I am only trying to say that I like using the string formatting operations and I think I am justified in using them instead of concatenation. I was merely surprised by the implicit bytes to "b'string'" conversion, and would like to be able to turn it off. I do rather enjoy programming in JavaScript, even though its strings do not have a .format() method.
If so, why do you think that concatenation is "pretending" to be using Javascript? It is a perfectly valid operation in Python, and many languages which predate Javascript include concatenation.
On 24 August 2012 20:21, Daniel Holth <dholth@gmail.com> wrote:
I was merely surprised by the implicit bytes to "b'string'" conversion, and would like to be able to turn it off.
The conversion is not really "implicit". It's precisely what the %s (or {!s}) conversion format *explicitly* requests - insert the str() of the supplied argument at this point in the output string. See library reference 6.1.3 "Format String Syntax" (I don't know if there's an equivalent description for % formatting).
If you want to force an argument to be a string, you could always do something like this:
    def must_be_str(s):
        if isinstance(s, str):
            return s
        raise ValueError
    x = "The value is {}".format(must_be_str(s))
There's no "only insert a string here, raise an error for other types" format specifier, largely because formatting is in principle about *formatting* - converting other types to strings. In practice, most of my uses of formatting (and I suspect many other people's) is more about interpolation - inserting chunks of text into templates. For that application, a stricter form could be more useful, I guess.
I could see value in a {!S} conversion specifier (in the terminology of library reference 6.1.3 "Format String Syntax") which overrode __format__ with a conversion function equivalent to must_be_str above. But I don't know if it would get much use (anyone careful enough to use it is probably careful enough of their types to not need it).
Also, is it *really* what you want? Did your code accidentally pass bytes to a {!s} formatter, and yet *never* pass a number and get the right result? Or conversely, would you be willing to audit all your conversions to be sure that numbers were never passed, and yet *still* not be willing to ensure you have no bytes/str confusion? (Although as your use case was encode/decode dances, maybe bytes really are sufficiently special in your code - but I'd argue that needing to address this issue implies that you have some fairly subtle bugs in your encoding process that you should be fixing before worrying about this).
Paul
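The {!S} idea can be approximated today without changing the language. What follows is only a sketch: the class name `StrictFormatter` and the "S" conversion are hypothetical, built on `string.Formatter.convert_field` from the standard library, and raising TypeError (rather than the ValueError used in must_be_str) is a choice made here to mirror what concatenation does.

```python
import string


class StrictFormatter(string.Formatter):
    """Formatter with a hypothetical 'S' conversion that only accepts str."""

    def convert_field(self, value, conversion):
        if conversion == "S":
            if isinstance(value, str):
                return value
            raise TypeError("expected str, got %s" % type(value).__name__)
        # Defer to the standard handling of 's', 'r' and 'a'.
        return super().convert_field(value, conversion)


fmt = StrictFormatter()
print(fmt.format("sha256={!S}", "abcdef1234"))

try:
    fmt.format("sha256={!S}", b"abcdef1234")
except TypeError as exc:
    print("rejected:", exc)
```

The built-in str.format() does not know about "S", so this only works through the custom formatter object.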
On Fri, Aug 24, 2012 at 4:03 PM, Paul Moore <p.f.moore@gmail.com> wrote:
On 24 August 2012 20:21, Daniel Holth <dholth@gmail.com> wrote:
I was merely surprised by the implicit bytes to "b'string'" conversion, and would like to be able to turn it off.
The conversion is not really "implicit". It's precisely what the %s (or {!s}) conversion format *explicitly* requests - insert the str() of the supplied argument at this point in the output string. See library reference 6.1.3 "Format String Syntax" (I don't know if there's an equivalent description for % formatting).
If you want to force an argument to be a string, you could always do something like this:
    def must_be_str(s):
        if isinstance(s, str):
            return s
        raise ValueError
    x = "The value is {}".format(must_be_str(s))
There's no "only insert a string here, raise an error for other types" format specifier, largely because formatting is in principle about *formatting* - converting other types to strings. In practice, most of my uses of formatting (and I suspect many other people's) is more about interpolation - inserting chunks of text into templates. For that application, a stricter form could be more useful, I guess.
I could see value in a {!S} conversion specifier (in the terminology of library reference 6.1.3 "Format String Syntax") which overrode __format__ with a conversion function equivalent to must_be_str above. But I don't know if it would get much use (anyone careful enough to use it is probably careful enough of their types to not need it).
Also, is it *really* what you want? Did your code accidentally pass bytes to a {!s} formatter, and yet *never* pass a number and get the right result? Or conversely, would you be willing to audit all your conversions to be sure that numbers were never passed, and yet *still* not be willing to ensure you have no bytes/str confusion? (Although as your use case was encode/decode dances, maybe bytes really are sufficiently special in your code - but I'd argue that needing to address this issue implies that you have some fairly subtle bugs in your encoding process that you should be fixing before worrying about this).
Hi Paul! You could probably guess that this is the wheel digital signatures package. All the string formatting arguments (I hope) are now passed to binary() or native() string conversion functions that do less on Python 2.7 than on Python 3.
Yes, I would be willing to audit my code to ensure that numbers were never passed. I am already calling .encode() and .decode() on most objects in this pipeline. In my opinion int-when-usually-str is in most cases as likely to be a bug as getting bytes() when you expect str(). Python even has the -bb argument to help with this thing that is almost never the right thing to do. How often does anyone who is not writing a REPL ever expect "%s" % bytes() to produce b''?
In this particular case I could also make my life a lot easier by extending the JSON serializer to accept bytes(), but I suppose I would lose the string formatting operations.
On 24 August 2012 21:21, Daniel Holth <dholth@gmail.com> wrote:
Hi Paul! You could probably guess that this is the wheel digital signatures package. All the string formatting arguments (I hope) are now passed to binary() or native() string conversion functions that do less on Python 2.7 than on Python 3.
One point that this raises. Any such "string-only" format spec would only be available in Python 3.4+, and almost certainly only in format(). So if you're interested in something that works across Python 2 and 3, you wouldn't be able to use it anyway (and something like the must_be_str function is probably your best bet). On the other hand, if you're targeting 3.4+ only, the bytes/string code is probably cleaner (that being a lot of the point of the Python 3 exercise :-)) and so the need for a string-only spec may be a lot less.
I dunno. I haven't hit a lot of encoding type issues myself, so I don't have much background in what might help. OTOH, what I *have* found is that the change in thinking that Python 3's approach pushes onto me (encode/decode at the edges and use str consistently internally, plus never gloss over the fact that you have to know an encoding to convert bytes <-> str) fixes a lot of "problems" I thought I was having...
Paul.
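The "decode at the edges" advice might look like this for a digest field. This is a sketch under assumptions: the function name `sha256_field` and the choice of unpadded URL-safe base64 are made up for illustration and are not taken from the code discussed in the thread.

```python
import base64
import hashlib


def sha256_field(payload: bytes) -> str:
    """Return a 'sha256=<digest>' text field for a bytes payload."""
    digest = hashlib.sha256(payload).digest()       # bytes
    encoded = base64.urlsafe_b64encode(digest)      # still bytes
    # Decode explicitly at the boundary; base64 output is pure ASCII.
    return "sha256=" + encoded.rstrip(b"=").decode("ascii")


field = sha256_field(b"hello")
print(field)
```

Because the bytes value is joined with concatenation only after an explicit decode, forgetting the decode raises TypeError instead of producing a b'...' repr.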
On Aug 24, 2012 6:17 PM, "Paul Moore" <p.f.moore@gmail.com> wrote:
On 24 August 2012 21:21, Daniel Holth <dholth@gmail.com> wrote:
Hi Paul! You could probably guess that this is the wheel digital signatures package. All the string formatting arguments (I hope) are now passed to binary() or native() string conversion functions that do less on Python 2.7 than on Python 3.
One point that this raises. Any such "string-only" format spec would only be available in Python 3.4+, and almost certainly only in format(). So if you're interested in something that works across Python 2 and 3, you wouldn't be able to use it anyway (and something like the must_be_str function is probably your best bet). On the other hand, if you're targeting 3.4+ only, the bytes/string code is probably cleaner (that being a lot of the point of the Python 3 exercise :-)) and so the need for a string-only spec may be a lot less.
I dunno. I haven't hit a lot of encoding type issues myself, so I don't have much background in what might help. OTOH, what I *have* found is that the change in thinking that Python 3's approach pushes onto me (encode/decode at the edges and use str consistently internally, plus never gloss over the fact that you have to know an encoding to convert bytes <-> str) fixes a lot of "problems" I thought I was having...
That's the core of it. You can convert bytes to string without knowing the encoding. "%s" % bytes. But instead of failing or converting from ascii it does something totally useless. I argue that this is a bug, and an alternative 'anything except bytes' should be available. Not so hot on the competing only-str idea.
A couple of people at PyCon Au mentioned running into this kind of issue with Python 3. It relates to the fact that:
1. String formatting is *coercive* by default
2. Absolutely everything, including bytes objects can be coerced to a string, due to the repr() fallback
So it's relatively easy to miss a decode or encode operation, and end up interpolating an unwanted "b" prefix and some quotes.
For existing versions, I think the easiest answer is to craft a regex that matches bytes object repr's and advise people to check that it *doesn’t* match their formatted strings in their unit tests.
For 3.4+ a non-coercive string interpolation format code may be desirable.
Cheers, Nick.
-- Sent from my phone, thus the relative brevity :)
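The regex check suggested for existing versions could be sketched roughly as follows. The pattern and the names `BYTES_REPR` and `contains_bytes_repr` are illustrative guesses, not something proposed verbatim in the thread, and the pattern may produce false positives on text that legitimately contains b'...'.

```python
import re

# Matches text that looks like a bytes repr: b'...' or b"...".
# The leading \b avoids matching a "b" in the middle of a word.
BYTES_REPR = re.compile(r"""\bb(['"])(?:\\.|(?!\1).)*\1""")


def contains_bytes_repr(text):
    """Return True if text appears to contain an interpolated bytes repr."""
    return BYTES_REPR.search(text) is not None


print(contains_bytes_repr("sha256=%s" % (b"abcdef1234",)))
print(contains_bytes_repr("sha256=%s" % ("abcdef1234",)))
```

In a unit test this would typically appear as an assertion that `contains_bytes_repr(output)` is false for every formatted string the code produces.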
On 25/08/2012 00:12, Daniel Holth wrote:
On Aug 24, 2012 6:17 PM, "Paul Moore" <p.f.moore@gmail.com> wrote:
On 24 August 2012 21:21, Daniel Holth <dholth@gmail.com> wrote:
Hi Paul! You could probably guess that this is the wheel digital signatures package. All the string formatting arguments (I hope) are now passed to binary() or native() string conversion functions that do less on Python 2.7 than on Python 3.
One point that this raises. Any such "string-only" format spec would only be available in Python 3.4+, and almost certainly only in format(). So if you're interested in something that works across Python 2 and 3, you wouldn't be able to use it anyway (and something like the must_be_str function is probably your best bet). On the other hand, if you're targeting 3.4+ only, the bytes/string code is probably cleaner (that being a lot of the point of the Python 3 exercise :-)) and so the need for a string-only spec may be a lot less.
I dunno. I haven't hit a lot of encoding type issues myself, so I don't have much background in what might help. OTOH, what I *have* found is that the change in thinking that Python 3's approach pushes onto me (encode/decode at the edges and use str consistently internally, plus never gloss over the fact that you have to know an encoding to convert bytes <-> str) fixes a lot of "problems" I thought I was having...
That's the core of it. You can convert bytes to string without knowing the encoding. "%s" % bytes. But instead of failing or converting from ascii it does something totally useless. I argue that this is a bug, and an alternative 'anything except bytes' should be available. Not so hot on the competing only-str idea.
"Totally useless"? Is it any more "useless" than what happens to lists, dicts, sets, etc?
On Aug 25, 2012, at 09:16 AM, Nick Coghlan wrote:
A couple of people at PyCon Au mentioned running into this kind of issue with Python 3. It relates to the fact that:
1. String formatting is *coercive* by default
2. Absolutely everything, including bytes objects can be coerced to a string, due to the repr() fallback
So it's relatively easy to miss a decode or encode operation, and end up interpolating an unwanted "b" prefix and some quotes.
For existing versions, I think the easiest answer is to craft a regex that matches bytes object repr's and advise people to check that it *doesn’t* match their formatted strings in their unit tests.
For 3.4+ a non-coercive string interpolation format code may be desirable.
Or maybe just one that calls __str__ without a __repr__ fallback?
FWIW, the representation of bytes with the leading b'' does cause problems when trying to write doctests that work in both Python 2 and 3.
http://www.wefearchange.org/2012/01/python-3-porting-fun-redux.html
It might be a bit nicer to be able to write:
    print('{:S}'.format(somebytes))
Of course, in the bytes case, its __str__() would have to be rewritten to not call its __repr__() explicitly. It's probably not worth it just to save from writing a small helper function, but it would be useful in eliminating a surprising gotcha.
The other option is of course just to make doctests smarter[*].
Cheers, -Barry
[*] Doctest haters need not respond snarkily. :)
On 08/27/2012 04:34 PM, Barry Warsaw wrote:
On Aug 25, 2012, at 09:16 AM, Nick Coghlan wrote:
A couple of people at PyCon Au mentioned running into this kind of issue with Python 3. It relates to the fact that:
1. String formatting is *coercive* by default
2. Absolutely everything, including bytes objects can be coerced to a string, due to the repr() fallback
So it's relatively easy to miss a decode or encode operation, and end up interpolating an unwanted "b" prefix and some quotes.
For existing versions, I think the easiest answer is to craft a regex that matches bytes object repr's and advise people to check that it *doesn’t* match their formatted strings in their unit tests.
For 3.4+ a non-coercive string interpolation format code may be desirable.
Or maybe just one that calls __str__ without a __repr__ fallback?
    >>> b'a'.__str__()
    "b'a'"
__str__ still returns the bytes literal representation.
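The point can be confirmed in a few lines, along with the explicit alternatives that do produce plain text. A minimal sketch, with the ASCII encoding chosen purely for illustration:

```python
data = b"a"

# bytes.__str__ returns the same text as repr(), prefix and quotes included.
as_str = data.__str__()
print(as_str == repr(data))

# Supplying an encoding makes str() decode instead of falling back to repr.
decoded = str(data, "ascii")
print(decoded)

# bytes.decode() is the more common spelling of the same operation.
print(data.decode("ascii") == decoded)
```

So a format code that merely skipped the __repr__ fallback would not help for bytes; an encoding has to be supplied somewhere.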
On Wed, Aug 29, 2012 at 6:30 PM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
On 08/27/2012 04:34 PM, Barry Warsaw wrote:
On Aug 25, 2012, at 09:16 AM, Nick Coghlan wrote:
A couple of people at PyCon Au mentioned running into this kind of issue with Python 3. It relates to the fact that:
1. String formatting is *coercive* by default
2. Absolutely everything, including bytes objects can be coerced to a string, due to the repr() fallback
So it's relatively easy to miss a decode or encode operation, and end up interpolating an unwanted "b" prefix and some quotes.
For existing versions, I think the easiest answer is to craft a regex that matches bytes object repr's and advise people to check that it *doesn’t* match their formatted strings in their unit tests.
For 3.4+ a non-coercive string interpolation format code may be desirable.
Or maybe just one that calls __str__ without a __repr__ fallback?
    >>> b'a'.__str__()
    "b'a'"
__str__ still returns the bytes literal representation.
This is now a patch at http://bugs.python.org/issue18373. The user can call sys.getbyteswarning() and sys.setbyteswarning(integer) to control whether str(bytes) warns in the current thread, but you also have to adjust the warnings module for it to be useful.
participants (8)
- Antoine Pitrou
- Barry Warsaw
- Daniel Holth
- Mathias Panzenböck
- MRAB
- Nick Coghlan
- Paul Moore
- Steven D'Aprano