python3 -bb and hash collisions

set([u"foo", b"foo"]) will error because the two kinds of string have the same hash, and this causes a comparison. Is that correct?

On 18/06/2019 18.32, Daniel Holth wrote:
set([u"foo", b"foo"]) will error because the two kinds of string have the same hash, and this causes a comparison. Is that correct?
Yes, it will fail with -bb, because it turns comparison between str and bytes into an error. This can also happen with other strings when hash(u'somestring') & mask == hash(b'otherbytes') & mask. The mask of a set starts at PySet_MINSIZE - 1 == 7 and increases over time as the set grows. Christian
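
A minimal sketch of the failure described above, assuming CPython (where an ASCII-only str and the equivalent bytes hash to the same value). The helper name run_with_bb is made up for this illustration; it starts a child interpreter with -bb so that the current process is unaffected.

```python
import subprocess
import sys


def run_with_bb(code):
    """Run `code` in a child interpreter started with -bb.

    Returns (returncode, stderr). The helper name is hypothetical.
    """
    result = subprocess.run(
        [sys.executable, "-bb", "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# A direct str/bytes comparison is turned into an error by -bb.
direct_rc, direct_err = run_with_bb("a = u'foo'; b = b'foo'; a == b")

# Building the set performs the same comparison when the hashes collide.
set_rc, set_err = run_with_bb("set([u'foo', b'foo'])")

if __name__ == "__main__":
    for label, rc, err in (("compare", direct_rc, direct_err),
                           ("set", set_rc, set_err)):
        last_line = err.strip().splitlines()[-1] if err.strip() else ""
        print(label, rc, last_line)
```

Without -bb (or -b) the same set is built silently and ends up with two distinct elements.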

Thanks. I think I might like an option to disable str(bytes) without disabling str != bytes. Unless the second operation would also corrupt output. Came across this kind of set in the hyper http library which uses a set to accept certain headers with either str or bytes keys.

On 22.06.2019 1:08, Daniel Holth wrote:
Thanks. I think I might like an option to disable str(bytes) without disabling str != bytes. Unless the second operation would also corrupt output.
You can't compare str to bytes without knowing the encoding the bytes are supposed to be in (see https://stackoverflow.com/questions/49991870/python-default-string-encoding for details). And if you do know the encoding, you can as well compare `str.encode(encoding) != bytes` / `str != bytes.decode(encoding)`.
-- Regards, Ivan
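
A short sketch of the explicit comparison suggested above. The function name same_text and the sample values are only illustrative; the point is that the caller has to name the encoding.

```python
def same_text(text, data, encoding):
    """Compare a str with bytes by encoding the str explicitly."""
    return text.encode(encoding) == data


if __name__ == "__main__":
    print(same_text("foo", b"foo", "utf-8"))
    # The same character maps to different bytes in different encodings.
    print(same_text("\u00e9", b"\xe9", "latin-1"))
    print(same_text("\u00e9", b"\xe9", "utf-8"))
```

Encoding the str rather than decoding the bytes avoids a UnicodeDecodeError when the bytes are not valid in the chosen encoding. Encoding can still fail if the text contains characters the encoding cannot represent.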

22.06.19 01:08, Daniel Holth wrote:
Does that library support Python 2? If it does, then you have a problem, because u'abc' == b'abc' in Python 2 and u'abc' != b'abc' in Python 3. If it is Python 3 only, you can just ignore BytesWarning. It was added purely to help catch subtle bugs in the transition to Python 3. In the future, after Python 2 is out of use, BytesWarning will become deprecated.

On Sat, Jun 22, 2019 at 2:48 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
I stopped using Python 3 after learning about str(bytes) by finding it in my corrupted database. Ever since then I've been anxious about changing to the new language, since it makes it so easy to convert from bytes to unicode by accident without specifying a valid encoding. So I would like to see a future where str(bytes) is effectively removed. I started working on a pull request that adds an API to toggle str(bytes) at runtime with a thread local (instead of requiring a command line argument), so you could write `with no_str_bytes():` if you were worried about the feature, but I got a bit stuck in the internals.
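
This is not the thread-local API from the pull request mentioned above, which is not shown in this thread. As a rough approximation using only the existing warnings machinery, the sketch below borrows the name no_str_bytes for a hypothetical context manager that escalates BytesWarning to an exception inside a block. It only has an effect when the interpreter was started with -b, because otherwise str(bytes) emits no warning at all.

```python
import contextlib
import sys
import warnings


@contextlib.contextmanager
def no_str_bytes():
    """Turn BytesWarning into an exception inside the block.

    Hypothetical helper; requires `python3 -b` for str(bytes) to warn.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", BytesWarning)
        yield


def describe(payload):
    """Stringify `payload`, refusing bytes when run under -b."""
    with no_str_bytes():
        return str(payload)


if __name__ == "__main__":
    if not sys.flags.bytes_warning:
        print("started without -b: str(bytes) emits no warning to escalate")
    try:
        print(describe(b"payload"))
    except BytesWarning as exc:
        print("caught:", exc)
```

One caveat: warnings.catch_warnings() changes process-wide state and has traditionally been documented as not thread-safe, so this is a debugging aid rather than the per-thread toggle described in the message.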

On Wed, Sep 11, 2019 at 12:47 AM Daniel Holth <dholth@gmail.com> wrote:
Python has, for as long as I've known it, permitted you to call str() on literally any object - if there's no other string form, you get its repr. Breaking this would break all manner of debugging techniques. ChrisA

On Tue, Sep 10, 2019 at 10:42:52AM -0400, Daniel Holth wrote:
How is this different than all the str -> unicode bugs we had in python2? If you have special needs, you can always monkey-patch it in plain python code by overriding __builtins__.str with something that asserts the given arg is not bytes. m -- Matt Billenstein matt@vazor.com http://www.vazor.com/

On Tue, Sep 10, 2019 at 8:38 PM Matt Billenstein <matt@vazor.com> wrote:
It's different. One hint is that there's already an option to disable the feature. The old style of error will occasionally reveal itself with decode errors, but the new style of error happens silently: you discover it somehow, then enable the -bb option, track down the source of the error, and deal with the fallout. The proposed change would allow `print(bytes)` for (de)bugging by letting you toggle `python3 -bb` behavior at runtime instead of only at the command line. Or you could debug more explicitly by `print(bytes.decode('ebcdic'))` or `print(repr(bytes))`. I didn't realize you could override __builtins__.str. That's interesting.
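
The two explicit debugging forms mentioned above can be sketched as follows. The sample data is invented, and cp500 is used here as one of the EBCDIC codecs that ship with Python.

```python
# EBCDIC-encoded sample bytes, created here only for illustration.
data = "Hello".encode("cp500")

# repr() is always safe: it shows the bytes literal without guessing an encoding.
as_repr = repr(data)

# decode() is explicit: the caller states which encoding the bytes are in.
as_text = data.decode("cp500")

if __name__ == "__main__":
    print(as_repr)
    print(as_text)
```

Neither form goes through str(bytes), so neither triggers BytesWarning under -b or -bb.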

On 11.09.19 at 15:34, Daniel Holth wrote:
Being able to call str() on everything is such a fundamental assumption that changing the behavior of str(bytes) would break Python. Porting from Python 2 to Python 3 is a big task and especially the str/unicode/bytes handling needs extra care, and this is one of those corner cases that might prove problematic when porting. That doesn't justify breaking Python, especially not for those users that have decided to port to Python 3 in a timely manner. - Sebastian

On Wed, Sep 11, 2019 at 09:34:07AM -0400, Daniel Holth wrote:
I didn't realize you could override __builtins__.str. That's interesting.
Don't touch __builtins__; that's a CPython implementation detail. The public API is to ``import builtins`` and use that. This override technique is called monkey-patching. It's permitted but considered a fairly dubious thing to do in production code, since it risks breaking other libraries or even parts of your own code which rely on str(b'') working. It may be better to isolate the monkey-patch to the module (hopefully there is only one!) that needs it, by a simple global that shadows the built-in:

    import builtins

    def str(obj):
        assert not isinstance(obj, bytes)
        return builtins.str(obj)

instead of putting it into builtins itself. -- Steven

On Fri., 13 Sep. 2019, 7:21 am Steven D'Aprano, <steve@pearwood.info> wrote:
In a lot of cases like this, the problems aren't with directly calling str(), but calling third party APIs that indirectly call str(). One nice thing a debugging str() monkey patch can do that the -bb command line switch can't is be selective in when it fails, by inspecting the frame stack for the modules of interest before throwing an exception. Cheers, Nick.
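
A sketch of the selective check described above, under stated assumptions: make_selective_str and the module names are hypothetical, and the frame walk uses sys._getframe, which is a CPython implementation detail. The returned function is meant to be used as a module-level shadow of str while debugging, not installed into builtins.

```python
import builtins
import sys


def make_selective_str(watched_modules):
    """Build a str() replacement that rejects bare bytes only when the
    call stack passes through one of `watched_modules` (hypothetical helper).
    """
    watched = frozenset(watched_modules)

    def selective_str(obj="", *args, **kwargs):
        # str(b"...", encoding) is an explicit decode, so only the
        # one-argument form is treated as suspicious.
        if isinstance(obj, bytes) and not args and not kwargs:
            frame = sys._getframe(1)  # start at the caller's frame
            while frame is not None:
                if frame.f_globals.get("__name__") in watched:
                    raise TypeError("str() called on bytes without an encoding")
                frame = frame.f_back
        return builtins.str(obj, *args, **kwargs)

    return selective_str


if __name__ == "__main__":
    # Watch the current module; any other module name is ignored.
    str_checked = make_selective_str({__name__})
    print(str_checked(b"abc", "utf-8"))
    try:
        str_checked(b"abc")
    except TypeError as exc:
        print("rejected:", exc)
```

Because the whole stack is walked, an indirect call made by a third-party library on behalf of a watched module is caught as well, which is the case the -bb switch cannot single out.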

On Fri, Sep 13, 2019 at 08:37:26AM +1000, Cameron Simpson wrote:
Not the OP, but I've actually seen something like this happen in postgres, but it's postgres doing the adaptation of bytea into a text column, not python str afaict:
We were storing the response of an api request from requests and had grabbed response.content (bytes) instead of response.text (str). I was still able to decode the original data from this bytes representation, so not ideal, but no data lost. I did wish this sorta thing had raised an error instead of doing what it did. m -- Matt Billenstein matt@vazor.com http://www.vazor.com/

On 13Sep2019 09:31, Matt Billenstein <matt@vazor.com> wrote:
Aye. Somewhere there's some Python taking the b'' and accepting it for the notes= parameter, presumably in the postgres dbapi code. That isn't a Python language bug to my eye. It could be some careless 2->3 adaption I guess. I suspect it isn't postgres itself (or its C library) mangling things, it would be accepting a C string or character buffer. Still, I can see how this can quietly leak mojibake into your database. Thanks, Cameron Simpson <cs@cskk.id.au>

participants (11)
- Cameron Simpson
- Chris Angelico
- Christian Heimes
- Daniel Holth
- Eric V. Smith
- Ivan Pozdeev
- Matt Billenstein
- Nick Coghlan
- Sebastian Rittau
- Serhiy Storchaka
- Steven D'Aprano