[Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer)

Sat Sep 29 05:08:06 CEST 2007

On 9/28/07, Terry Reedy <tjreedy at udel.edu> wrote:
> "Guido van Rossum" <guido at python.org> wrote in message
> news:ca471dc20709281140q2ef95c2ap8bbc7b7d3d46ebc0 at mail.gmail.com...
> |
> | Well, if we wanted "x" and b"x" to compare unequal instead of raising
> | an exception, we could just define it that way (it was that way until
> | just before 3.0a1). But we're explicitly defining it to raise a
> | TypeError so as to catch buggy code. I think trying to fix dict lookup
> | so that it, and only it, treats this as unequal, would be adding too
> | many quirks.
> |
> | We could choose to kill the TypeError altogether. If we keep it, we
> | should consistently let it raise TypeError everywhere.
> |
> | The question is whether it's worth the effort to raise TypeError when
> | the *potential* exists that a certain hash sequence *could* raise this
> | TypeError. I'm less and less convinced -- after all, we're making the
> | exception only for bytes/str, not for other types that might raise
> | TypeError upon comparison.
> |
> | So, I think that after all this was a bad idea. Sorry.
>
> If you mean making a special case exception for string/bytes equality test,
> I agree.  Would a restricted key dict (say, rdict, in collections) solve
> the problem you are aiming at?
>
> import collections
> adict = collections.rdict(str)
> bdict = collections.rdict(bytes)
>
> Now any buggy insertions get caught.

That sounds like a completely different use case -- a typechecking dict.

The use case we started with is to catch programmers who accidentally
mix str and bytes as dict keys -- those programmers aren't likely to
have thought much about their key type, so they're not likely to go
out of their way to use the rdict you propose above.
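
For reference, a typechecking dict of the kind suggested above could be
sketched roughly as follows. This is only an illustration: rdict does not
exist in collections, and the key_type attribute and the error message are
assumptions made for the sketch.

```python
class rdict(dict):
    """Hypothetical restricted-key dict: only accepts keys of one type."""

    def __init__(self, key_type):
        super().__init__()
        self.key_type = key_type  # assumed attribute name, for illustration

    def __setitem__(self, key, value):
        # Reject keys of the wrong type at insertion time.
        if not isinstance(key, self.key_type):
            raise TypeError(
                "key must be %s, not %s"
                % (self.key_type.__name__, type(key).__name__)
            )
        super().__setitem__(key, value)


adict = rdict(str)
adict["x"] = 1            # accepted: the key is a str
try:
    adict[b"x"] = 2       # rejected: the key is bytes
except TypeError as exc:
    rejected = str(exc)
```

The sketch only guards item assignment; dict methods such as update() and
setdefault() bypass __setitem__ on a dict subclass, so a real version would
have to cover those as well.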

But here's a clever trick that might just do the job, without any
extra effort: make it so that the hash() of a bytes string containing
only ASCII bytes is the same as that of a text string containing only
ASCII characters. Likely, programmers will attempt to look up keys
that they know are in the dict -- and if they use the wrong type,
because of the identical hash values, they will get the TypeError as
soon as they compare it to the first object at the hashed location.
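
The mechanism can be sketched with stand-in classes. StrictBytes and
OtherHashBytes below are hypothetical; they only model a bytes-like type whose
== raises TypeError against str, since the real bytes type may not behave this
way. The point is that a dict lookup compares keys only when their hashes
match, so an identical hash is what exposes the mistake.

```python
class StrictBytes:
    """Hypothetical bytes-like key whose == refuses to compare with str."""

    def __init__(self, data):
        self.data = data

    def __hash__(self):
        # Assumption for the sketch: ASCII data hashes like the equal str.
        return hash(self.data.decode("ascii"))

    def __eq__(self, other):
        if isinstance(other, str):
            raise TypeError("cannot compare bytes and str")
        if isinstance(other, StrictBytes):
            return self.data == other.data
        return NotImplemented


class OtherHashBytes(StrictBytes):
    """Same comparison rule, but a hash unrelated to the str hash."""

    def __hash__(self):
        return hash(("bytes-tag", self.data))


table = {"spam": 1}


def lookup(key):
    # Report which exception, if any, the dict lookup produces.
    try:
        return table[key]
    except TypeError:
        return "TypeError"
    except KeyError:
        return "KeyError"


same_hash_result = lookup(StrictBytes(b"spam"))      # hashes match, == is called
other_hash_result = lookup(OtherHashBytes(b"spam"))  # hashes differ, == is skipped
```

With matching hashes the lookup reaches the comparison and the TypeError
surfaces; with an unrelated hash the same mistake is reported only as a
missing key.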

Even better, in the proposal we'll be reusing the old PyString type
for the new immutable bytes type, and its hash *already* is equal to
that of a PyUnicode object if they both contain the same ASCII bytes
only. (This used to be by design in 2.x, and I maintained this
property when I made PyUnicode's hash a lot faster.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)