[Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer)

Sat Sep 29 17:14:04 CEST 2007

At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote:
>Until just before 3.0a1, they were unequal. We decided to raise
>TypeError because we noticed many bugs in code that was doing things
>like
>
>   data = f.read(4096)
>   if data == "": break

Thought experiment: what if read() always returned strings, and to 
read bytes, you had to use something like 'f.readinto(ob, 4096)', 
where 'ob' is a mutable bytes instance or memory view?

In Python 2.x, there's only one read() method because, prior to 
unicode, there was only one type of reading to do.

But as the above example makes clear, in 3.x you simply *can't* write 
code that works correctly with an arbitrary file that might be binary 
or text, at least not without typechecking the return value from 
read().  (In which case, you might as well inspect the file 
object.)  So, the above problem could be fixed by having .read() 
raise an error (or simply not exist) on a binary file object.

In this way, the problem is fixed at the point where it really 
occurs: i.e., at the point of not having decided whether the stream 
is bytes or text.
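
A rough sketch of what the readinto() style of reading looks like, for
illustration only.  It uses io.BytesIO as a stand-in for a binary file and
the one-argument readinto(buffer) call from the io module; the two-argument
form 'f.readinto(ob, 4096)' above is just the shape suggested in the thought
experiment, not an existing signature.  The names 'buf' and 'chunks' are
made up for the example.

```python
import io

# io.BytesIO stands in for a file opened in binary mode.
f = io.BytesIO(b"abc" * 3000)      # 9000 bytes of sample data

buf = bytearray(4096)              # caller-owned, mutable byte buffer
chunks = []
while True:
    n = f.readinto(buf)            # returns the number of bytes stored in buf
    if n == 0:                     # end of stream: an int test, no str/bytes comparison
        break
    chunks.append(bytes(buf[:n]))  # copy out only the part that was filled

data = b"".join(chunks)
```

The end-of-file test here is on an integer count, so the question of
comparing the result against "" never comes up.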

This also seems to fit better (IMO) with the best practice of 
enforcing str/unicode/encoding distinctions at the point where data 
enters the program, rather than delaying the error to later.

>I thought about using a warning too, but since nobody wants warnings,
>that would be pretty much the same as raising TypeError except for the
>most dedicated individuals (and if I were really dedicated I'd just
>write my own eq() function anyway).

The use case I'm concerned about is code that's not type-specific 
getting a TypeError by comparing arbitrary objects.  For example, if 
you write Python code to create a Python code object (e.g. the 
compiler package or my own BytecodeAssembler), you need to create a 
list of constants as you generate the code, and you need to be able 
to search the list for an equal constant.  Since strings and bytes 
can both be constants, a simple list.index() test could now raise a 
TypeError, as could "item in list".

So raising an error to make bad code fail sooner will also take down 
unsuspecting code that isn't really broken, and *force* the writing 
of special comparison code -- which won't be usable with things like 
list.remove and the "in" operator.
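
To make the constant-list case concrete, here is a minimal sketch.
StrictBytes is a hypothetical class, written only to mimic a bytes type
whose == raises TypeError against str; naive_index and typed_index are
likewise invented names, and none of this is taken from the compiler
package or BytecodeAssembler.

```python
class StrictBytes(bytes):
    """Hypothetical stand-in: a bytes type whose == raises against str."""

    def __eq__(self, other):
        if isinstance(other, str):
            raise TypeError("can't compare bytes to str")
        return bytes.__eq__(self, other)

    __hash__ = bytes.__hash__  # defining __eq__ would otherwise disable hashing


# A constant list holding an int, a bytes constant and a text constant.
consts = [1, StrictBytes(b"abc"), "abc"]


def naive_index(consts, value):
    """Plain list.index(); any comparison error propagates to the caller."""
    return consts.index(value)


def typed_index(consts, value):
    """Special-cased search: only compare constants of the same type."""
    for i, const in enumerate(consts):
        if type(const) is type(value) and const == value:
            return i
    raise ValueError(f"{value!r} is not in the constant list")


# The generic search trips over the bytes constant before reaching "abc".
try:
    naive_index(consts, "abc")
    naive_failed = False
except TypeError:
    naive_failed = True

# The hand-written search finds it, but cannot be plugged into
# list.remove or the "in" operator.
position = typed_index(consts, "abc")
```

The search for the text constant "abc" fails in naive_index even though the
list contains it, because the bytes constant is compared first.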

In comparison, forcing code to be bytes vs. text aware at the point 
of I/O directs attention to the place where you can best decide what 
to do about it.  (After all, the comparison that raises the TypeError 
might occur deep in a library that's expecting to work with text.)

>And the warning would do nothing
>about the issue brought up by Jim Jewett, the unpredictable behavior
>of a dict with both bytes and strings as keys.

I've looked at all of Jim's messages for September, but I don't see 
this.  I do see where raising TypeError for comparisons causes a 
problem with dictionaries, but I don't see how an unequal comparison 
creates "unpredictable" behavior (as opposed to predictable failure to match).
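
A small sketch of what I mean by "predictable failure to match".  It
assumes an interpreter where comparing bytes to str evaluates to False
instead of raising (released versions of Python 3 behave this way when run
without the -b option); the dict contents are made up for the example.

```python
# Assumes bytes == str evaluates to False instead of raising.
table = {b"key": "bytes entry", "key": "text entry"}

entries = len(table)                 # the two keys never collapse into one
text_hit = table["key"]              # a text key finds only the text entry
bytes_hit = table[b"key"]            # a bytes key finds only the bytes entry
miss = table.get(b"other-key")       # an absent key simply fails to match
cross_equal = ("key" == b"key")      # the comparison itself is just unequal
```

Each lookup either matches a key of its own type or misses; nothing depends
on which other keys happen to be in the dict.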