[Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer)
Phillip J. Eby
pje at telecommunity.com
Sat Sep 29 17:14:04 CEST 2007
At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote:
>Until just before 3.0a1, they were unequal. We decided to raise
>TypeError because we noticed many bugs in code that was doing things
>like
>
> data = f.read(4096)
> if data == "": break
Thought experiment: what if read() always returned strings, and to
read bytes, you had to use something like 'f.readinto(ob, 4096)',
where 'ob' is a mutable bytes instance or memory view?
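For illustration only, here is roughly what a readinto()-style loop looks like with the io module as it exists today. Note that the real readinto() takes just the buffer (the two-argument form above is part of the thought experiment), and read_chunks is a made-up helper name:

```python
import io

def read_chunks(binary_file, size=4096):
    """Yield successive chunks from a binary stream using readinto()."""
    buf = bytearray(size)      # mutable buffer, reused on every iteration
    view = memoryview(buf)     # lets us slice without copying
    while True:
        n = binary_file.readinto(view)
        if not n:              # 0 at EOF (None on a non-blocking raw stream)
            break
        yield bytes(view[:n])  # copy out only the bytes actually read

chunks = list(read_chunks(io.BytesIO(b"x" * 5000)))
```

The end-of-stream test here is a byte count, so there is no empty-string comparison to get wrong.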
In Python 2.x, there's only one read() method because, prior to
unicode, there was only one kind of reading to do.
But as the above example makes clear, in 3.x you simply *can't* write
code that works correctly with an arbitrary file that might be binary
or text, at least not without typechecking the return value from
read(). (In which case, you might as well inspect the file
object.) So, the above problem could be fixed by having .read()
raise an error (or simply not exist) on a binary file object.
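A rough sketch of what "inspect the file object" could look like. It assumes the stream derives from the io base classes (a file-like object that doesn't would be misclassified), and both function names are invented for the example:

```python
import io

def is_text_stream(f):
    """Assumption: text streams derive from io.TextIOBase."""
    return isinstance(f, io.TextIOBase)

def read_all_chunks(f, size=4096):
    # Pick the matching "empty" sentinel once, up front, instead of
    # typechecking every value that read() returns.
    empty = "" if is_text_stream(f) else b""
    chunks = []
    while True:
        data = f.read(size)
        if data == empty:
            break
        chunks.append(data)
    return empty.join(chunks)
```

Either way the generic code has to branch on bytes vs. text somewhere, which is the point: the decision can't be avoided, only moved.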
In this way, the problem is fixed at the point where it really
occurs: i.e., at the point of not having decided whether the stream
is bytes or text.
This also seems to fit better (IMO) with the best practice of
enforcing str/unicode/encoding distinctions at the point where data
enters the program, rather than delaying the error to later.
>I thought about using warning too, but since nobody wants warnings,
>that would be pretty much the same as raising TypeError except for the
>most dedicated individuals (and if I were really dedicated I'd just
>write my own eq() function anyway).
The use case I'm concerned about is code that's not type-specific
getting a TypeError by comparing arbitrary objects. For example, if
you write Python code to create a Python code object (e.g. the
compiler package or my own BytecodeAssembler), you need to create a
list of constants as you generate the code, and you need to be able
to search the list for an equal constant. Since strings and bytes
can both be constants, a simple list.index() test could now raise a
TypeError, as could "item in list".
So raising an error to make bad code fail sooner will also take down
unsuspecting code that isn't really broken, and *force* the writing
of special comparison code -- which won't be usable with things like
list.remove and the "in" operator.
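A minimal sketch of the constant-table case. StrictBytes is a hypothetical stand-in for a bytes type whose == raises TypeError against str (the behaviour under discussion), and find_constant is an illustrative helper, not code taken from the compiler package or BytecodeAssembler:

```python
class StrictBytes(bytes):
    """Hypothetical bytes type: comparing with str raises TypeError."""

    def __eq__(self, other):
        if isinstance(other, str):
            raise TypeError("can't compare bytes with str")
        return bytes.__eq__(self, other)

    __hash__ = bytes.__hash__  # defining __eq__ would otherwise drop it


def find_constant(constants, value):
    """Return the index of an equal constant, appending it if absent."""
    try:
        return constants.index(value)   # type-agnostic equality search
    except ValueError:
        constants.append(value)
        return len(constants) - 1


constants = ["abc", 42]
try:
    find_constant(constants, StrictBytes(b"abc"))
    outcome = "found or appended"
except TypeError:
    outcome = "TypeError"
```

Here find_constant never asked for a str/bytes comparison; it only wanted to know whether an equal constant was already in the list, yet under the raising behaviour it ends up with "TypeError".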
In comparison, forcing code to be bytes vs. text aware at the point
of I/O directs attention to the place where you can best decide what
to do about it. (After all, the comparison that raises the TypeError
might occur deep in a library that's expecting to work with text.)
>And the warning would do nothing
>about the issue brought up by Jim Jewett, the unpredictable behavior
>of a dict with both bytes and strings as keys.
I've looked at all of Jim's messages for September, but I don't see
this. I do see where raising TypeError for comparisons causes a
problem with dictionaries, but I don't see how an unequal comparison
creates "unpredictable" behavior (as opposed to predictable failure to match).
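To make "predictable failure to match" concrete, a small sketch assuming bytes and str simply compare unequal (variable names are just for the example):

```python
table = {"abc": "text key"}

# An unequal comparison means the bytes key is simply not found...
found = b"abc" in table

# ...and storing it adds a second, distinct entry instead of colliding.
table[b"abc"] = "bytes key"
```

After this, found is False and the dict holds two separate entries, one per key type.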