[Python-3000] Thoughts on new I/O library and bytecode

Guido van Rossum guido at python.org
Wed Feb 21 07:33:27 CET 2007


On 2/20/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Guido van Rossum" <guido at python.org> wrote:
> > [Note: changed subject]
> > On 2/20/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > I'm not so sure.  The return type on socket.recv and os.read could be
> > > changed to bytes (seemingly without much difficulty),
> >
> > Yes, that's the plan anyway.
>
> Better than returning unicode, but not as good as returning "binary".

It never was the plan to have this return unicode BTW.

What's the difference between "binary" and "bytes"? To me, bytes *means* binary.

> > > and likely could
> > > even be changed to *take* a bytes object as the destination buffer
> > > (ditto for files opened as 'raw').
> >
> > This already works -- bytes support the buffer API.
>
> I was thinking of...
>
>     buff = bytes(4096*[0])
>     received = sock.recv(buff)
>
> It's really only useful when you have a known protocol with fixed size
> blocks, but need it to run more or less forever.  By fixing the buffer
> size, you can have significantly reduced memory fragmentation.

You can do that already with recv_into(), which takes anything that
supports the writable buffer API.
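
A minimal sketch of the fixed-buffer pattern, assuming a present-day Python 3 in which the mutable buffer type is spelled `bytearray` (a name this thread predates). The `socketpair()` loopback, the 4096-byte size, and the helper name `recv_exactly` are illustrative assumptions, not anything proposed in the thread:

```python
import socket

def recv_exactly(sock, view, count):
    """Fill view[:count] from sock, writing into the caller's buffer."""
    received = 0
    while received < count:
        # recv_into() writes into the writable buffer and returns the byte count.
        n = sock.recv_into(view[received:count])
        if n == 0:
            raise ConnectionError("peer closed before sending enough data")
        received += n
    return received

# One preallocated buffer that can be reused for every block.
buf = bytearray(4096)
view = memoryview(buf)

left, right = socket.socketpair()
try:
    message = b"Header: value\r\n"
    left.sendall(message)
    count = recv_exactly(right, view, len(message))
    payload = bytes(view[:count])  # copy out only the bytes actually received
finally:
    left.close()
    right.close()
```

Slicing the `memoryview` rather than the `bytearray` avoids copying, so each call writes directly into the same allocation.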

> > > Then again, I've been "eh?" on the whole I/O library thing, and
> > > generally annoyed at the "everything is unicode" idea.
> >
> > Well, unless you remove the str type, how are you going to get rid of
> > the endless problems with unicode where mixing unicode and str
> > sometimes works and sometimes doesn't?
>
> Ooh, one of my favorite games!
>
> * Explicit <conversion to unicode> is better than implicit.
> * In the face of ambiguity, refuse the temptation to guess <what codec
> to use to decode the string>.
> * Errors <when adding strings to unicode> should never pass silently.
>
> There are at least two approaches to solving the problem:
> 1) make everything unicode
> 2) make all implicit conversions an error.

The plan is both.
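
As a rough illustration of where "both" ended up, the sketch below runs on released Python 3, where text literals are unicode and mixing them with bytes is an error rather than an implicit conversion. The helper name `try_concat` is hypothetical:

```python
def try_concat(text, data):
    """Return text + data, or the TypeError raised when the types are mixed."""
    try:
        return text + data
    except TypeError as exc:
        return exc

# str + bytes is refused instead of guessing a codec.
mixed = try_concat("Header: ", b"value")

# An explicit decode states the codec and succeeds.
explicit = "Header: " + b"value".decode("ascii")
```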

> Adding strings to unicode should produce an exception.  The fact that it
> doesn't right now, I believe, is a result of implementation details
> getting in the way of what should happen.

No, it was by design to make things more compatible. I think we can
say that was a mistake; but it was done for that reason, not for
reasons of implementation details.

> Remove the ambiguity, codec
> guessing, etc., raise a TypeError("cannot concatenate str and unicode
> objects"), and move on.
>
> Don't allow up-casting in u''.join() or ''.join() (or their equivalents
> in py3k).

So what would you use the str type for?

> > > Converting all
> > > libraries that currently deal with IO is going to be a pain, especially
> > > if it does any sort of parsing of mixed binary and non-unicode textual
> > > data (like http headers combined with binary posted data or a utf-8
> > > encoded stream).
> >
> > Yeah, I'm not looking forward to that, but I expect it'll be
> > relatively straightforward once we figure out the right patterns;
> > there's just a lot of code to convert. But that's the whole Py3k plan.
>
> No offense, but the plan to convert it all to use bytes, stinks.
> Starting with the API defined in PEP 358, I started converting smtpd (as
> an example), and I found myself *wanting* to use unicode because the
> whole numeric constants and/or bytes('unicode', 'latin-1') got really
> old really fast.

Have you actually looked at the Py3k implementation? It's quite
different from that PEP.

But nevertheless, it's a good experiment; I'll have a look at this myself.

> > > As a heavy user of quite a few of the current standard library IO
> > > modules (SocketServer, asyncore, urllib, socket, etc.) and as someone
> > > who has the "opportunity" to write line-level protocols, I'd be quite
> > > happy with the following...
> > >
> > > 1) add bytes (or add features to array)
> > > 2) rename unicode to text (or str)
> > > 3) renaming str to bin (or some other sufficiently clear name)
> >
> > So you'd have THREE types (bytes, text, bin)? Or are you proposing bin
> > instead of bytes, contrary to what you suggested above?
>
> While I would have some personal uses for bytes, all of them could be
> fulfilled with an expanded array type.

Well, that's what it is, but without the baggage of being able to specify
how it maps to Python objects (that's up to the encode/decode operations
instead).

> If I could have my way
> <dreaming>I'd rename string and unicode, fold some of the features of
> bytes into array, and make socket, etc., return the renamed string
> type</dreaming>.

But which of the two renamed string types? The 8-bit or the unicode string?

> In the case of the standard library modules that deal with
> sockets, the only change would generally be replacing 'const' with
> b'const'.  That could *almost* be automatic, and would be significantly
> faster (for a computer + human) than converting all of the .split(),
> .find(), etc., uses in the ftplib, *Server, smtplib, smtpd, etc. to
> bytes equivalents (or converting to and from unicode).

Actually, while they don't exist now, I plan for the bytes type to
have .split() and .find() and most other string methods *except*
.lower() and .islower() and everything else that interprets bytes as
characters.
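
A small sketch of line-level parsing done directly on bytes with .split() and .find(). It assumes released Python 3 rather than the p3yk branch discussed here, and the SMTP-style sample line is invented for illustration:

```python
# Hypothetical protocol line, kept as bytes throughout (no decode step).
line = b"MAIL FROM:<user@example.org> SIZE=1024\r\n"

# Split the command word from the rest of the line.
command, rest = line.split(b" ", 1)

# Whitespace split also drops the trailing CRLF.
fields = rest.split()

# find() returns byte offsets, just as it returns character offsets on text.
colon = rest.find(b":")
address = rest[colon + 1:rest.find(b" ")]
```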

> It would take me perhaps 20 minutes to update asyncore, asynchat and
> smtpd with the b'binary' semantic.  Based on the last list of methods I
> saw for bytes in PEP 358, I would be, more or less, doing bytes.decode
> ('latin-1') instead of trying to deal with the *crippled* interface that
> bytes offers.

So forget that PEP and help add these methods to the bytes type in
the p3yk branch.

The b"..." literal proposal is not unpleasant, as long as we can limit
it to ASCII characters and hex/octal escapes.
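
For illustration, this is how such a restriction behaves in released Python 3 (an assumption about the eventual language, not about the branch at the time of writing). ASCII characters plus hex and octal escapes are accepted, while a non-ASCII character inside the literal is rejected at compile time:

```python
# ASCII text with one hex escape and one octal escape.
ok = b"GET / HTTP/1.0\r\n\xff\101"

# A bytes literal containing a non-ASCII character does not compile.
source = "b'caf\u00e9'"
try:
    compile(source, "<example>", "eval")
except SyntaxError:
    rejected = True
else:
    rejected = False
```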

> Regardless, the performance of those modules would likely suffer when
> confronted with bytes rather than a renamed str, as the current bytes
> type lacks a large number of convenience methods, that I previously
> complained about it not having (which is why I brought up the string
> view and sample implementation in late August/early September 2006).

I think you misunderstood the plans for bytes. The plan is for the
performance with bytes to scream, in part because they are immutable
so one would occasionally save copying a buffer an extra time.

> > > 4) making string literals 'hello' be unicode
> > > 5) allow b'constant' to be the renamed str
> > > 6) add a mandatory 3rd argument to file/open which is the codec to use
> > > for reading
> >
> > And how does that help users or compatibility?
>
> Users who need binary literals (like every socket module in the standard
> library, anyone who does processing of any non-unicode disk/socket/pipe
> data, like marshal or pickle, etc.) wouldn't go insane and add bugs
> trying to switch to the bytes type, or add performance overhead trying
> to convert the received bytes to unicode to get a useful API.

Let's drop the hyperbole.

> > > 7) offer a new function for opening 'binary' files (which are opened as
> > > 'rb' or 'wb' whenever 'r' or 'w' are passed, respectively), which will
> > > remove confusion on Windows platforms
> >
> > This is a red herring. Or I'm not sure I understand this part of your
> > proposal. What's wrong with 'rb'?
>
> Presumption:
>     a = open(filename, 'r' or 'w' ['+'], codec)
> will open a file as unicode in Py3k (if I am wrong, please correct me).

Right.

> Proposal:
>     b = somename(filename, 'r' or 'w' ['+'])
> will be equivalent to:
>     b = open(filename, 'rb' or 'wb' ['+'])
> today.  This prevents the confusion over different argument values
> resulting in different types being returned and accepted by certain
> methods.

Possibly. Though if we keep the 'rb' semantics for open() and this is
just an alias, I'm not sure what we gain except Two Ways To Do It.

In your view, what *do* we gain by using separate factories for binary
and text files? (Except some opportunity for static typechecking, as
binary files don't have the same API!)
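
To make the point about differing APIs concrete, here is a sketch against released Python 3, where a single open() returns different object types depending on the mode. The temporary file and the 'latin-1' codec are arbitrary choices for the example:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"Header: value\r\n")

    # Binary mode: read() returns bytes.
    with open(path, "rb") as binary_file:
        raw = binary_file.read()
        binary_type = type(binary_file).__name__

    # Text mode: read() returns str, decoded with the named codec.
    # newline="" disables newline translation so "\r\n" survives.
    with open(path, "r", encoding="latin-1", newline="") as text_file:
        text = text_file.read()
        text_type = type(text_file).__name__
finally:
    os.remove(path)
```

The mode string alone decides which class, and therefore which set of methods, the caller gets back, which is the typing concern raised above.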

> > > Indeed, it isn't as revolutionary as "everything is unicode", but it
> > > would allow the standard library to be updated with a relative minimum
> > > of fuss and muss, without needing to intermix...
> > >     x = bytes.decode('latin-1').USEFUL_UNICODE_METHOD(...)
> > > or
> > >     sock.send(unicode.encode('latin-1'))
> >
> > Actually, with the renamings and everything, it's just about as
> > disruptive as the current proposal, so I'm unclear why you think this
> > is so different.
>
>     sock.send(b'Header: value\r\n')
>               ^
> The above change can be more or less automatic.  The below?
>
>     sock.send(bytes('Header: value\r\n', 'latin-1'))
>
>     sock.send('Header: value\r\n'.encode('latin-1'))
>
> Either of the above is 17 characters of noise that really shouldn't need
> to be there.

If the spelling of a bytes string with an ASCII character value is all
you are complaining about, you should have said so right away.
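
For reference, the three spellings under discussion build the same value in released Python 3 (again an assumption about the eventual language rather than the state of the branch), so the difference is purely one of notation:

```python
literal = b"Header: value\r\n"
constructed = bytes("Header: value\r\n", "latin-1")
encoded = "Header: value\r\n".encode("latin-1")

# All three are equal bytes objects; only the source text differs.
same = literal == constructed == encoded
```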

IMO the hard part with automatically converting sock.send('abc') to
either alternative is to know when to convert and when not to
convert; the conversion itself is trivial using the sandbox/2to3
refactoring tool. You really should have a look at that.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

