[Python-3000] Thoughts on new I/O library and bytecode

Wed Feb 21 09:22:56 CET 2007

"Guido van Rossum" <guido at python.org> wrote:
> On 2/20/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > Better than returning unicode, but not as good as returning "binary".
> 
> It never was the plan to have this return unicode BTW.
> 
> What's the difference between "binary" and "bytes"? To me, bytes *means* binary.

Bytes as the type defined in PEP 358 and in the p3yk branch.  Binary is
a renamed Python 2.x str.

> > Ooh, one of my favorite games!
> >
> > * Explicit <conversion to unicode> is better than implicit.
> > * In the face of ambiguity, refuse the temptation to guess <what codec
> > to use to decode the string>.
> > * Errors <when adding strings to unicode> should never pass silently.
> >
> > There are at least two approaches to solving the problem:
> > 1) make everything unicode
> > 2) make all implicit conversions an error.
> 
> The plan is both.

Indeed, but this train of thought was more or less along the lines of
'rename str to binary, rename unicode to text, make adding binary and
text raise an exception'.
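
For concreteness, here is a minimal sketch of that rule. It assumes the semantics Python 3 eventually shipped with, where bytes plays the role of "binary" and str the role of "text"; those names are not the ones used in this thread. Mixing the two raises instead of guessing a codec.

```python
# Assumption: Python 3 semantics, with bytes standing in for "binary"
# and str standing in for "text".
binary = b"abc"
text = "def"

try:
    binary + text
    mixed_ok = True
except TypeError:
    # No implicit decoding is attempted; the caller has to pick a codec.
    mixed_ok = False

# The explicit spelling names the codec, so nothing is guessed.
explicit = binary.decode("ascii") + text
print(mixed_ok, explicit)
```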

> > Adding strings to unicode should produce an exception.  The fact that it
> > doesn't right now, I believe, is a result of implementation details
> > getting in the way of what should happen.
> 
> No, it was by design to make things more compatible. I think we can
> say that was a mistake; but it was done for that reason, not for
> reasons of implementation details.

Fair enough.  I didn't start using unicode until Python 2.3.

> > Remove the ambiguity, codec
> > guessing, etc., raise a TypeError("cannot concatenate str and unicode
> > objects"), and move on.
> >
> > Don't allow up-casting in u''.join() or ''.join() (or their equivalents
> > in py3k).
> 
> So what would you use the str type for?

The bytes API as defined in PEP 358 is crap.  Using that API for
anything involving sockets, file IO, marshal/pickle, etc., is worse than
writing in pure C.  But I'll get into how happy I am with that later.

> > No offense, but the plan to convert it all to use bytes, stinks.
> > Starting with the API defined in PEP 358, I started converting smtpd (as
> > an example), and I found myself *wanting* to use unicode because the
> > whole numeric constants and/or bytes('unicode', 'latin-1') got really
> > old really fast.
> 
> Have you actually looked at the Py3k implementation? It's quite
> different from that PEP.

Really?  The source tells me that it's more or less the same:
http://svn.python.org/view/python/branches/p3yk/Objects/bytesobject.c?rev=53064&view=auto

About the only thing it has gained is a .join() method, but seems to
have lost append, count, extend, index, insert, pop, remove.  From your
later comments, it seems as though the methods I'm looking for just
haven't been implemented yet, but are going in.

> > While I would have some personal uses for bytes, all of them could be
> > fulfilled with an expanded array type.
> 
> Well, that's what it is, but without the baggage of being able to specify
> how it maps to Python objects (that's up to the encode/decode operations
> instead).

Except that bytes(...)[0] is an integer in range(256).  That smells like
array.array('B', ...) to me.
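
A small sketch of the comparison, assuming the bytes type as it ended up in released Python 3 (the p3yk branch at the time may have differed in detail). Indexing either object gives back a plain integer.

```python
import array

data = bytes([104, 105])            # the two bytes of b"hi"
as_array = array.array("B", data)   # typecode "B": unsigned 8-bit integers

first_byte = data[0]
first_item = as_array[0]
print(first_byte, first_item, type(first_byte).__name__)
```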

> > If I could have my way
> > <dreaming>I'd rename string and unicode, fold some of the features of
> > bytes into array, and make socket, etc., return the renamed string
> > type</dreaming>.
> 
> But which of the two renamed string types? The 8-bit or the unicode string?

8-bit; unicode strings being returned from sockets, os.read(), etc.,
would be a waste of time and memory.

> > In the case of the standard library modules that deal with
> > sockets, the only change would generally be replacing 'const' with
> > b'const'.  That could *almost* be automatic, and would be significantly
> > faster (for a computer + human) than converting all of the .split(),
> > .find(), etc., uses in the ftplib, *Server, smtplib, smtpd, etc. to
> > bytes equivalents (or converting to and from unicode).
> 
> Actually, while they don't exist now, I plan for the bytes type to
> have .split() and .find() and most other string methods *except*
> .lower() and .islower() and everything else that interprets bytes as
> characters.

Thank Guido.  If bytes gets those methods, then 30% of my concerns
regarding the unicode conversion go out the window.
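
As a rough illustration of what that buys protocol code, here is a sketch that assumes bytes has .split() and .find() as planned above (released Python 3 does provide them). The SMTP-style line is made up for the example.

```python
# Hypothetical input line, as it might arrive from a socket.
line = b"MAIL FROM:<user@example.com>\r\n"

# Split the command from its argument without ever decoding to text.
verb, rest = line.rstrip(b"\r\n").split(b":", 1)

# Locate a keyword by byte offset.
pos = line.find(b"FROM")
print(verb, rest, pos)
```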

> > It would take me perhaps 20 minutes to update asyncore, asynchat and
> > smtpd with the b'binary' semantic.  Based on the last list of methods I
> > saw for bytes in PEP 358, I would be, more or less, doing bytes.decode
> > ('latin-1') instead of trying to deal with the *crippled* interface that
> > bytes offers.
> 
> So forget that PEP and help adding these methods to the bytes type in
> the p3yk branch.
> 
> The b"..." literal proposal is not unpleasant, as long as we can limit
> it to ASCII characters and hex/octal escapes.

With a b"..." literal producing bytes (or even a renamed 8-bit string
type), another 30% of my concerns regarding the unicode conversion go
out the window.

Limiting it to ascii and hex/octal escapes is perfectly reasonable to me,
though I don't know enough about the underlying parser to know if such
restrictions are possible, with or without a defined coding: directive
at the beginning of the file.
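
For what it is worth, a sketch of how such a restriction looks from the outside, assuming the behaviour of released Python 3 (not something settled in this thread): escapes are accepted, and a non-ASCII character inside the literal is rejected at compile time. The compiles() helper is only an illustration.

```python
# Hex escape for "H", octal escape for "e", then plain ASCII.
header = b"\x48\145llo"

def compiles(source):
    """Hypothetical helper: True if the source text parses as an expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

ascii_ok = compiles("b'abc'")
# The literal below contains U+00E9, which is outside ASCII.
non_ascii_ok = compiles("b'\u00e9'")
print(header, ascii_ok, non_ascii_ok)
```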

> > Regardless, the performance of those modules would likely suffer when
> > confronted with bytes rather than a renamed str, as the current bytes
> > type lacks a large number of convenience methods, that I previously
> > complained about it not having (which is why I brought up the string
> > view and sample implementation in late August/early September 2006).
> 
> I think you misunderstood the plans for bytes. The plan is for the
> performance with bytes to scream, in part because they are immutable
> so one would occasionally save copying a buffer an extra time.

...mutable, but yeah - prior to your above statements saying 'we are
going to add find, split, and a bunch of other goodies', I was under the
impression that PEP 358 was more or less the API that we would be
getting - which just about made me cry, until I remembered Python 2.x.

> > > > 4) making string literals 'hello' be unicode
> > > > 5) allow for b'constant' be the renamed str
> > > > 6) add a mandatory 3rd argument to file/open which is the codec to use
> > > > for reading
> > >
> > > And how does that help users or compatibility?
> >
> > Users who need binary literals (like every socket module in the standard
> > library, anyone who does processing of any non-unicode disk/socket/pipe
> > data, like marshal or pickle, etc.) wouldn't go insane and add bugs
> > trying to switch to the bytes type, or add performance overhead trying
> > to convert the received bytes to unicode to get a useful API.
> 
> Let's drop the hyperbole.

If bytes didn't get .find(), .split(), (hopefully .partition()), etc.,
that isn't hyperbole.  The PEP 358 API is horrible.  With bytes getting
those methods, the above statements are no longer relevant.

> > Presumption:
> >     a = open(filename, 'r' or 'w' ['+'], codec)
> > will open a file as unicode in Py3k (if I am wrong, please correct me).
> 
> Right.
> 
> > Proposal:
> >     b = somename(filename, 'r' or 'w' ['+'])
> > will be equivalent to:
> >     b = open(filename, 'rb' or 'wb' ['+'])
> > today.  This prevents the confusion over different argument values
> > resulting in different types being returned and accepted by certain
> > methods.
> 
> Possibly. Though if we keep the 'rb' semantics for open() and this is
> just an alias, I'm not sure what we gain except Two Ways To Do It.

Well, if we moved bytes reading/writing off to the alternate constructor,
then there would be one way to open a file containing unicode, and
another way to open a file containing binary data, which by definition
isn't text, so we should be able to ignore '\r\n' conversions (though I
would miss it, it may be a good idea).

> In your view, what *do* we gain by using separate factories for binary
> and text files? (Except some opportunity for static typechecking, as
> binary files don't have the same API!)

At one time there was a fairly substantial argument over foo(a, b), or
a.foo(b), returning different types depending on the *value* of b.  For
example...

    def decode_codec(a, b):
        return a.decode(b)

    decode_codec('68656c6c6f20776f726c64', 'hex')   # -> 'hello world'
    decode_codec('hello world', 'latin-1')          # -> u'hello world'
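
The same effect can still be reproduced in released Python 3 through codecs.decode(), which is an assumption on my part about a runtime newer than the one discussed here. The return type follows the *value* of the second argument.

```python
import codecs

def decode_codec(a, b):
    # Same shape as the example above, routed through the codecs module.
    return codecs.decode(a, b)

from_hex = decode_codec(b"68656c6c6f20776f726c64", "hex")   # bytes-to-bytes codec
from_latin1 = decode_codec(b"hello world", "latin-1")       # bytes-to-text codec
print(type(from_hex).__name__, type(from_latin1).__name__)
```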

By offering a secondary function that *only* dealt with the reading and
writing of bytes (or 8-bit renamed str), then we wouldn't have to worry
about...

    open(filename, 'r', 'latin-1').read()
    open(filename, 'r').read()

returning different types.  The latter would be spelled...

    somename(filename, 'r')

And it would be obvious to all readers that one is opening a binary file
and should expect to have .read() return bytes.
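
A minimal sketch of the two-factory idea, assuming Python 3's open() underneath. The names open_text and open_binary are hypothetical stand-ins (open_binary plays the part of "somename"); neither exists in the standard library.

```python
import os
import tempfile

def open_text(filename, mode, codec):
    """Hypothetical text-only factory: .read() always returns str."""
    return open(filename, mode, encoding=codec)

def open_binary(filename, mode):
    """Hypothetical binary-only factory: .read() always returns bytes."""
    return open(filename, mode + "b")

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open_binary(path, "w") as f:
    f.write(b"caf\xe9\r\n")

with open_text(path, "r", "latin-1") as f:
    as_text = f.read()      # decoded, with '\r\n' translated to '\n'
with open_binary(path, "r") as f:
    as_bytes = f.read()     # raw bytes, no newline conversion
print(repr(as_text), repr(as_bytes))
```

The return type is then fixed by which function was called, not by the value of the mode string.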

> >     sock.send(b'Header: value\r\n')
> >               ^
> > The above change can be more or less automatic.  The below?
> >
> >     sock.send(bytes('Header: value\r\n', 'latin-1'))
> >
> >     sock.send('Header: value\r\n'.encode('latin-1'))
> >
> > Either of the above is 17 characters of noise that really shouldn't need
> > to be there.
> 
> If the spelling of a bytes string with an ASCII character value is all
> you are complaining about, you should have said so right away.

Not just bytes with ascii character values, but not needing to jump
through hoops to send, write, etc., more or less 'fixed' data to a
handle.
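
To put a number on the noise, here is a quick sketch, assuming Python 3 semantics for all three spellings. They produce the same bytes, and the two longer ones each cost 17 extra characters of source.

```python
literal = b"Header: value\r\n"
constructed = bytes("Header: value\r\n", "latin-1")
encoded = "Header: value\r\n".encode("latin-1")
same = literal == constructed == encoded

# Compare the source spellings themselves (raw strings keep the backslashes).
short = r"b'Header: value\r\n'"
long_ctor = r"bytes('Header: value\r\n', 'latin-1')"
long_encode = r"'Header: value\r\n'.encode('latin-1')"
noise = (len(long_ctor) - len(short), len(long_encode) - len(short))
print(same, noise)
```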

> IMO the hard part with automatically converting sock.send('abc') to
> either alternative is to know when to convert and when not to
> convert; the conversion itself is trivial using the sandbox/2to3
> refactoring tool. You really should have a look at that.

In a few weeks when I'm done with my thesis defense.

If one adds my "concerns are reduced by X%" statements above, one will
notice that it only adds up to 60%.  The remaining 40% of my concerns are
more or less related to the pain of conversion.  b"..." and a usable
bytes API do help things significantly, but all conversions are a pain,
especially with a standard library the size of Python's.

A pessimist would say, "leave everything as it is, but make str+unicode
raise an exception" - and aside from pointing to the half-dozen "you are
so wrong" posts in response to my "unicode is easy" claim some time last
year, it would be hard for me to disagree with the "don't change"
position.  I'm sure I can get along *after* the changes, but the changes
aren't going to be pleasant.  Speaking of which, do all of the modules
have maintainers?  Make the maintainers convert them!

 - Josiah