[Python-Dev] methods on the bytes object

Mon May 1 06:19:04 CEST 2006

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson wrote:
> >> I think what you are missing is that algorithms that currently operate
> >> on byte strings should be reformulated to operate on character strings,
> >> not reformulated to operate on bytes objects.
> > 
> > By "character strings" can I assume you mean unicode strings which
> > contain data, and not some new "character string" type?
> 
> I mean unicode strings, period. I can't imagine what "unicode strings
> which do not contain data" could be.

Binary data as opposed to text: input to array.fromstring(),
struct.unpack(), etc.

> > I know I must
> > have missed some conversation. I was under the impression that in Py3k:
> > 
> > Python 1.x and 2.x str -> mutable bytes object
> 
> No. Python 1.x and 2.x str -> str, Python 2.x unicode -> str
> In addition, a bytes type is added, so that
> Python 1.x and 2.x str -> bytes
> 
> The problem is that the current string type is used both to represent
> bytes and characters. Current applications of str need to be studied,
> and converted appropriately, depending on whether they use
> "str-as-bytes" or "str-as-characters". The "default", in some
> sense of that word, is that str applications are assumed to operate
> on character strings; this is achieved by making string literals
> objects of the character string type.

Certainly it is the case that right now strings are used to contain both
'text' and 'bytes' (binary data, encodings of text, etc.).  The problem
is the ambiguity of a Python 2.x str containing text when it should
only contain bytes. But in 3.x the ambiguity will remain, since
strings will still contain both bytes and text (parsing literals, see the
somewhat recent argument over bytes.encode('base64'), etc.). We've not
removed the problem, only moved it from non-unicode strings to unicode
strings (which are 2 or 4 times larger than their non-unicode
counterparts).

Within the remainder of this email, there are two things I'm trying to
accomplish:
1. preserve the Python 2.x string type
2. make the bytes object more palatable regardless of #1

The current plan (from what I understand) is to make all string literals
equivalent to their Python 2.x u-prefixed equivalents, and to leave
u-prefixed literals alone (unless the u prefix is being removed?). I
won't argue because I think it is a great idea.

I do, however, believe that the Python 2.x string type is very useful
from a data parsing/processing perspective.  Look how successful and
effective it has been so far in the history of Python.  To make the
bytes object as effective in 3.x, one would need to add basically
all of the Python 2.x string methods to it.  Some mechanism for using
slices of bytes objects as dictionary keys would also be nice
(if data[:4] in handler: ... -> if tuple(data[:4]) in handler: ... ?).
Of course, these implementations ultimately already exist in the Python
2.x immutable string.
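
As a rough sketch of the dictionary-key issue, assuming a mutable bytes
object that behaves like array.array('B', ...): a slice of such an object
is itself mutable and unhashable, so it has to be converted before lookup.
The handler table and the dispatch() name are hypothetical, and the code is
written in Python 3 syntax purely for illustration.

```python
from array import array

# Hypothetical handler table keyed by the first four bytes of a record.
# Tuples of integers stand in for hashable byte slices.
handlers = {
    tuple(b"RIFF"): "riff container",
    tuple(b"\x89PNG"): "png image",
}


def dispatch(data):
    """Look up a handler by the 4-byte prefix of a mutable byte array."""
    # A slice of an array is itself a mutable array and cannot be hashed,
    # so it is converted to a tuple before being used as a dictionary key.
    key = tuple(data[:4])
    return handlers.get(key, "unknown")


buf = array("B", b"RIFF\x00\x00\x00\x00WAVE")
print(dispatch(buf))
```

With an immutable, hashable bytes type the tuple() conversion would be
unnecessary, which is the convenience argued for above.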

So, what to do?  Rename Python 2.x str to bytes.  The name of the type
now conveys the idea that it should contain bytes, not strings.  If
bytes literals are deemed necessary (I think they would be nice, but not
required), have b"..." as the bytes literal.  Not having a literal, I
think, will generally reduce the number of people who try to put text
into bytes.

Ahh, but what about the originally thought-about bytes object?  That
mutable, file-like, string-like thing which is essentially
array.array('B', ...) with some other useful stuff?  Those are certainly
still useful, though less for data parsing/processing than as a mutable
in-memory buffer (not the Python built-in buffer object, but the
equivalent of C's char* buf = (char*)malloc(...); ).  I currently
use mmaps and array objects for that (with limited success), but a new
type in the collections module (perhaps mutablebytes?) which offers such
functionality would be perfectly reasonable.  So would moving the
immutable bytes object there if it lacked a literal, or even switching
to a bytes/frozenbytes naming.

If we were to go to the mutable/immutable bytes object pair, we could
still give mutable bytes .read()/.write(), slice assignment, etc., and
even offer an integer view mechanism (for iteration, assignment, etc.). 
Heck, we could do the same thing for the immutable type (except for
.write(), assignment, etc.), and essentially replace
cStringIO(initializer) (of course mutable bytes effectively replace
cStringIO()).
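
To make the mutable half of that pair concrete, here is a minimal sketch.
The MutableBytes class and its methods are hypothetical; it leans on
Python 3's bytearray as the underlying storage only for illustration and
is not a proposal for the actual implementation.

```python
class MutableBytes:
    """Hypothetical mutable byte buffer with a file-like cursor."""

    def __init__(self, initial=b""):
        self._data = bytearray(initial)
        self._pos = 0

    def write(self, chunk):
        # Overwrite at the cursor, extending the buffer if needed.
        end = self._pos + len(chunk)
        self._data[self._pos:end] = chunk
        self._pos = end

    def read(self, n=-1):
        # Return an immutable copy of up to n bytes from the cursor.
        end = len(self._data) if n < 0 else self._pos + n
        chunk = bytes(self._data[self._pos:end])
        self._pos += len(chunk)
        return chunk

    def seek(self, pos):
        self._pos = pos

    def __setitem__(self, index, value):
        # Index and slice assignment go straight to the storage.
        self._data[index] = value

    def __iter__(self):
        # The "integer view": iteration yields ints in range(256).
        return iter(self._data)


buf = MutableBytes(b"hello world")
buf[0:5] = b"HELLO"   # slice assignment
print(buf.read(5))    # file-like read from the cursor
print(list(MutableBytes(b"\x01\x02")))
```

An immutable counterpart would keep read(), seek() and iteration and drop
write() and item assignment.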

> > and that there would be some magical argument
> > to pass to the file or open open(fn, 'rb', magical_parameter).read() ->
> > bytes.
> 
> I think the precise details of that are still unclear. But yes,
> the plan is to have two file modes: one that returns character
> strings (type 'str') and one that returns type 'bytes'.

Here's a thought: require 'b' or 't' as an argument to open/file, with 't'
also taking an optional encoding argument (which defaults to the current
default encoding).  If one attempts to write bytes to a text file, or
text to a bytes file, raise IOError with "Cannot write bytes to a text
file" or "Cannot write text to a bytes file" respectively.  Passing an
encoding to a 'b' file could either raise an exception or provide an
encoding for text writing (removing the "Cannot write text to a bytes
file" error), though I wouldn't want to do any encoding by default in
this case.
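
A small sketch of how that checking might behave, written in Python 3
syntax for illustration. StrictFile is a hypothetical wrapper around an
existing binary stream, not a real open() signature; of the two options
for an encoding on a 'b' file it picks "raise an exception", which is an
assumption made here.

```python
import io
import sys


class StrictFile:
    """Hypothetical wrapper that rejects data of the wrong kind."""

    def __init__(self, raw, mode, encoding=None):
        if mode not in ("b", "t"):
            raise ValueError("mode must be 'b' or 't'")
        if mode == "b" and encoding is not None:
            # Assumption: an encoding on a bytes file is an error.
            raise ValueError("encoding is not allowed for a bytes file")
        self._raw = raw
        self._mode = mode
        # Text files fall back to the current default encoding.
        self._encoding = encoding or sys.getdefaultencoding()

    def write(self, data):
        if self._mode == "t":
            if not isinstance(data, str):
                raise IOError("Cannot write bytes to a text file")
            data = data.encode(self._encoding)
        elif isinstance(data, str):
            raise IOError("Cannot write text to a bytes file")
        self._raw.write(data)


sink = io.BytesIO()
StrictFile(sink, "t", encoding="utf-8").write("caf\u00e9")
print(sink.getvalue())
```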

If there are mutable/immutable bytes as I describe above, reads on such
files could produce either type, but always the same one (immutable
seems reasonable, at least from a consistency perspective), while writes
could accept either (or even buffer()s).

> > I mention this because I do binary data handling, some ''.join(...) for
> > IO buffers as Guido mentioned (because it is the fastest string
> > concatenation available in Python 2.x), and from this particular
> > conversation, it seems as though Python 3.x is going to lose
> > some expressiveness and power.
> 
> You certainly need a "concatenate list of bytes into a single
> bytes". Apparently, Guido assumes that this can be done through
> bytes().join(...); I personally feel that this is over-generalization:
> if the only practical application of .join is the empty bytes
> object as separator, I think the method should be omitted.
> 
>   bytes(...)
>   bytes.join(...)

I don't know if the only use-case for bytes would be ''.join() (all of
mine happen to be; my non-''.join() cases are text), but I don't see the
motivation for allowing _only_ that particular use.  Supporting an
arbitrary separator is a small increment in the implementation; type
checking and data copying should be more significant.
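
To illustrate the "increment" point, here is a sketch of a join over byte
chunks in Python 3 syntax. join_bytes() is a hypothetical helper, not the
interpreter's implementation; the separator adds one size term and one
extra copy per element, while the type checks and data copies are needed
either way.

```python
def join_bytes(sep, parts):
    """Concatenate byte chunks with sep between them (sep may be empty)."""
    parts = list(parts)
    # Type checking is required whether or not a separator is used.
    for part in parts:
        if not isinstance(part, (bytes, bytearray)):
            raise TypeError("expected bytes, got %s" % type(part).__name__)
    if not parts:
        return b""
    # The separator only adds this one term to the size computation...
    total = sum(len(p) for p in parts) + len(sep) * (len(parts) - 1)
    out = bytearray(total)
    pos = 0
    for i, part in enumerate(parts):
        if i:
            # ...and this one extra copy per element after the first.
            out[pos:pos + len(sep)] = sep
            pos += len(sep)
        out[pos:pos + len(part)] = part
        pos += len(part)
    return bytes(out)


print(join_bytes(b"", [b"ab", b"cd"]))
print(join_bytes(b"\x00", [b"ab", b"cd"]))
```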

 - Josiah