[Python-Dev] methods on the bytes object

Mon May 1 07:38:59 CEST 2006

Josiah Carlson wrote:
>> I mean unicode strings, period. I can't imagine what "unicode strings
>> which do not contain data" could be.
> 
> Binary data as opposed to text.  Input to an array.fromstring(),
> struct.unpack(), etc.

You can't/shouldn't put such data into character strings: you need
an encoding first. Neither array.fromstring nor struct.unpack will
produce/consume type 'str' in Python 3; both will operate on the
bytes type. So fromstring should probably be renamed frombytes.
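
A minimal sketch of this point, assuming the Python 3 that was eventually
released, where the array method is indeed spelled frombytes. The payload
and variable names are made up for illustration.

```python
import struct
from array import array

# Hypothetical binary payload: two little-endian unsigned 16-bit integers.
payload = bytes([0x01, 0x00, 0x02, 0x00])

# struct.unpack consumes a bytes object, not a character string.
first, second = struct.unpack("<HH", payload)

# array.frombytes is the bytes-based counterpart of the old fromstring.
values = array("B")
values.frombytes(payload)

# Going from bytes to text requires naming an encoding explicitly.
text = b"caf\xc3\xa9".decode("utf-8")
```

The same four bytes can be read as two integers or as four array items;
nothing about them is text until an encoding is applied.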

> Certainly it is the case that right now strings are used to contain
> 'text' and 'bytes' (binary data, encodings of text, etc.).  The problem
> is in the ambiguity of Python 2.x str containing text where it should
> only contain bytes. But in 3.x, there will continue to be an ambiguity,
> as strings will still contain bytes and text (parsing literals, see the
> somewhat recent argument over bytes.encode('base64'), etc.).

No. In Python 3, type 'str' cannot be interpreted to contain bytes.
Operations that expect bytes and are given type 'str', and no encoding,
should raise TypeError.
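
As a rough illustration, this is how that rule behaves in Python 3 as it
was later released. The helper name unpack_or_error is hypothetical and
exists only for this sketch.

```python
import struct

def unpack_or_error(data):
    """Return the unpacked integer, or the exception name if unpacking fails."""
    try:
        return struct.unpack("<I", data)[0]
    except TypeError as exc:
        return type(exc).__name__

# A str passed where bytes are expected is rejected.
from_text = unpack_or_error("abcd")

# The same characters work once they are encoded explicitly.
from_bytes = unpack_or_error("abcd".encode("ascii"))

# Constructing bytes from a str without an encoding is rejected as well.
try:
    bytes("abcd")
    bytes_without_encoding = "ok"
except TypeError:
    bytes_without_encoding = "TypeError"
```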

> We've not removed the problem, only changed it from being contained in
> non-unicode strings to be contained in unicode strings (which are 2 or 4
> times larger than their non-unicode counterparts).

We have removed the problem.

> Within the remainder of this email, there are two things I'm trying to
> accomplish:
> 1. preserve the Python 2.x string type

I would expect that people try that. I'm -1.

> 2. make the bytes object more palatable regardless of #1

This might be good, but we have to be careful to not create a type
that people would casually use to represent text.

> I do, however, believe that the Python 2.x string type is very useful
> from a data parsing/processing perspective.

You have to explain your terminology somewhat better here: What
applications do you have in mind when you are talking about
"parsing/processing"? To me, "parsing" always means "text", never
"raw bytes". I'm thinking of the Chomsky classification of grammars,
EBNF, etc. when I hear "parsing".

> Look how successful and
> effective it has been so far in the history of Python.  In order to make
> the bytes object be as effective in 3.x, one would need to add basically
> all of the Python 2.x string methods to it

The premise of this clause is misguided: the bytes type doesn't
need to be as effective, since the string type is as effective in 3.x,
so you can do all parsing based on strings.

> (having some mechanism to use
> slices of bytes objects as dictionary keys (if data[:4] in handler: ...
> -> if tuple(data[:4]) in handler: ... ?) would also be nice).

You can't use the bytes type as a dictionary key because it is
mutable. Use the string type instead.
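
A small sketch of the hashing issue, under one loud assumption: it uses
bytearray, the mutable byte type of the Python 3 that was eventually
released, as a stand-in for the mutable bytes type discussed here (in that
release, bytes itself ended up immutable). The handler table and its
values are invented for the example.

```python
# Hypothetical dispatch table keyed by a four-byte file signature.
handler = {b"RIFF": "wave parser", b"\x89PNG": "png parser"}

# Mutable buffer holding the start of some binary input.
data = bytearray(b"RIFF\x24\x00\x00\x00WAVE")

# A slice of a mutable buffer is itself mutable, so it cannot be hashed.
try:
    handler[data[:4]]
    mutable_lookup = "ok"
except TypeError:
    mutable_lookup = "unhashable"

# Copying the slice into an immutable object makes it usable as a key.
immutable_lookup = handler[bytes(data[:4])]
```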

> So, what to do?  Rename Python 2.x str to bytes.  The name of the type
> now confers the idea that it should contain bytes, not strings.

It seems that you want an immutable version of the bytes type. As I
don't understand what you mean by "parsing", I cannot see the need for
it; I think having two different bytes types is confusing.

Regards,
Martin