[Python-3000] Immutable bytes -- looking for volunteer

Wed Sep 26 22:00:56 CEST 2007

On Tue, 25-09-2007 at 17:22 -0700, Guido van Rossum wrote:

> OK. Though it's questionable even whether a slice of a mutable bytes
> object should return a mutable bytes object (as it is not a shared
> view). But as that is what PyBytes currently do it is certainly the
> easiest...

A slice of a list is a list, as it always has been, so letting slicing
return the same type as the whole sequence is at least consistent and
easy to explain. It is hard to say, though, what the typical use cases
are.

OTOH I believe individual elements of mutable or immutable bytes should
be ints. Here is why I think that the analogy between characters and
bytes is not strong enough to let elements of bytes be bytes of length 1
just because strings do the same.

Bytes are often computed, while characters are often only copied
from place to place. Arithmetic is defined on ints, but not on bytes
sequences of length 1. If elements were bytes of length 1, computing a
bytes sequence from scratch would require explicit conversions between a
byte represented by an int and a byte represented by bytes of length 1.
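
A minimal sketch of that difference, assuming elements behave as ints
(which is how indexing and iteration over bytes work in released
Python 3). The names xor_mask and xor_mask_len1 are made up for
illustration; the second function emulates length-1 elements through
slicing, only to show the extra conversions.

```python
def xor_mask(data, key):
    """XOR every byte with key, relying on elements being ints."""
    # Iterating over bytes yields ints, so ^ applies directly.
    return bytes(b ^ key for b in data)


def xor_mask_len1(data, key):
    """Same computation when each element is a bytes object of length 1."""
    out = []
    for i in range(len(data)):
        elem = data[i:i + 1]        # bytes of length 1 (emulated by slicing)
        value = ord(elem) ^ key     # convert to int before doing arithmetic
        out.append(bytes([value]))  # convert back to bytes of length 1
    return b"".join(out)
```

Both functions return the same result; the second one just pays for a
round trip through ord() and bytes([...]) on every element.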

There is also a philosophical reason. The division of a string into
characters is quite arbitrary. Consider UTF-16 versus UTF-32, combining
characters, the encoding of Hangul, orthographic peculiarities,
proportional fonts, ligatures, variation selectors etc.: all of these
obscure the concept of a character and of string length. Consider also
that a sequence of characters might have been decoded from, or will be
encoded into, a sequence of bytes with a different length. This means
that having atomic string components is more a technical convenience
than a fundamental necessity, that the very concept of a character in a
Unicode world is arbitrary, and that the length of a string is more a
technical detail of a representation than an inherent property of the
text being represented. In short, the concept of a string is more
fundamental than that of a character.
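
To make the point about string length concrete, here is a small sketch
using the standard unicodedata module. The helper name lengths is
hypothetical; the example text is the letter e with an acute accent,
written once as a single code point and once as a base letter plus a
combining character.

```python
import unicodedata


def lengths(text):
    """Report the size of the same text under several representations."""
    return {
        "code points": len(text),
        "utf-8 bytes": len(text.encode("utf-8")),
        "utf-16 bytes": len(text.encode("utf-16-le")),
    }


composed = "\u00e9"                                  # precomposed form
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301

# The two strings represent the same text to a reader, yet every
# length reported by lengths() differs between them.
composed_lengths = lengths(composed)
decomposed_lengths = lengths(decomposed)
```

None of these numbers is "the" length of the text; each is a property of
one particular representation.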

OTOH a byte count and byte offsets are usually important in protocols
based on bytes (except for text files, when they encode human text). The
individual bytes are in some sense delimited very sharply from each
other, and the amount of information stored in one byte is very well
defined. A single byte is a more important concept in a bytes world
than a character is in a text world; it's not merely a sequence of
length 1.
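
As an illustration of why offsets and counts matter, consider a made-up
frame layout (an assumption for this sketch, not any real protocol):
one type byte, a two-byte big-endian payload length, then the payload.
The function name parse_frame is hypothetical as well; struct is the
standard library module.

```python
import struct


def parse_frame(buf):
    """Split a frame into (type, payload) using fixed byte offsets."""
    msg_type = buf[0]                             # a single byte, as an int
    (length,) = struct.unpack_from(">H", buf, 1)  # 2 bytes at offset 1
    payload = buf[3:3 + length]                   # exactly `length` bytes
    if len(payload) != length:
        raise ValueError("truncated frame")
    return msg_type, payload
```

For example, parse_frame(b"\x01\x00\x03abcXYZ") yields (1, b"abc"): the
trailing bytes are ignored because the length field says the payload is
three bytes long.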

Having characters different from strings would require creating a new
type, because the existing int type is not very appropriate for single
characters: many properties differ, e.g. the effect of writing
to a text file. To avoid the burden of creating a new type for a concept
which is rarely useful in isolation, strings of length 1 have been
reused. OTOH the existing int type seems appropriate for elements of
bytes. They can be easily thought of as just integers in the range
0..255, and Python does not use separate integer types for different
potential ranges.

If you really don't like ints there, I would prefer immutable bytes even
as elements of mutable bytes. Such an element is just a value isomorphic
to an int, not an object with its own state. Moreover, for atomic objects
like individual bytes, mutability does not help performance. Performance
would be a reason to use a mutable type for non-atomic objects even when
conceptually they are identityless values (mutability often helps in
such cases because an object can be constructed piece by piece).
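
A short sketch of "constructed piece by piece", assuming the mutable
type is spelled bytearray and the immutable one bytes, as in released
Python 3. The function name append_checksum and the checksum rule (sum
of the bytes modulo 256) are invented for this example.

```python
def append_checksum(payload):
    """Return payload followed by one checksum byte (sum modulo 256)."""
    buf = bytearray(payload)    # mutable buffer, built up in place
    buf.append(sum(buf) % 256)  # elements are ints, so sum() just works
    return bytes(buf)           # freeze into an immutable value
```

The buffer as a whole benefits from being mutable while it is built;
the individual elements are plain ints with no state of their own.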

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/