[Python-3000] Immutable bytes -- looking for volunteer

Guido van Rossum guido at python.org
Thu Sep 27 00:58:00 CEST 2007


I find this semi-convincing. It would be very convincing in a
greenfield situation I think.

However there's quite a bit of Python 2.x code around that manipulates
*bytes* in the guise of 8-bit strings, and it uses tests like "if s[0]
== 'x': ..." frequently. This can of course be rewritten using a
slice, but not so easily when you're looping over bytes:

  for b in bb:
    if b == b'x': ...

If the elements are ints, this becomes the relatively ugly (because it
uses a 1-char *string*):

  for b in bb:
    if b == ord('x'): ...

So I've left this as an open issue in PEP 3137.
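
To make the trade-off concrete, here is a minimal sketch of the spellings
discussed above. It assumes the design in which indexing or iterating over
bytes yields ints, which is what Python 3 eventually shipped with. The
helper names are hypothetical and exist only for illustration.

```python
def starts_with_x(s: bytes) -> bool:
    # Slice form: a length-1 slice is still bytes, so it can be
    # compared against the literal b'x' (and is safe on empty input).
    return s[0:1] == b'x'


def count_x_ord(bb: bytes) -> int:
    # Looping yields ints, so the comparison needs ord() of a 1-char string.
    return sum(1 for b in bb if b == ord('x'))


def count_x_index(bb: bytes) -> int:
    # Alternative spelling that stays within bytes literals:
    # b'x'[0] is the int value of the byte.
    return sum(1 for b in bb if b == b'x'[0])
```

Under the other design (elements are bytes of length 1), the loop body
could keep the "b == b'x'" form unchanged.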

--Guido

On 9/26/07, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> On Tue, 25-09-2007 at 17:22 -0700, Guido van Rossum wrote:
>
> > OK. Though it's questionable even whether a slice of a mutable bytes
> > object should return a mutable bytes object (as it is not a shared
> > view). But as that is what PyBytes currently do it is certainly the
> > easiest...
>
> A slice of a list is a list, as it always has been, so letting slicing
> return the same type as the whole sequence is at least consistent and
> easy to explain. It is hard to say, though, what the typical use cases are.
>
> OTOH I believe individual elements of mutable or immutable bytes should
> be ints. Here is why I think that the analogy between characters and
> bytes is not strong enough to let elements of bytes be bytes of length 1
> just because strings do the same.
>
> Bytes are often computed, while characters are often only copied
> from place to place. Arithmetic is defined on ints, but not on bytes
> sequences of length 1. This means that computing a bytes sequence from
> scratch requires explicit conversions between a byte represented by an
> int and a byte represented by bytes of length 1.
>
> There is also a philosophical reason. The division of a string into
> characters is quite arbitrary. Consider UTF-16/UTF-32, combining
> characters, the encoding of Hangul, orthographic peculiarities,
> proportional fonts, ligatures, variant selectors etc.: all of these
> obscure the concept of a character and of string length. Consider also
> that a sequence of characters might have been decoded from
> or will be encoded into a sequence of bytes with a different length.
> This means that having atomic string components is more a technical
> convenience than a fundamental necessity, that the very concept of a
> character in a Unicode world is arbitrary, and the length of a string is
> more a technical detail of a representation than an inherent property of
> the text being represented. All this means that the concept of a string
> is more fundamental than a character.
>
> OTOH a byte count and byte offsets are usually important in protocols
> based on bytes (except text files when they encode human text). The
> individual bytes are in some sense delimited very sharply from each
> other, and the amount of information stored in one byte is very well
> defined. A single byte is a more important concept in a bytes world
> than a character in a text world; it's not merely a sequence with
> length 1.
>
> Having characters different from strings would require creation of a new
> type, because the existing int type is not very appropriate for single
> characters, because many properties differ, e.g. the effect of writing
> to a text file. To avoid the burden of creating a new type for a concept
> which is rarely useful in isolation, strings of length 1 have been
> reused. OTOH the existing int type seems appropriate for elements of
> bytes. They can be easily thought of as just integers in the range
> 0..255, and Python does not use separate integer types for different
> potential ranges.
>
> If you really don't like ints there, I would prefer immutable bytes even
> as elements of mutable bytes. Such an element is just a value isomorphic
> to an int, not an object with its own state. Moreover, for atomic objects
> like individual bytes, mutability does not help performance. Performance
> would be a reason to use a mutable type for non-atomic objects even when
> conceptually they are identityless values (mutability often helps in
> such cases because an object can be constructed piece by piece).
>
> --
>    __("<         Marcin Kowalczyk
>    \__/       qrczak at knm.org.pl
>     ^^     http://qrnik.knm.org.pl/~qrczak/
>
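
The quoted point that bytes are often computed rather than copied can be
illustrated with a short sketch. It assumes elements of bytes are ints in
range(256), as the quoted message proposes; the function names are
hypothetical.

```python
def checksum(data: bytes) -> int:
    # Elements are ints, so arithmetic applies directly with no conversion.
    return sum(data) % 256


def xor_mask(data: bytes, key: int) -> bytes:
    # bytes() accepts an iterable of ints in range(256), so a computed
    # sequence can be built without a per-element int/bytes round trip.
    return bytes(b ^ key for b in data)
```

If elements were bytes of length 1 instead, each step here would need an
explicit conversion to an int and back.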


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

