[Python-3000] Immutable bytes type and dbm modules

Wed Aug 8 09:57:05 CEST 2007

I agree completely with Talin's suggestion for the arrangement of the
mutable and immutable alternatives, but there are a couple of points
here that I wanted to answer.

On 8/6/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > For low-level I/O code, I totally agree that a mutable buffery object
> > is needed.
>
> The code we are looking at right now (dbm interfaces) *is* low-level
> I/O code.

But you want an immutable interface to it that looks like a dict. I
think that's entirely appropriate because the underlying C code is the
real low-level I/O code, while the Python wrapper is actually pretty
high-level.

> > For example, to support re-using bytes buffers, socket.send()
> > would need to take start and end offsets into its bytes argument.
> > Otherwise, you have to slice the object to select the right data,
> > which *because bytes are mutable* requires a copy. PEP 3116's .write()
> > method has the same problem. Making those changes is, of course,
> > doable, but it seems like something that should be consciously
> > committed to.
>
> Sure. There are several ways to do that, including producing view
> objects - which would be possible even though the underlying buffer
> is mutable; the view would then be just as mutable.
>
> > Python 2 seems to have gotten away with doing all the buffery stuff in
> > C. Is there a reason Python 3 shouldn't do the same?
>
> I think Python 2 has demonstrated that this doesn't really work. People
> repeatedly did += on strings (leading to quadratic performance),

This argues for mutable strings at least as much as it argues for
mutable high-level bytes. Now that they exist, generators are a pretty
natural way to build up immutable objects, so people certainly have
the option to avoid quadratic performance whatever the mutability of
their objects.
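
A minimal sketch of the generator approach, assuming Python 3 syntax as
it was eventually released. The chunks() helper is hypothetical and
stands in for whatever produces the pieces:

```python
def chunks(n):
    """Hypothetical data source: yields n small byte strings."""
    for i in range(n):
        yield b"%d," % i

# join() consumes the generator and builds the immutable result once.
joined = b"".join(chunks(5))

# Repeated += may copy everything accumulated so far on each step,
# which is where the quadratic behaviour comes from.
accumulated = b""
for piece in chunks(5):
    accumulated += piece
```

Both spellings produce the same immutable value; only the cost of
getting there differs.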

> invented the buffer interface (which is semantically flawed), added
> direct support for mmap, and so on.

And those still exist in Python 3 (perhaps in an updated form). A
mutable bytes doesn't obsolete them. It may be a handy concrete type
for the buffer interface, but then so is array.
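
As an illustration only, assuming the names that released Python 3
ended up using (bytearray for the mutable bytes type, memoryview for
the view object), both a mutable bytes object and an array can back
the same kind of no-copy view:

```python
from array import array

mutable_bytes = bytearray(b"hello world")
numbers = array("B", [1, 2, 3, 4])

# A memoryview exposes either object's buffer without copying it.
for buf in (mutable_bytes, numbers):
    view = memoryview(buf)
    window = view[1:3]   # slicing a view does not copy the data
    window[0] = 0xFF     # the write goes through to the underlying object
```

Slicing the view, rather than the object itself, is one way to select a
start and end offset without paying for a copy.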

> > me: [benchmarks showing 10% faster construction]
[Probably this just means that something hasn't been optimized enough
on Intel Macs]
> Martin: [same benchmarks showing 10% faster copying]

I'd really say it's the same result (and I shouldn't have claimed
otherwise in my email; sorry). A 10% difference either way is likely
to be dwarfed by the costs of actually doing I/O. Before picking
interfaces around the notion that either allocation or copying is
expensive, it would be wise to run benchmarks to figure out what the
performance actually looks like.
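
A rough sketch of such a benchmark using the standard timeit module.
The buffer size and iteration count are arbitrary choices made for
illustration, and the numbers will vary by machine:

```python
import timeit

SIZE = 4096  # arbitrary buffer size, chosen only for illustration
source = bytes(SIZE)
reusable = bytearray(SIZE)

def allocate_fresh():
    # Allocate a new zero-filled mutable buffer on every call.
    return bytearray(SIZE)

def copy_into_existing():
    # Reuse one buffer and copy the source data into it.
    reusable[:] = source
    return reusable

alloc_seconds = timeit.timeit(allocate_fresh, number=10_000)
copy_seconds = timeit.timeit(copy_into_existing, number=10_000)
print(f"allocate: {alloc_seconds:.4f}s  copy: {copy_seconds:.4f}s")
```

A fair comparison would also time the actual I/O call, since that is
the cost the other two are being weighed against.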

On 8/7/07, Guido van Rossum <guido at python.org> wrote:
> [list()] would not work with low-level I/O (sometimes readinto() is useful)

When is "sometimes"? Is it the same times that rewriting into C would
be a good idea? I'd really like to see any benchmarks people have
written to decide this.
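
For reference, a sketch of the readinto() pattern in question, using
io.BytesIO as a stand-in for a real file or socket and assuming the
mutable type is spelled bytearray:

```python
import io

payload = b"abcdefghij" * 3          # stand-in for a file or socket
stream = io.BytesIO(payload)

buffer = bytearray(8)                # allocated once, reused for every read
received = []
while True:
    count = stream.readinto(buffer)  # fills the buffer in place
    if not count:
        break
    # Copy out only the bytes that were actually read this time.
    received.append(bytes(buffer[:count]))

result = b"".join(received)
```

The buffer is reused across reads, but this loop still copies each
chunk out of it, so whether it beats a plain read() is exactly the kind
of thing a benchmark would have to show.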

In any case, the obvious thing to do may well differ between
performance-critical code and code that just needs to be readable. I
haven't seen any such distinguishing circumstance for the various
hashing techniques.

On 8/7/07, Guido van Rossum <guido at python.org> wrote:
> On 8/6/07, Jeffrey Yasskin <jyasskin at gmail.com> wrote:
> ...why are you waiting
> > for a show-stopper that requires an immutable bytes type rather than
> > one that requires a mutable one?
>
> Well one reason of course is that we currently have a mutable bytes
> object and that it works well in most situations.

The status quo argument must be weaker given that bytes hasn't existed
in any released Python. I was really asking why you picked mutable as
the first type to experiment with, and I guess I/O is the answer to
that, although it seems to me like a case of the tail wagging the dog.

On 8/7/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Jeffrey Yasskin wrote:
> > If you have mutable bytes and need an
> > immutable object, you could 1) convert it to an int (probably
> > big-endian),
>
> That's not a reversible transformation, because you lose
> information about leading zero bits.

Good point. You'd need a length along with the data, unless you're
dealing with a fixed-length thing like 4CCs. This is still probably
among the most efficient representations.
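
A small sketch of the length-plus-int idea, using int.from_bytes() and
int.to_bytes() from released Python 3. The pack() and unpack() helpers
are hypothetical names used only for this example:

```python
def pack(data):
    """Return an immutable (length, int) pair for a bytes-like object."""
    return len(data), int.from_bytes(data, "big")

def unpack(packed):
    """Rebuild the original bytes from a (length, int) pair."""
    length, value = packed
    return value.to_bytes(length, "big")

original = bytearray(b"\x00\x00ab")
packed = pack(original)

# Without the length, the two leading zero bytes are lost:
bare = int.from_bytes(original, "big")
shortest = bare.to_bytes((bare.bit_length() + 7) // 8, "big")
```

The pair is hashable, so it can serve as a dict key, and unpack()
recovers the leading zeros that the bare int cannot.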

On 8/7/07, Guido van Rossum <guido at python.org> wrote:
> But this is impractical -- a very common way to work is to build up a
> set incrementally. With immutable sets this would quickly become
> O(N**2). That's why set() is mutable and {...} creates a set, and the
> only way to create an immutable set is to use frozenset(...).

I would probably default to constructing an immutable set with a
generator. If I needed to do something more complicated, I'd fall back
to a mutable set. Of course, making the name for the immutable version
3 times as long biases the language toward the mutable version.
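
A short sketch of both styles, with made-up sample data:

```python
words = ["spam", "eggs", "spam", "ham"]

# Built in one pass from a generator expression, so there is no
# repeated rebuilding of intermediate immutable sets.
lengths = frozenset(len(w) for w in words)

# The mutable fallback, for when construction is more involved.
seen = set()
for w in words:
    if len(w) > 3:
        seen.add(w)
frozen_seen = frozenset(seen)
```

Only the frozenset results can be used as dict keys or as members of
another set.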

-- 
Namasté,
Jeffrey Yasskin