[Python-3000] Making more effective use of slice objects in Py3k

Sun Aug 27 18:52:50 CEST 2006

"Guido van Rossum" <guido at python.org> wrote:
> 
> On 8/26/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > > With stringviews, you wouldn't need to be reindexing from the start of
> > > the original string.  The idiom would instead be a generalization of
> > > "for line in file:"
> > >
> > >     while data:
> > >         chunk, sep, data = data.partition()
> > >
> > > but the partition call would not need to copy the entire string; it
> > > could simply return three views.
> >
> > Also, with a little work, having string views be smart about
> > concatenation (if two views are adjacent to each other, like chunk,sep
> > or sep,data above, view1+view2 -> view3 on the original string), copies
> > could further be minimized, and the earlier problem with readline, etc.,
> > can be avoided.
> 
> But this assumes that string views are 99.999% indiscernible from
> regular strings -- if operations can return a copy or a view depending
> on how things happen to be laid out in memory, it should be trivial to
> write code that doesn't care whether it gets a string or a view.

That's what I'm working towards.  Let us say for a moment that the only
view that was on the table was the string view:
    view = stringview(st[, start[, stop]])

If st is a string, it produces a view on that string.  If st is already
a stringview, the new view references the original string rather than
the intermediate view (removing tree persistence[1]).

After a view is created, it can be treated like a string for
(effectively) everything because it has a Py_UNICODE* that has already
been adjusted to handle the offset argument.  Its implementation would
require copying the PyUnicodeObject struct and adding one more field:
    PyUnicodeObject* orig_object;
This would point to the original object for the later Py_DECREF (when
the view is destroyed), for view creation (again, we don't want tree
persistence), etc.

We can easily discover the 'start' offset again by comparing the
view->str and the orig_object->str pointers.
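
A minimal sketch of the flattening rule, written in pure Python rather
than at the C level described above.  The StringView class and its
base/start/stop attributes are hypothetical names chosen for
illustration; here the start offset is stored explicitly instead of
being recovered by pointer comparison.

```python
class StringView:
    """Hypothetical pure-Python stand-in for the proposed stringview."""

    def __init__(self, st, start=0, stop=None):
        if isinstance(st, StringView):
            # Flatten: point at the original string, never at the
            # intermediate view, so that view can be freed on its own.
            base, offset, length = st.base, st.start, len(st)
        else:
            base, offset, length = st, 0, len(st)
        # Resolve negative or missing bounds relative to what is viewed.
        lo, hi, _ = slice(start, stop).indices(length)
        hi = max(hi, lo)
        self.base = base
        self.start = offset + lo
        self.stop = offset + hi

    def __len__(self):
        return self.stop - self.start

    def __str__(self):
        # The copy happens only when a real string is requested.
        return self.base[self.start:self.stop]


s = "hello, world"
a = StringView(s, 7)       # window onto "world"
b = StringView(a, 1, -1)   # window onto "orl", built from view a
```

Here b.base is the string s itself, not the view a, so dropping the last
reference to a frees it even while b is still alive.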

Optimizations like 'adding properly ordered adjacent string views
returns a new view', 'views over fewer than X bytes are string copies',
etc., could be added later with (hopefully) little trouble.
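
The 'adjacent views' optimization might look roughly like the sketch
below.  It is self-contained and purely illustrative: the StringView
class, its partition method, and the fallback to a plain string copy
are assumptions made for the example, not a worked-out design.

```python
class StringView:
    """Hypothetical view: a (base, start, stop) window onto a str."""

    def __init__(self, base, start=0, stop=None):
        self.base = base
        self.start, self.stop, _ = slice(start, stop).indices(len(base))

    def __str__(self):
        return self.base[self.start:self.stop]

    def __add__(self, other):
        if (isinstance(other, StringView)
                and other.base is self.base
                and self.stop == other.start):
            # Adjacent and in order: widen the window, copy nothing.
            return StringView(self.base, self.start, other.stop)
        # Anything else falls back to an ordinary string copy.
        return str(self) + str(other)

    def partition(self, sep):
        """Like str.partition, but returns three views on one base."""
        i = self.base.find(sep, self.start, self.stop)
        if i < 0:
            empty = StringView(self.base, self.stop, self.stop)
            return self, empty, empty
        j = i + len(sep)
        return (StringView(self.base, self.start, i),
                StringView(self.base, i, j),
                StringView(self.base, j, self.stop))


data = StringView("a\nb\nc")
chunk, sep, rest = data.partition("\n")
joined = chunk + sep       # adjacent, so still a view on the same base
copied = rest + chunk      # out of order, so a plain str copy
```

Only the out-of-order concatenation allocates a new string; chunk + sep
just produces a wider window on the original.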

> This works for strings (which are immutable) but these semantics are
> unacceptable for mutable objects -- another reason to doubt that it
> makes sense to generalize the idea of views to all sequences, or to
> involve a change to the slice object in the design.

I think the whole slice object thing is complete nonsense.

On the other hand, I think that just as buffers verify the object that
they are buffering every time they are accessed, views on mutable byte
strings, arrays, and mmaps could do the same.  Once verified, they can
generally be used the same way, but it may take some discussion as to
whether certain operations are allowed, and/or what their semantics
are.  Things like:
    view = arrayview(arr, 1, -1)
    del view[1:-1]
A convenient semantic (from the Python side of things) is to do as
buffer does now and make the views read-only.
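
As a rough illustration of 'verify on every access, read-only', here is
a pure-Python sketch.  ReadOnlyArrayView and its _verify method are
hypothetical names, and the check shown (the array must still be long
enough to cover the window) is only one assumption about what
verification could mean.

```python
from array import array


class ReadOnlyArrayView:
    """Hypothetical read-only window onto an array.array."""

    def __init__(self, arr, start=0, stop=None):
        self.arr = arr
        self.start, self.stop, _ = slice(start, stop).indices(len(arr))
        self.stop = max(self.stop, self.start)

    def _verify(self):
        # Re-check the underlying object on each access, since it may
        # have been resized after the view was created.
        if self.stop > len(self.arr):
            raise ValueError("underlying array shrank beneath the view")

    def __len__(self):
        self._verify()
        return self.stop - self.start

    def __getitem__(self, i):
        self._verify()
        if not isinstance(i, int):
            raise TypeError("this sketch supports integer indexes only")
        n = self.stop - self.start
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("view index out of range")
        return self.arr[self.start + i]


arr = array("i", [10, 20, 30, 40, 50])
view = ReadOnlyArrayView(arr, 1, -1)   # window onto 20, 30, 40
first, last = view[0], view[-1]

try:
    del view[1:-1]     # no __delitem__ is defined, so this is refused
except TypeError:
    deletion_allowed = False
```

Because the class defines neither __setitem__ nor __delitem__, the
awkward question of what del view[1:-1] should do to arr never arises.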

I'm also not terribly convinced about general sequence views, but for
objects in which buffer(obj) returns something useful, I can see
specialized views for them making at least some sense.  I am cautious
about pushing for all of them because implementing views for all would
be a pain. Choosing one (like bytes) would take some effort, but could
easily be pushed back to 3.1 or 3.2 and be done by someone who really
wants them.

 - Josiah

[1] When I say "tree persistence", I mean those cases like a -> b -> c,
where view b persists because view a persists, even though nothing else
references b.  Making both views a and b reference c directly allows b
to be freed when it is no longer used.