[Python-3000] Making more effective use of slice objects in Py3k

Mon Aug 28 03:58:52 CEST 2006

On 8/27/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 8/27/06, Guido van Rossum <guido at python.org> wrote:
> > On 8/26/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> > > For example, you wanted to keep the rarely used optional arguments to
> > > find because of efficiency.
>
> > I don't believe they are rarely used. They are (currently) essential
> > for code that searches a long string for a short substring repeatedly.
> > If you believe that is a rare use case, why bother coming up with a
> > whole new language feature to support it?
>
> I believe that a fair amount of code already does the copying inline;
> suppporting it in the runtime means that copying code becomes more
> efficient, and shortcutting code becomes less unusual.

We're not making progress here. Your beliefs against my beliefs isn't
helpful. Do you have proof that there is code out that that's
inefficient and for which it would *matter* if it became faster?

> > > If slices were less eager at copying, this could be
> > > rewritten as
>
> > >     view=slice(start, stop, 1)
> > >     view(s).find(prefix)
>
> > Now you're postulating that calling a slice will take a slice of an
> > object?
>
> Yes.

I'd rather see an explicit method call. Using "call" as an operation
means no other operation can use the same syntax (on the same objects,
of course); you have to be very sure that there won't be another use
of "call" that would be more useful.

> > Any object? And how is that supposed to work for arbitrary
> > objects?
>
> For non-iterables, it will raise a TypeError.

Duh. I meant for other iterables, like tuples and lists. I'm asking if
you expect that asking for a view on a previously unknown sequence
should return a view on that sequence that behaves just like the
underlying object, and how you are thinking of pulling off that feat.
My claim is that you can't. You need full cooperation of the
underlying object to support views. You could attempt to automatically
provide wrappers for all methods, but since you don't know which of
the parameters or return values represent indices and which don't, you
can't do anything useful. Suppose I have a list [1, 2, 3, 1, 2, 3].
Suppose you don't have built-in knowledge of a list (otherwise I'll
substitute some other object that you don't have built-in knowledge
of). Now suppose you have a view v on the last half of that list, and
you ask for v.count(1). This of course should return 1. But how to do
this unless you how the count() method is implemented on the
underlying object type?

> > I would think that it ought to be a method on the string
> > object
>
> Restricting it to a few types including string might make sense.

Yes please. Without that your proposal is dead in the water.

(With it likely too, but for different reasons.)

> > Also you're postulating that the slice object somehow has the
> > same methods as the thing it slices?
>
> Rather, the value returned by calling the slice on a specific string.
> (I tend to think of this as a "slice of" the string, but as you've
> pointed out, "slice object" technically refers to the object
> specifying how/where to cut.)

And remember, calling buffer() on a unicode object is not a useful
operation unless you're interesting in the underlying bytes.

> > How are you expecting to implement that?
>
> I had expected to implement it as a (string) view, which is why I
> don't quite understand the distinction Nick and Josiah are making.

Well maybe you don't quite understand your own proposal either. :-)

> > But this assumes that string views are 99.999% indiscernible from
> > regular strings
>
> Yes; instead of assuming that a string's data starts n bytes after the
> object's own pointer, it will instead be located at a (possibly zero)
> offset.  No visible difference to python code; the difference between
> -> and . for C code.  (And this indirection is already used by unicode
> objects.)

Only because their original draft design had a kind of views. I expect
they had good reasons to rip out that part...

> > That will never fly. NumPy may get away with non-copying slices, but
> > for built-in objects this would be too big of a departure of current
> > practice. (If you don't stop about this I'll have to add it to PEP
> > 3099. :-)
>
> That's unfortunate, but if you're sure, maybe it should go in PEP 3099.

Ask any Python developer. Slices of mutable objects make copies except in NumPy.

> > > Yes, this does risk keeping all of data alive because one chunk was
> > > saved.  This might be a reasonable tradeoff to avoid the copying.  If
> > > not, perhaps the gc system could be augmented to shrink bloated views
> > > during idle moments.
>
> > Keep dreaming on. it really seems you have no clue about
> > implementation issues; you just keep postulating random solutions
> > whenever you're faced with an objection.
>
> I had thought the problem was more about whether or not it was a good
> idea; the tradeoff might be OK, or at least less bad than the
> complication of fixing it.

It's only a good idea if it works. Details matter.

> As one implementation of fixing it, in today's garbage collection,
> http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=46244&view=markup
> function collect, surviving objects are moved to the next generation
> with gc_list_merge(young, old); before merging, the young list could
> be traversed, and any object whose type has a __condense__ method
> would get it called.  The strview type's __condense__ method would be
> the C equivalent of
>
>     if len(self.src) <= 200:
>         return  # Src object too small to be worth recovering
>     if (len(self) * refcounts(src)) >= len(self.src):
>         return  # Src object used enough to be worth keeping
>     self.src=str(src) # Create a new data buffer, with no extra chars.
>
> (Sent in python because the commented C was several times as long,
> even before checking with compiler.)  As to whether a __condense
> method is a good idea, whether it should really be tied that closely
> to garbage collection, whether it should be limited to C
> implementations ... that I'm not so sure of.

It's up to you to show that this doesn't completely kill performance.
It would take a lot of measurements.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)