PEP 203 Augmented Assignment

No, this isn't the first full draft of the PEP :P While writing up the proposed VM changes, I came to the conclusion my patch was all wrong, and I would like some input in case I forgot something. The current patch introduces the following bytecodes:

    GETSET_SUBSCR
    GETSET_SLICE
    GETSET_SLICE+1
    GETSET_SLICE+2
    GETSET_SLICE+3
    GETSET_ATTR
    GETSET_NAME
    GETSET_FAST
    GETSET_GLOBAL

Each bytecode does the loading of the object, the actual operation it is supposed to perform, and the storing in the original place. All opcodes get an argument to specify which operation should be performed, which is also where the need for 2-argument opcodes comes from: the NAME/FAST/GLOBAL opcodes need both a name-index and an opcode argument. My excuse for this design is similar to Guido's excuse for the current Python parser: I was young, I was foolish, and I didn't know how to reuse things, only how to *add* things ;) I was also intimidated by the whole idea of a stack, having never really programmed in a low-level programming language before. I now realize only 11 new opcodes are necessary, and it can even be reduced to 1, by splitting the operation into separate LOAD, <operation>, and STORE instructions, with a few ROT_TWO and ROT_THREE's mixed in. <operation> could be a new bytecode for each operation (11 in total) or a single 'AUGOP' bytecode with an argument to specify the operation. I think it's less elegant from the bytecode point of view, but it does require a lot fewer changes in ceval/compile. There is only one significant difference between the current version and this, and that's behaviour in the face of threads. I had originally seen augmented assignment as a way to increment a counter 'thread-safe': the load and store operation are done in one bytecode, and can thus not be interrupted by a thread switch. But after thinking about it, it doesn't make that much sense. It only works for builtin types, not for classes defining a __op_ab__, and possibly not for extension types. 
In fact, there might be some possibility for a thread switch even in builtin types, though I have never looked at that. Opinions ? Is the semi-reliability in the face of threads a good idea ? Did I miss anything important about the current implementation ? Should I forget about this idea and keep the patch as it is (and go back to my PEP) or should I rewrite the patch and then rewrite the PEP ? This alternate implementation would eradicate my next-to-last objection to including the patch, I think, but we'll discuss the last one, the naming of the __hooks__, after I finish the bloody PEP ;-) -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

I've been thinking a bit about this myself, and I think it's a good idea to show the bytecodes generated for the various cases, just to make sure that we understand the semantics. I'll just use +=, but the same list applies to all 11 binary operators (**, *, /, %, +, -, |, &, ^, <<, >>). I'm making up opcodes -- the different variants of LOAD and STORE don't matter. On the right I'm displaying the stack contents after execution of the opcode (push appends to the end). I'm writing 'result' to indicate the result of the += operator.

a += b

    LOAD a          [a]
    LOAD b          [a, b]
    AUGADD          [result]
    STORE a         []

a.attr += b

    LOAD a          [a]
    DUP             [a, a]
    GETATTR 'attr'  [a, a.attr]
    LOAD b          [a, a.attr, b]
    AUGADD          [a, result]
    SETATTR 'attr'  []

a[i] += b

    LOAD a          [a]
    DUP             [a, a]
    LOAD i          [a, a, i]
    DUP             [a, a, i, i]
    ROT3            [a, i, a, i]
    GETITEM         [a, i, a[i]]
    LOAD b          [a, i, a[i], b]
    AUGADD          [a, i, result]
    SETITEM         []

I'm leaving the slice variant out; I'll get to that in a minute. If the right hand side is more complicated than in the example, the line 'LOAD b' is simply replaced by code that calculates the value of the expression; this always ends up eventually pushing a single value onto the stack, leaving anything below it alone, just like the 'LOAD b' opcode. Ditto for the index expression ('i' in the example). Similarly, for the cases a.attr and a[i], if instead of a there's a more complicated expression (e.g. sys.modules[name].foo().bar += 1) the initial 'LOAD a' is replaced by code that loads the object on the stack -- in this example, sys.modules[name].foo(). Only the final selector (".attr", "[i]") is special. (Terminology: a selector is something that addresses a possibly writable component of a container object, e.g. a[i] or a.attr; a[i:j] is also a selector. f() could be seen as a selector but cannot be used on the left hand side of an assignment.)

There are two more forms of potential interest. First, what should happen to a tuple assignment?

    a, b, c += x

(Which is exactly the same as "(a, b, c) += x".) 
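As a sanity check, the attribute trace above can be hand-run with an explicit stack. This is a toy sketch: Obj, run_attr_augadd, and the opcode names in the comments are the made-up opcodes from the trace, not real CPython bytecodes.

```python
class Obj:
    pass

def run_attr_augadd(a, b):
    """Execute the 'a.attr += b' opcode trace with an explicit stack."""
    stack = []
    stack.append(a)                              # LOAD a          [a]
    stack.append(stack[-1])                      # DUP             [a, a]
    stack.append(getattr(stack.pop(), 'attr'))   # GETATTR 'attr'  [a, a.attr]
    stack.append(b)                              # LOAD b          [a, a.attr, b]
    rhs = stack.pop()
    lhs = stack.pop()
    stack.append(lhs + rhs)                      # AUGADD          [a, result]
    result = stack.pop()
    setattr(stack.pop(), 'attr', result)         # SETATTR 'attr'  []
    assert stack == []                           # stack ends empty, as in the trace

a = Obj()
a.attr = 40
run_attr_augadd(a, 2)
print(a.attr)  # 42
```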
I think this should be a compile-time error. If t and u are tuples, "t += u" means the same as "t = t+u"; but if we apply this rule we would get "(a, b, c) = (a, b, c) + u", which is only valid if u is an empty tuple (or a class instance with unusual coercion behavior). But when u is empty, it's not useful (nothing changes), so it's unlikely that someone would have this intention. More likely, the programmer was hoping that this would be the same as "a+=x; b+=x; c+=x" -- but that's the same misconception as expecting "a, b, c = 0" to mean "a = b = c = 0" so we don't need to cater to it.

Second, what should happen to a slice assignment? The basic slice form is:

    a[i:j] += b

but there are others: Python's slice syntax allows an arbitrary comma-separated sequence of single indexes, regular slices (lo:hi), extended slices (lo:hi:step), and "ellipsis" tokens ('...') between the square brackets. Here's an extreme example:

    a[:, ..., ::, 0:10:2, :10:, 1, 2:, ::-1] += 1

First, let me indicate what code is generated for such a form when it's used in a regular expression or assignment. Any such form *except* basic slices (a[i:j], a[:j], a[i:], and a[:]) is translated into code that uses GETITEM or SETITEM with an index that is formed from a simple translation of the actual expressions.

- If there are two or more comma-separated values, the index is a tuple of the translations of the individual values.
- An ellipsis ("...") is translated into the builtin object Ellipsis.
- A non-slice is translated into itself.
- A slice is translated into a "slice object", this is a built-in object representing the lower and upper bounds and step. There is also a built-in function, slice(), taking 1-3 arguments in the same way as range(). Thus:
  - "lo:hi" is equivalent to slice(lo, hi);
  - "lo:hi:step" is equivalent to slice(lo, hi, step);
  - omitted values are replaced with None, so e.g. ":hi" is equivalent to slice(None, hi). 
So, the extreme example above means exactly the same as a[x], where x is a tuple with the following items:

    slice(None, None)
    Ellipsis
    slice(None, None, None)
    slice(0, 10, 2)
    slice(None, 10, None)
    1
    slice(2, None)
    slice(None, None, -1)

Why all this elaboration? Because I want to use this to give a standardized semantics even to basic slices. If a[lo:hi:step] is translated the same as a[slice(lo, hi, step)], then we can give a[lo:hi] the same translation as a[slice(lo, hi)], and thus the slice case for augmented assignment can generate the same code (apart from the slice-building operations) as the index case. Thus (writing 'slice' to indicate the slice object built from i and j):

a[i:j] += b

    LOAD a          [a]
    DUP             [a, a]
    LOAD i          [a, a, i]                  **
    LOAD j          [a, a, i, j]               **
    BUILD_SLICE 2   [a, a, slice]              **
    DUP             [a, a, slice, slice]
    ROT3            [a, slice, a, slice]
    GETITEM         [a, slice, a[slice]]
    LOAD b          [a, slice, a[slice], b]
    AUGADD          [a, slice, result]
    SETITEM         []

Comparing this to the code for "a[i] += b", only the three lines marked with ** are really different, and all that these do is to push a single object representing the slice onto the stack. I won't show the code for "a[i:j:k] += b" or for "a[i:j, k:l]", but it's clear how these should be done.

Postscript (unrelated to augmented assignment)

It would be nice if the SLICE bytecodes were removed altogether and instead slice() objects would be created for all slices, even basic ones. (I believe this was proposed in this list at some point.) The original SLICE opcodes were introduced in ancient times, when basic slices were the only accepted slice syntax. This would mean that all objects supporting slices would have to support the *mapping* interface instead of (or in addition to) the sequence interface; the mapping interface would have to determine whether a getitem / setitem call was really a slice call and do the right thing. 
In particular, for backward compatibility, class instances could have a mapping interface whose internal getitem function checks if the argument is a slice object whose step is None and whose lo and hi are None or integers; then if a __getslice__ method exists, it could call that, in all other cases it could call __getitem__. None of the other built-in objects that support slices would have to be changed; the GETITEM opcode could notice that an object supports the sequence interface but not the mapping interface, and then look for a basic slice or an integer and do the right thing. Problems with this are mostly related to the existing C API for slices, like PySequence_GetSlice(), which propagate the various restrictions. Too-much-rambling-reduces-the-chance-of-useful-responses-ly, --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
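As a historical footnote, this translation is observable directly in any modern Python, where the postscript's wish came true (__getslice__ is gone and even basic slices arrive as slice objects). A small recording class, sketched here as an assumption-free check, shows the extreme example arriving as a tuple of slice objects; the one difference from the text is that omitted parts always become None, so "lo:hi" arrives as slice(lo, hi, None) rather than a two-argument slice.

```python
# Recorder is a hypothetical helper: it just captures whatever index
# expression the interpreter passes to __getitem__.
class Recorder:
    def __getitem__(self, key):
        self.key = key
        return 0

a = Recorder()
a[:, ..., ::, 0:10:2, :10:, 1, 2:, ::-1]
print(a.key)
# (slice(None, None, None), Ellipsis, slice(None, None, None),
#  slice(0, 10, 2), slice(None, 10, None), 1, slice(2, None, None),
#  slice(None, None, -1))
```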

I should add something about the assumed pseudo thread-safety of a+=b. I think this assumption is bogus, since we have to load a, do some stuff, and then store a, and we can't guarantee that the stuff we do is atomic -- in fact we *know* it's not if it involves a user-defined method. Simply put: a += 1 IS NOT ATOMIC! Note that C doesn't guarantee that a++ is atomic either, even if a is declared volatile. (I believe there's an atomic data type, but I don't think it guarantees atomic ++. That would be very expensive because the VM cache would have to be flushed on multiprocessor machines. The only docs I found are at http://www.gnu.org/manual/glibc-2.0.6/html_node/libc_365.html.) --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
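Guido's point can be confirmed with the dis module on any modern CPython (a quick check; exact opcode names vary across versions, so this only verifies that 'a' is touched by more than one instruction, i.e. a separate load and store with room for a thread switch in between):

```python
import dis

def bump():
    global a
    a += 1

# Collect every instruction that references the name 'a'. There must be
# at least two: one load and one store -- so "a += 1" is not one opcode.
touches = [ins for ins in dis.get_instructions(bump) if ins.argval == 'a']
print(len(touches) >= 2)  # True
```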

Guido van Rossum wrote:
If that's the case, then why do we need augmented assignments at all ? (I understood the atomicness being the main argument for introducing augmented assigns.) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Thu, Jul 27, 2000 at 10:02:49AM +0200, M.-A. Lemburg wrote:
Guido van Rossum wrote:
I should add something about the assumed pseudo thread-safety of a+=b.
Simply put:
a += 1 IS NOT ATOMIC!
Nah, that was just something that emerged from my implementation, and only stood up in the face of simple builtin objects. I probably shouldn't have mentioned this ;-P I can't say anything about Guido's reasons for augmented assignment, but mine are ease-of-use, ease of use, and the ease of usage. And-python-is-all-about-ease-of-use-ly y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Huh? Where did you get that idea? Can you point to a message where this was indicated? The reason for introducing it is that it's been the single-most requested feature ever since Python's inception. I think the reason it's been requested so much is that it's a waste of typing skills as well as (and more importantly) unnecessary extra effort to deduce what's going on in code like

    a.b.f().x[i] = a.b.f().x[i] + 1

Another important reason is that in cases like

    a[long_and_expensive_call()] = a[long_and_expensive_call()] + 1

you want to avoid doing the long_and_expensive_call() twice. Introducing a temp variable reduces the readability of the code. Note that my proposed semantics guarantee that in

    a[long_and_expensive_call()] += 1

the index is computed only once -- but this is not the same as atomicity (which has to do with other threads, signal handlers and similar asynchronous things). --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
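These evaluate-once semantics are exactly what shipped. A small check (the dict subclass and the call log are illustrative scaffolding, not anything from the thread): the index expression runs once, the item is read once and written once.

```python
calls = []

def long_and_expensive_call():
    calls.append('f')
    return 0

class D(dict):
    # Log each read and write so we can count them.
    def __getitem__(self, k):
        calls.append('get')
        return super().__getitem__(k)
    def __setitem__(self, k, v):
        calls.append('set')
        super().__setitem__(k, v)

a = D({0: 41})
a[long_and_expensive_call()] += 1
print(calls)  # ['f', 'get', 'set'] -- one index eval, one read, one write
```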

Guido van Rossum wrote:
It was advertised this way by Thomas: http://www.python.org/pipermail/python-list/2000-June/060680.html And since assignments in Python are thread safe, I thought it would only be natural for augmented assigns to also be thread safe which would have been a real advantage, since a=a+1 is not thread safe. All other uses simply boil down to reducing typing efforts, IMHO. (But that shouldn't keep you from adding it to the language ;-) Some comments:
This can be written as:

    result = a.b.f().x
    result[i] = result[i] + 1
Another important reason is that in cases like
a[long_and_expensive_call()] = a[long_and_expensive_call()] + 1
Ditto for this one:

    i = long_and_expensive_call()
    a[i] = a[i] + 1
you want to avoid doing the long_and_expensive_call() twice. Introducing a temp variable reduces the readability of the code.
Not really... in fact if the name makes it clear what it refers to, it can augment the readability. Besides, a.b.f().x[i] isn't good style to begin with ;-)
Nevermind... I just wanted to hint at the added value of making these assignments atomic operations and not just an optimization of a = a + 1. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

>> Another important reason is that in cases like
>>
>>    a[long_and_expensive_call()] = a[long_and_expensive_call()] + 1

M-A> Ditto for this one:
M-A>
M-A>    i = long_and_expensive_call()
M-A>    a[i] = a[i] + 1

Only if you know that long_and_expensive_call() has no side effects!

Skip

Skip Montanaro wrote:
If it does, then you'll have to write the long version anyway, if it doesn't I don't see a problem, if you don't know then you shouldn't use the function anyway ;-)) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[Guido]
For people who don't know, or may have forgotten, threads are not a part of std C or C++: those languages say nothing whatsoever about behavior in the presence of threads. And this is still true in C99! ANSI/ISO committees work by consensus, and there are so many goofy thread models out there consensus will never be reached (keep that in mind as you ponder how PEPs should proceed <0.6 wink>).
(I believe there's an atomic data type, but I don't think it guarantees atomic ++.
You may be thinking of sig_atomic_t? That relates to a technical meaning for "atomic" in signal handlers, a meaning that has nothing to do with threads. In effect, sig_atomic_t is simply a volatile integral type small enough so that read and write each take one machine instruction (so, e.g., a 64-bit int would not serve as a suitable sig_atomic_t on a 32-bit machine, as it would have to be read and written "in pieces", and an interrupting signal could cause an interrupted read or write to use inconsistent pieces).

On Thu, Jul 27, 2000 at 12:59:15AM -0500, Guido van Rossum wrote:
I take it you'd rather see an 'AUGOP' bytecode than a 'GETSET_<type>' bytecode, that does the loading, operation and storing ? Not that I disagree, not at all, I just want to have that clear ;)
Here's what happens now:
    (a, b, c) += 1,2,3
    SyntaxError: no augmented assignment to tuple
(Isn't a reference implementation a cool thing ? ;)
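The reference implementation's behavior here is also the behavior that shipped and survives today; a quick check (the exact error message has changed over the years, so only the SyntaxError itself is asserted):

```python
# Augmented assignment to a tuple target is rejected at compile time.
try:
    compile("(a, b, c) += (1, 2, 3)", "<test>", "exec")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)  # True
```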
Second, what should happen to a slice assignment? The basic slice form is:
a[i:j] += b
but there are others: Python's slice syntax allows an arbitrary [yadah]
What's so special about slice assignment ? You yourself (through Tim <wink>) suggested the following:

    x += y

is effectively (minus order and number of evaluations)

    x = x.__add_ab__(y)

Where __add_ab__() is the Python method, if 'x' is a python class, or plain '__add__' if 'x' doesn't have an '__add_ab__', or 'y.__radd__' if x doesn't have __add__. Similarly, if 'x' is a builtin, tp_as_number->nb_inplace_add is used, or tp_as_number->nb_add, etc. (Or sq_concat, if it's a seq and the operator is add.) The __add_ab__ method, or the C equivalent, can decide for itself whether it should return 'self', thus making the operation truly in-place, or a new instance of an object. This may seem like a complicated and unexpected way to do things, but it makes it possible to transparently support __add__ and __radd__ too, because they already return new objects.

And you (through Tim) literally scolded me for trying to suggest

    x[1:10] += y

being something like

    x = x.__add_slice_ab__(y, slice(1,10))

Instead, it should become

    x[1:10] = x[1:10] + y

(Or rather, to keep the same order of evaluation:)

    tmp1 = y           # in case of a more complicated expression
    tmp2 = x[1:10]
    x[1:10] = tmp2.__add_ab__(tmp1)

The whole story about how complicated, convoluted and confusing Python's slicing is, though interesting and correct, is not connected to augmented assignment ;)
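The hooks discussed here eventually shipped under the name __iadd__ (nb_inplace_add at the C level), with exactly this fallback chain. A type decides for itself whether to mutate in place or return a fresh object -- a short illustration (both classes are made up for the example):

```python
class InPlace:
    def __init__(self, v):
        self.v = v
    def __iadd__(self, other):
        self.v += other
        return self              # mutate and hand back the same object

class Fresh:
    def __init__(self, v):
        self.v = v
    def __add__(self, other):    # no __iadd__: += falls back to __add__
        return Fresh(self.v + other)

x = InPlace(1); x0 = x; x += 1
y = Fresh(1);   y0 = y; y += 1
print(x is x0, y is y0)  # True False
```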
a[:, ..., ::, 0:10:2, :10:, 1, 2:, ::-1] += 1
Becomes

    tmp1 = a[:, ..., ::, 0:10:2, :10:, 1, 2:, ::-1]
    a[:, ..., ::, 0:10:2, :10:, 1, 2:, ::-1] = tmp1.__add_ab__(1)
Agreed, however! You can't just make all slices use a sliceobject, for two reasons: a C extension type can use both the sequence and the mapping interface, and might not expect basic slices to be turned into a slice object. For instance, its mapping interface may raise a TypeError, "Slices not supported", when it encounters a slice. Choosing mapping over sequence methods may break all that code. The same goes for Python classes.

The other reason is usability. The current __get_slice__(self, start, end) syntax is very convenient for nearly all uses of slices. Possibly because the builtin types don't handle extended slices currently, and possibly because the slice-object way is not generally known. (Just recently, there was someone asking on python-list on how to use slices. I'm not sure if he got a proper answer, but I've never seen such questions on the subject of basic slices!)

If we do get a new __get_slice__ that handles all slices, I suggest doing it like this:

    class X:
        def __get_newslice__(self, (start, end, step)):

where the type of start, end and step is not defined. This can be done with slice objects and sequence-unpacking, of course, but it makes it a lot more usable.

Back-to-my-meetings-ly y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Yes, mostly because there are so many variants based on the kind of loading and storing: for each load-type operator you need 11 GETSET operators, and there are many different load operators: local, global, unspecified (== local by dict instead), by attribute, by index, or by slice... I am still neutral on the choice between a single AUGOP with an argument that takes an argument specifying the opcode, and 11 new operators: AUGASS_ADD, ... (Actually, the latter seem the most logical given that we also have BINARY_ADD, ...)
Very nice. Don't touch that part.
The only special thing about slice assignment is that there are 12 extra opcodes to deal with slices in the non-augmented-assignment case. I was making a case for avoiding this explosion, but somehow I got lost in my own train of thought. :(
Yes. I should have realized earlier on that my pseudo code was just spelling out the necessary DUP etc. opcodes to avoid calculating the location of 'a' or the index(es) twice. Looking at my pseudo code for "a[i:j] += b" again, I realize that there's no reason to treat basic slices different than in other contexts -- it can be done with just DUP, ROT3 and ROT4:

a[i:j] += b

    LOAD a          [a]
    DUP             [a, a]
    LOAD i          [a, a, i]
    DUP             [a, a, i, i]
    ROT3            [a, i, a, i]
    LOAD j          [a, i, a, i, j]
    DUP             [a, i, a, i, j, j]
    ROT4            [a, i, j, a, i, j]
    GETSLICE        [a, i, j, slice]
    LOAD b          [a, i, j, slice, b]
    AUGADD          [a, i, j, result]
    SETSLICE        []

So we need a new opcode ROT4 (or ROT_FOUR to be consistent in the spelling).
That's right. The backwards compatibility issues are the most daunting.
Another alternative that tries to preserve compatibility is to call __getslice__(self, start, end) when the step is None, but __getslice__(self, start, end, step) when it isn't. Old code will raise a reasonable exception when a step is specified, while step-aware code can specify the step as a default argument value of 1 or None. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

On Thu, 27 Jul 2000, Guido van Rossum wrote:
I think quite a lot of the code I wrote in compile.c for my augmented assignment patch lives on in Thomas's patch, and I was very conservative; if I couldn't work out what something should mean, I banned it (eg. a += b += c is a syntax error too). On reflection, I think I got this right. The error messages could be improved, I suspect. [snip to a very nearly completely different topic]
Good idea. Are lists and tuples going to support seq[a:b:c] anytime soon? Patches welcome, I'd guess... Cheers, M.

On Thu, Jul 27, 2000 at 01:55:19PM +0100, Michael Hudson wrote:
Yup, that's all yours. I wouldn't have known how to ban it anyway, and I wouldn't have known where to start *anything* without your patch, Michael! Gratefully-y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Good idea. Are lists and tuples going to support seq[a:b:c] anytime soon?
It's on my wish list, but can't be done for 2.0.
Patches welcome, I'd guess...
Yes. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

On Thu, Jul 27, 2000 at 08:37:24AM -0500, Guido van Rossum wrote:
Except if you add a new type of bytecode, the 2-argument type. Which is what my implementation currently does. It necessitates just 9 new bytecodes that way: GETSET_SUBSCR, SLICE+0 through SLICE+3, NAME, FAST, GLOBAL and ATTR. The last four need 2 arguments, one for the name-index (or local vrbl number) and one for the actual operation.
Agreed. However, there isn't enough place in the current bytecode ranges for 9 consecutive non-argument opcodes. I'm also not sure how scarce a commodity opcodes are. Then again, it can also be 'undone' later.
What's so special about slice assignment ?
There's no explosion in the augmented assignment case, anyhow: either it's just 4 opcodes for handling slices, or it reuses the current opcodes.
So we need a new opcode ROT4 (or ROT_FOUR to be consistent in the spelling).
Yes. How about a ROT_ANY <x> argument opcode while we're at it ? <x> would specify the number to rotate... you could even encode *how* to rotate in that argument. (using the upper byte, imposing a sensible limit of 255 for the number of rotations ;-)
    class X:
        def __get_newslice__(self, (start, end, step)):
Yes, that's very reasonable. I would suggest making the error 'special', in this case, though. So that you don't get a TypeError about incorrect number of arguments, but rather a 'object does not support extended slices' or some such TypeError. Actually, it might be worth doing it like this:

    Try to call __getslice__ with unconverted start, stop and step
    If it fails with a TypeError, and step is not None,
        raise the above error
    Else, convert start and stop to ints like the current normal slice
        behaviour, and try __getslice__ the old way

Or is that too much magic ? I don't think it'll break anything written in Python code, but I'm not sure what to do with the C API... But it's easier to break the API than Python code, me thinks.

Break-over-godda-run-ly y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

But the two-arg opcode format slows down the opcode decoding -- and that's the most time-critical part of all of ceval.c!
There's no need for them to be consecutive! (The BINARY_* family isn't even consecutive!) I say, just some open ranges, like 54-59 and 73-79.
No, let's not overgeneralize.
Yes.
No, trying to call something and retrying if it fails is a bad idea: the code might fail for a very different reason, and retrying it might mask the bug, or execute a side effect twice... There's no reason why we can't check whether the step is None; it's already impossible to find out whether a particular slice was "lo:hi:None" or "lo:hi:" (once extended slice mode is used). --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

On Thu, Jul 27, 2000 at 09:22:48AM -0500, Guido van Rossum wrote: [about INPLACE_ADD/SUBTRACT/ETC]
[and about ROT_ANY]
No, let's not overgeneralize.
Roger dodger, wilco dilco, milli vanilli.
Yeah, I realized that too, while listening one of my colleagues trying to make some point or other. Hrm... So backwards compatibility is out ? I'm not sure how much Python code uses slice-objects, but I would consider it a (personal ;) failure if slicing has to be *broken* before it can be fixed. If we can't figure out whether a function supports the 'new syntax' reliably, I don't see how we can transform the error message either. I'll bet inspecting the __getitem__ method to find out whether it supports the one or the other is way too much of a runtime penalty to suffer at each index. So the only workable way to support all combinations of __getitem__ and __getslice__ is by adding a new builtin, __getitems__ ? (too confusing a name... __getanything__ ? :P) Feeble-(after-all-those-meetings)-ly y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Huh? I thought I proposed a b/w compat solution:

    IF there is a __getslice__ method:
        IF the slice step is None:
            call __getslice__(lo, hi)
        ELSE: # the slice step is not None
            call __getslice__(lo, hi, step)

What's wrong with that?
No, we don't deal with __getitem__; if *it* doesn't support a tuple containing slice objects, tough luck. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
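Guido's compatibility rule can be sketched in pure Python. This is only an illustration of the proposal, not anything that shipped: getslice_compat stands in for the dispatch the *interpreter* would do, and the three-vs-four-argument __getslice__ is the proposed convention (Python 3 ultimately dropped __getslice__ and passes slice objects to __getitem__ unconditionally).

```python
def getslice_compat(obj, lo, hi, step):
    """Hypothetical interpreter dispatch for obj[lo:hi:step]."""
    getslice = getattr(type(obj), '__getslice__', None)
    if getslice is not None:
        if step is None:
            return getslice(obj, lo, hi)    # old-style two-index call
        return getslice(obj, lo, hi, step)  # step-aware class gets 4 args
    # No __getslice__ at all: fall back to __getitem__ with a slice object.
    return type(obj).__getitem__(obj, slice(lo, hi, step))

class Old:
    def __getslice__(self, lo, hi):
        return ('old', lo, hi)

class New:
    def __getitem__(self, s):
        return ('item', s.start, s.stop, s.step)

print(getslice_compat(Old(), 1, 5, None))  # ('old', 1, 5)
print(getslice_compat(New(), 1, 5, 2))     # ('item', 1, 5, 2)
```

Note that an Old-style class given a non-None step would get a TypeError about the argument count, which is exactly the "reasonable exception" the proposal accepts for old code.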

On Thu, Jul 27, 2000 at 12:03:39PM -0500, Guido van Rossum wrote:
Huh? I thought I proposed a b/w compat solution:
What's wrong with that?
Well, what happens if __getslice__ and __getitem__ both exist, and __getitem__ is used to handle extended slices, but __getslice__ isn't (yet)? Currently, if it's a single item, __getitem__ is called. If it's a basic slice, __getslice__ is called. If it's an extended slice, __getitem__ is called. In the above picture, __getslice__ would be called instead, with the 'wrong' number of arguments, and the use of extended slices would break.
No, we don't deal with __getitem__; if *it* doesn't support a tuple containing slice objects, tough luck.
Sorry, that was my feebleness popping up. I was talking about __getslice__. How do we find out if __getslice__ accepts 4 arguments, rather than 3 ? -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

On Thu, Jul 27, 2000 at 06:13:16PM +0200, Thomas Wouters wrote:
Sorry, that was my feebleness popping up. I was talking about __getslice__. How do we find out if __getslice__ accepts 4 arguments, rather than 3 ?
Actually, it wasn't the feebleness as much as the two different things you seem to be wanting, Guido ;) Previously (in another thread, and perhaps in the start of this thread as well) you argued that an index should always call __getitem__, with a slice object if it's a slice of some kind, *even a basic one*. So that we lose __getslice__ and rather teach people to use __getitem__. So now you want to *extend* the 'special case' to extended, 3-argument slices, but *not* Ellipses and tupled-slices (which aren't really tuples, because you can do (1, 2:5, ...)) -- or am I reading this wrong ? :) Don't worry, Guido, it isn't just you, I failed to grasp what those managers were trying to tell me today, too... I may just be too dense, today. -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

That would be nice, but it would be b/w incompatible, so now we're trying (again) to figure out how to still support __getslice__. My proposal is to try and be b/w compatible if __getslice__ exists. See my previous mail.
Not sure what you mean by that. (1, 2:5, ...) by itself is illegal syntax; the special slice syntax can only occur directly in an indexing context, e.g. a[1, 2:5, ...].
Don't worry, Guido, it isn't just you, I failed to grasp what those managers were trying to tell me today, too... I may just be too dense, today.
I hope they weren't telling you to stop working on Python! You seem made for this work... --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

On Thu, Jul 27, 2000 at 12:30:51PM -0500, Guido van Rossum wrote:
Yes. So what we're talking about is a better way to do indexing altogether, all types in one shot. ('indexes', 'slices' and 'collections' ? I'm not sure how to name the first and the last one, if not that.) And not just indexing, but setting and deleting at the same time, right ? So the most preferred way is to pass a single argument which is either:

- A single object (directly what was passed in) for a single index.
- A slice-object if any kind of single-slice was passed in, which holds the zero-to-three objects that the slice was 'created' with.
- A tuple of zero or more of either of the above two.

While I'm writing this down, I'm not sure if it's such a good idea. Isn't this placing a tad too much into one function ? It might require some severe logic to support this all, especially if you give 'special' meanings to some indexes. And we're inserting a new catch-all method -- a set of them, actually: get, set and del -- and that while Paul is trying to solve the other catch-all Python has, __getattr__/__setattr__. Then again, the use of this method is quite different from __getattr__/__setattr__... Most classes will commit to either a sequence-type or a mapping-type, and not even bother with extended slices or tuple-indexes.

And let's not forget the convenience of writing those methods: __getitem__ is intuitive, both in name and semantics. So is __getslice__. The passing of a slice object to __getitem__ is a tad less intuitive, and a tuple of index/slice objects even less. I'm tempted to suggest a single change: when __getslice__ is not defined, pass a slice object (with no step, as if the slice had a trailing ':') to __getitem__, and document it properly. (and make builtin objects accept sliceobjects too !) Perhaps try to slowly deprecate __getslice__. Give plenty of examples of using __getitem__ and slice objects in the standard documentation. 
Also, I think it makes sense to make slice objects indexable, so that you can do:

    start, end, step = sliceobj

instead of the less intuitive

    start, end, step = sliceobj.start, sliceobj.end, sliceobj.step

But I've never used slice objects myself, so I can't really say whether it's a good idea. I suspect this is all for 2.1 or later, though. (As for what ?!ng suggested, the variable-args __hooks__, I don't see a difference between __getitem__(self, *val) and __getitem__(self, val) where val is a tuple. The same reasoning stands.)
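Another footnote from the future: slice objects never did become unpackable, the attributes ended up named start/stop/step (not "end"), and the later-added slice.indices() method covers the common need of normalizing a slice against a sequence length. A quick check:

```python
s = slice(1, 10, 2)
print(s.start, s.stop, s.step)  # 1 10 2

# Slices are not iterable, so the proposed unpacking raises TypeError.
try:
    start, end, step = s
    unpackable = True
except TypeError:
    unpackable = False
print(unpackable)  # False

# indices(length) resolves None/negative parts against a given length.
print(slice(None, None, -1).indices(5))  # (4, -1, -1)
```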
Exactly my point ;P Ferget it, not important. What I was trying to say was that: x[1, 2, 4:10:2] isn't the same as x[(1, 2, 4:10:2)] So the first isn't really indexing with a tuple -- because you can't do the 2nd. But the result as passed to __getitem__ *is* a tuple, so forget what I said ;)
Don't worry, Guido, it isn't just you, I failed to grasp what those managers were trying to tell me today, too... I may just be too dense, today.
I hope they weren't telling you to stop working on Python! You seem made for this work...
Heh, nono, they don't have any say over what I do with Python. I'd force them to join the Python Consortium before I'd let them have any say over that ;) I don't do any *work* with Python, other than waiting for Barry to bring Mailman 2.0 out of beta so we can start a 'product line' on that, instead of Majordomo. It has nothing to do with Python but everything with my colleagues ;-) In fact, the only people who even know what Python is are my direct colleagues, my co-System-administrators, one of which is my direct boss. I didn't even see those people today, because half the meetings today and yesterday were with a bunch of product developers and the senior VP of technology (that's his translated title), and the other half was with the workers council, the management team and the CEO (sort of) of the company. And besides, even if they did try to tell me to stop with Python, it would go like the first episode of Red Dwarf, where Lister comes out of stasis and Holly, the shipboard computer, tries to tell him everyone's dead and he's been in stasis for 3 million years. "So where is everyone, Hol ?" "They're dead, Dave." "Who, the captain ?" "Everyone, Dave." "What, Petersen ?" "Yes, Dave. Everyone. Everyone is dead." "Not Kochansky!" "Yes, Dave. Everyone is dead. Dead is everyone. Is everyone dead. Dead everyone is." "Wait. What are you trying to tell me here, Hol ?" "You have to quit with Python, even in your free time." "What are you trying to tell me ?" ... And if that failed, I'd quit. I don't work here to get dictated how to do my work, and I certainly don't work here to get dictated how to live my life *outside* of work. I realize that's how PythonLabs operates<big wink> but it's not how we run things in the old country, Guido ! :) No, all work I did and am doing on and with Python was hobby work. Which is just fine with me, for the moment, because it means I can slow down the pace whenever I want to. (Not that I've had that urge yet.) 
Can't-imagine-working-full-time-on-or-with-Python---two-week-bliss!- -seven-week-burnout-ly y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Yes.
(Make that a tuple of two or more!) Indexing like a[] is illegal, and indexing like a[i] passes a non-tuple.
Actually, __getitem__ *already* has to deal with slice objects -- the only case it doesn't get currently is the step-less single slice, which gets passed to __getslice__ -- and dies if it's not there.
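In today's Python the fallback being discussed here is exactly what ended up happening: __getslice__ is gone entirely and __getitem__ receives plain indices, slice objects, and tuples of them. A minimal modern-Python sketch (the `Seq` class is invented for illustration):

```python
class Seq:
    """Toy sequence whose __getitem__ handles ints, slices, and tuples."""
    def __init__(self, data):
        self.data = list(data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # a step-less s[1:5] arrives here as slice(1, 5, None)
            return Seq(self.data[key])
        if isinstance(key, tuple):
            # a multidimensional index like s[i, j] arrives as a tuple
            return tuple(self[k] for k in key)
        return self.data[key]

s = Seq(range(10))
print(s[2])           # → 2
print(s[1:5].data)    # → [1, 2, 3, 4]
print(s[1:8:2].data)  # → [1, 3, 5, 7]
print(s[0, 2])        # → (0, 2)
```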
Don't worry, that's already the case.
Great! That's *exactly* what I've been proposing.
(and make builtin objects accept sliceobjects too !)
That gets a +1 from me too, in general (there's some doubt about the usefulness of things like ``L1[lo:ho:step] = L2'').
Perhaps try to slowly deprecate __getslice__.
Yes.
Sure.
Maybe that's a good idea.
I suspect this is all for 2.1 or later, though.
Actually, the actual change (fall back on __*item__ with a slice object when __*slice__ doesn't exist) should be simple to add for 2.0. Just submit to the SF PM! --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

On Sun, Jul 30, 2000 at 12:29:58PM -0500, Guido van Rossum wrote:
(Make that a tuple of two or more!) Indexing like a[] is illegal, and indexing like a[i] passes a non-tuple.
No sir:
It should, but it doesn't, in most cases. God knows the code I wrote that had __getitem__ and __getslice__ didn't take slice objects into account, mostly because I didn't know how they worked! Not to mention slice-tuples, which I didn't know were possible until you mentioned it ;)
Don't worry, that's already the case.
I know, I was arguing for an improvement, here :)
[snip more approvals]
I suspect this is all for 2.1 or later, though.
That's what I thought. I'll see about that in between augmented assignment (which is not easy to do right, in retrospect: I almost uploaded a patch that evaluated 'x' in 'x += 1' twice ;) unless someone beats me to it, of course. -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
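The "evaluated 'x' twice" pitfall is observable from Python itself: in `a[f()] += 1`, the target subexpressions must be evaluated exactly once. A small sketch (names invented) showing the behaviour augmented assignment ended up with:

```python
calls = []

def f():
    calls.append(1)  # record each evaluation of the index expression
    return 0

a = [10]
a[f()] += 1          # load a[f()], add 1, store back -- f() runs once

print(a)             # → [11]
print(len(calls))    # → 1 (the index expression was evaluated only once)
```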

Good point. Sigh. I suppose we could inspect the __getslice__ method's argument count (it *is* available somewhere) but that seems a hack (and could still break if default arguments were used for something else). Another solution: require a class variable to indicate the class's awareness: e.g. you must define __getslice_takes_three_args__ when __getslice__(lo, hi, step) is supported, otherwise the call goes to __getitem__(slice(lo, hi, step)). This is a bit like the feature flag bits in the type struct. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

On Thu, 27 Jul 2000, Guido van Rossum wrote:
Yuck!
Here's another solution: if you're going to revamp the whole item/slice-getting interface, which is what this pretty much amounts to, use a separate interface.

Old interface:

    __getitem__(index)          # I get an ITEM.
    __getitem__(slice-object)   # I get a SLICE or an ITEM. (!)
    __getslice__(lo, hi, step)  # I get a SLICE.

New interface:

    __get__(index)              # I get an ITEM.
    __get__(slice-object)       # I get a SLICE or an ITEM. (!)

To be honest, the whole collision between slices and items makes me rather uncomfortable. I would prefer to separate them: one method for getting *items*, one method for getting *slices*. Neither of the interfaces above handles this correctly for the case of multidimensional arrays. So, how about:

    __get__(index, ...)              # I want an ITEM.
    __getslice__(slice-object, ...)  # I want a SLICE.

I think an interface like this would be much easier to program against. The return value is better defined: __get__ always returns a single item; __getslice__ always returns some kind of collection of items (usually, if not always, the same type as the original collection being sliced). This would require detection of whether any of the components of a multidimensional (i.e. tuple) index were slices, but i think the improved consistency is well worth it. __getslice__ is called if any slices are involved; __get__ is called otherwise. Here are some examples for illustration:

    a[i]          a.__get__(i)
    a[i, j]       a.__get__(i, j)
    a[i:j]        a.__getslice__(slice(i, j))
    a[i:j:k]      a.__getslice__(slice(i, j, k))
    a[i, j, k]    a.__get__(i, j, k)
    a[i, j, k:l]  a.__getslice__(i, j, slice(k, l))

I can't imagine ever wanting to define custom behaviour for __getslice__ *without* implementing __get__ or __getitem__, so the presence of a __get__ method can be used to detect whether we should use the new interface. -- ?!ng
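The detection step this proposal needs can be sketched in a few lines. Note that `__get__`/`__getslice__` here are Ping's *proposed* protocol names, not an existing interface (and `__get__` happens to collide with today's descriptor protocol, so this is purely illustrative):

```python
def dispatch(obj, key):
    """Route a subscript to the hypothetical __get__/__getslice__ pair:
    __getslice__ if any component is a slice, __get__ otherwise."""
    parts = key if isinstance(key, tuple) else (key,)
    if any(isinstance(p, slice) for p in parts):
        return obj.__getslice__(*parts)
    return obj.__get__(*parts)

class Grid:
    # Toy class implementing the proposed (hypothetical) interface.
    def __get__(self, *indices):
        return ('item', indices)
    def __getslice__(self, *parts):
        return ('slice', parts)

g = Grid()
print(dispatch(g, (1, 2)))            # → ('item', (1, 2))
print(dispatch(g, slice(1, 5)))       # → ('slice', (slice(1, 5, None),))
print(dispatch(g, (1, slice(2, 4))))  # → ('slice', (1, slice(2, 4, None)))
```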

On Thu, 27 Jul 2000, Ka-Ping Yee wrote:
Addendum: For __set__ and __setslice__, since there may be a variable number of index arguments, it seems the value to set must *precede* the indices. The full interface:

    __get__(*indices)
    __set__(value, *indices)
    __del__(*indices)
    __getslice__(*slices)
    __setslice__(value, *slices)
    __delslice__(*slices)

I prefer putting the indices last to passing in a tuple of indices for the following reasons:

1. It would be annoying and slightly slower to have to declare __get__((index,)) instead of __get__(index).

2. I want to encourage prediction of the number of indices. Since most multidimensional types are likely to have a fixed number of dimensions, it's better to have an interface look like, e.g.

    __get__(x, y, z)    # Clearly a 3-dimensional array!

instead of

    __get__(indices)    # Don't know how many dimensions.

The error is caught earlier, and the expectations of the programmer are more visible. To implement a variable number of dimensions, of course one can always write

    __get__(*indices)

but this would not be the conventional thing to write: it would make clear that there is something special going on here. -- ?!ng

Ka-Ping Yee <ping@lfw.org>:
    __get__(index, ...)              # I want an ITEM.
    __getslice__(slice-object, ...)  # I want a SLICE.
I like this (although it would be safer if the new-style __getslice__ had a new name as well).
This would require detection of whether any of the components of a multidimensional (i.e. tuple) index were slices
Doesn't the parser know this already? By the way, if you're going to make a clear separation between items and slices, then either *none* or *all* of the members of a multidimensional index should be slices, and mixing them should be an error! Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

Yes.
Huh? The way I understand this, mixing indices and slices is used all the time to reduce the dimensionality of an array. E.g. if A is a 2-dimensional array, A[0] is a 1-dim vector, and so are A[0, 1:-1] and A[:-1, 1]. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
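What the interpreter actually hands to __getitem__ for these mixed subscripts can be seen directly with a toy class (`Probe` is invented for illustration); a real array type then uses the plain indices to reduce dimensionality:

```python
class Probe:
    """Returns exactly what subscript object __getitem__ receives."""
    def __getitem__(self, key):
        return key

A = Probe()
print(A[0])        # → 0                          (plain index)
print(A[0, 1:-1])  # → (0, slice(1, -1, None))    (mixed index and slice)
print(A[:-1, 1])   # → (slice(None, -1, None), 1)
```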

The way I understand this, mixing indices and slices is used all the time to reduce the dimensionality of an array.
I wasn't really suggesting that they should be disallowed. I was trying to point out that their existence makes it hard to draw a clear distinction between indexing and slicing. If it were the case that a[i,j,...,k] was always equivalent to a[i][j]...[k] then there would be no problem -- you could consider each subscript individually as either an index or a slice. But that's not the way it is. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

On Thu, 27 Jul 2000, Guido van Rossum wrote:
I don't think it can work: what about inheritance? Think of these two cases:

    class A:
        __getslice_takes_3_args__
        def __getslice__(...):
            ...

    class B(A):
        def __getslice__(...):
            ...  # something completely different

and

    class C(A):
        def __getslice__(self, *args):
            # do something
            apply(A.__getslice__, (self,)+args)
This is a bit like the feature flag bits in the type struct.
Which are ugly as hell too.....and only work because builtin types have a flat inheritance. oh-well-python-3000-is-just-around-the-corner-ly y'rs, Z. -- Moshe Zadka <moshez@math.huji.ac.il> There is no IGLU cabal. http://advogato.org/person/moshez
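Moshe's objection can be made concrete in a few lines (flag name from Guido's proposal, spelled as in Moshe's example; modern Python used for the sketch): the flag is inherited even when the method that justified it is overridden with the old signature.

```python
class A:
    __getslice_takes_3_args__ = True
    def __getslice__(self, lo, hi, step):
        return ('A', lo, hi, step)

class B(A):
    # Overrides with the OLD two-argument signature, but still inherits
    # the flag -- so a flag-based dispatcher would call it with three
    # arguments and blow up.
    def __getslice__(self, lo, hi):
        return ('B', lo, hi)

print(getattr(B, '__getslice_takes_3_args__', False))  # → True (wrongly inherited)
try:
    B().__getslice__(1, 2, 3)   # what the dispatcher would do
except TypeError as e:
    print('TypeError:', e)
```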

Guido van Rossum <guido@beopen.com>:
But the two-arg opcode format slows down the opcode decoding -- and that's the most time-critical part of all of ceval.c!
I don't see why that has to be so, as long as you don't try to pre-fetch the extra argument before switching on the opcode. Just leave it up to each branch of the switch to fetch another argument if needed. In fact, why not do that for one-argument opcodes as well? If what you say is true, that should make argument decoding even faster than it is now! Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

Sure. That's not how the patch is currently implemented though -- it adds this to the main opcode decoding:

    if (HAS_2ARG(opcode))
        oparg2 = NEXTARG();
Yes -- but also would cause about 30-40 copies of the same code (which could be a single macro call). This could easily be tested and timed though. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

Guido van Rossum wrote:
Yes, and we won't be able to conclude anything. This is micro-optimization, which doesn't give meaningful results. Actually, when I played last time with the main loop (it was for 1.5.2, I believe) duplicating the second argfetch didn't give any speedup. Mainly because the code for the 2nd byte fetch is already in the processor's cache. Consequently, testing and fetching the 2nd argbyte (or skipping it) is faster when it's done before the big switch. If it's done per opcode, the cache may be invalidated. Micro-optimization isn't worth the effort. As to the 2-byte arg limit, I think I'd be happy if the compiler raised an exception if this limit is exceeded. unrelated-with-PEP-203'ly y'rs -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

Fair enough (for one particular CPU at least).
That would be a good first step indeed! --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

Guido van Rossum wrote:
[me, on micro-optimizing the main loop]
Fair enough (for one particular CPU at least).
Since we're at it, it's worth mentioning another conclusion we came across at the time: the cache effects in the main loop are significant -- it is more important to try to keep the main loop small enough that those effects are minimized. An experiment I did at the time which gave some delta-speedup: I folded most of the UNOP & BINOP opcodes, since they differ only by the function they call and they duplicate most of the opcode body. Like this:

Original:

    case BINARY_POWER:
        w = POP();
        v = POP();
        x = PyNumber_Power(v, w, Py_None);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

    case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;
    ...

Folded version:

    case BINARY_POWER:
        w = POP();
        v = POP();
        x = PyNumber_Power(v, w, Py_None);
        goto end_binop;

    case BINARY_MULTIPLY:
        w = POP();
        v = POP();
        x = PyNumber_Multiply(v, w);
        goto end_binop;
    ...

    end_binop:
        Py_DECREF(v);
        Py_DECREF(w);
        PUSH(x);
        if (x != NULL) continue;
        break;

This reduced the code size of ceval.c, which resulted in fewer cache effects and gave more speed, despite the additional jumps. It possibly results in fewer page faults too, although this is unlikely. Which makes me think that, if we want to do something about cache effects, it is probably not a bad idea to just "reorder" the bytecodes in the big switch by decreasing frequency (we have some stats about this -- I believe Skip and MAL have discussed the opcodes' frequency and the charts lie somewhere in the archives). I remember Marc-Andre had done something in this direction and reported some perf improvements too. Since reordering the opcodes doesn't really hurt, if I'm about to do something with the main loop, it'll be only this.
I'll try to have a look, if someone else is not more reactive than me. -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
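The kind of opcode-frequency stats being discussed can be regenerated today with the `dis` module. A sketch counting *static* opcode occurrences in a function (dynamic execution counts would need an instrumented interpreter, but the ranking idea is the same; `sample` is an invented example function):

```python
import dis
from collections import Counter

def sample(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# Count how often each opcode appears in the compiled bytecode.
counts = Counter(instr.opname for instr in dis.get_instructions(sample))
for opname, freq in counts.most_common():
    print(f'{opname:20s} {freq}')
```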

Yes, that's what Tim keeps hammering on too.
I expect this is wholly attributable to the reduced code size. Most binary operators aren't used frequently enough to make a difference in other ways. If you put the common code at the end of the code for binary '+', that would optimize the most common operator.
Go for it -- sounds good! --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

Guido van Rossum <guido@beopen.com>:
Allow me to also suggest liberal use of "inline" for small helper functions called from within the main interpreter loop. The nonlocality produced by subroutine calls plays hell with small caches. Side observation: the fact that micro-optimizations are giving us measurable speedups tells us that the higher-level architecture and algorithms are very well tuned. Take a bow, Guido! -- <a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a> "Government is not reason, it is not eloquence, it is force; like fire, a troublesome servant and a fearful master. Never for a moment should it be left to irresponsible action." -- George Washington, in a speech of January 7, 1790

Guido van Rossum wrote:
+1 from here ;-) I have done quite a bit of testing with the old 1.5 ceval loop (patch still available in case someone wants to look at it) and found these things:

1. It pays off to special-case LOAD_FAST by handling it before going into the switch at all, since it is the most often used opcode in Python.

2. Reordering the opcodes has some noticeable effect on performance too, even though it is very touchy to cache lines.

3. Splitting the big switch in two, with the first one holding the most often used opcodes while the second one takes care of the more esoteric ones, helps keep the inner loop in the CPU cache and thus increases performance.
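The split-switch idea can be sketched as a toy dispatcher in Python (all opcode names, the program encoding, and the hot/cold split here are invented for illustration -- the real thing lives in ceval.c's switch):

```python
def run(program):
    """Toy two-level dispatcher: the hottest opcodes are tested inline
    first; everything else falls through to a second-level table."""
    stack = []

    def binary_add(s):
        b, a = s.pop(), s.pop()
        s.append(a + b)

    def unary_neg(s):
        s.append(-s.pop())

    cold = {'BINARY_ADD': binary_add, 'UNARY_NEG': unary_neg}

    for op, arg in program:
        if op == 'LOAD_FAST':       # most frequent opcode: handled inline
            stack.append(arg)
        elif op == 'POP_TOP':
            stack.pop()
        else:                       # esoteric opcodes: second level
            cold[op](stack)
    return stack

print(run([('LOAD_FAST', 2), ('LOAD_FAST', 3),
           ('BINARY_ADD', None), ('UNARY_NEG', None)]))  # → [-5]
```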

M.-A. Lemburg <mal@lemburg.com>:
Some thoughts:

1. That looks as close to a Poisson distribution as makes no difference. I wonder what that means?

2. Microtuning in the implementations of the top 3 opcodes looks indicated, as they seem to constitute more than 50% of all calls.

3. On the other hand, what do you get when you weight these by average time per opcode?

-- <a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a> "The bearing of arms is the essential medium through which the individual asserts both his social power and his participation in politics as a responsible moral being..." -- J.G.A. Pocock, describing the beliefs of the founders of the U.S.

On Fri, Jul 28, 2000 at 11:26:46AM -0400, Eric S. Raymond wrote:
    LOAD_FAST(124)  : 19323126 ================================
    SET_LINENO(127) : 15055591 ========================
1. That looks as close to a Poisson distribution as makes no difference. I wonder what that means?
It means there's still room for optimization! Other than that, probably not much. I also think that the similarity vanishes if you take into account that SET_LINENO does nothing, and LOAD_FAST is a basic operation and should be grouped with LOAD_NAME and LOAD_GLOBAL.
2. Microtuning in the implementations of the top 3 opcodes looks indicated, as they seem to constitute more than 50% of all calls.
Actually, I believe there was talk of removing SET_LINENO altogether... or am I mistaken in that ? In any case, you can do it by using -OO.
3. On the other hand, what do you get when you weight these by average time per opcode?
My guess is that *_NAME moves on up, as they already are pretty high up, and BINARY_* and UNARY_*, go *way* up: imagine all those classes that implement __add__ and __not__. Even if you have but a few of those calls, they effectively are (Python) function calls. In other words, those are pretty meaningless. -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

"Eric S. Raymond" wrote:
I'd say that there are good chances of successfully applying optimizations to the Python byte code -- someone with enough VC should look into this on a serious basis ;-) I think that highly optimized Python byte code compilers/interpreters would make nice commercial products which complement the targeted Python+Batteries distros.
2. Microtuning in the implementations of the top 3 opcodes looks indicated, as they seem to constitute more than 50% of all calls.
Separating out LOAD_FAST from the switch shows a nice effect. SET_LINENO is removed by -OO anyway, so there's really no use in optimizing this one. In my hacked up version I've also moved the signal handler into the second switch (along with SET_LINENO). The downside of this is that your program will only "see" signals if it happens to execute one of the less common opcodes, on the plus side you get an additional boost in performance -- if your app doesn't rely on signals to work, this is also a great way to squeeze out a little more performance.
3. On the other hand, what do you get when you weight these by average time per opcode?
Haven't tested this, but even by simply reordering the cases according to the above stats you get a positive response from pybench and pystone. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, Jul 28, 2000 at 06:07:21PM +0200, M.-A. Lemburg wrote:
At OLS I saw David S. Miller's keynote. Miller's been a Linux kernel hacker for quite a long time, and has been responsible for much of the Sparc and MIPS port, maintains the networking stack, etc. In his talk he mentioned that occasionally he gets really tired of kernel-space work, and has to do something in user-space for a change of pace; in that vein he rewrote GCC's Sparc64 backend, and recently wrote an nVidia driver for XFree86. My first thought when he said that was "Gosh, he should look at ceval.c!" --amk

M.-A. Lemburg wrote:
Actually, I think that instead of fighting with SET_LINENO for the standard setup, it would make sense to change the invocation options to something more standard:

1) python     - code without SET_LINENO
2) python -g  - code for debugging, with SET_LINENO
3) python -O  - code without doc-strings.

The point is that currently there's no optimization of the code stream at all, in the sense of classic compiler optimization (code-block-level optimizations, loop unrolling, etc). And SET_LINENO is actually useful only for debugging (with pdb, etc). What about this as a sane proposal for SET_LINENO? -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

On Fri, 28 Jul 2000 Vladimir.Marangozov@inrialpes.fr wrote:
I would be firmly -1 on this proposal. Without a major payoff - say, double or better performance improvements - i think it's a bad idea to disable debugging in the common case. Python already has -O and -OO - aren't they sufficient? (Maybe i'm misremembering, but i recall that removing SET_LINENO didn't yield a whole lot of performance improvement. Actually, i have the impression it's been a canonical example of how casual optimizations don't do what you expect - but i may be off base here.) I see the ability to debug at whim, even when you didn't expect having to do so, as a major win, part of the incremental development process. For me, it's a virtue up there with "having the source" in making it easier to fix my own and other people's mistakes. (The inconvenience of having to rerun with debugging enabled is magnified greatly when dealing with connecting to long-running processes, as i tend to do with zope under zeo and medusa. Then i'm really screwed if someone forgot to start with optional debugging enabled.) If you can replace SET_LINENO's functionality with something that doesn't have the same overhead, great! I have the impression from your recent comments along these lines that that isn't happening immediately... Ken klm@digicool.com

[Ken Manheimer]
That's correct, it's rarely more than just a few percent.
Well, it did what *I* expected <wink>. There are the "state lottery" and "penny saved is a penny earned" schools of optimization. They appeal, respectively, to "no guts, no glory" and "slow and steady wins the race" types. Both are valuable, although the true merits of the latter don't become apparent until after it's been pursued consistently for years. Python hasn't done that (there's been no consistency of effort, just manic "speed it up" bursts), and neither has it pursued any Major Schemes since LOAD_FAST was introduced. The most interesting dropped project "in the middle" was Skip Montanaro's Rattlesnake, which aimed to change the VM from a stack to a register model.
Same here! But I'm a little warped <snort>: I worked on optimization for a living for so many years, that one reason I was attracted to Python is that it didn't do *any* optimization to screw up my view of what my code was doing. Any debugger that claims to work well with "for real" optimized code in any language is lying. So the only optimizations I've ever supported in default-mode Python are things the user can't detect via semantics or usability. So I'm also -1 on [Vladimir]
unless SET_LINENO isn't needed for debugging. The speedup -O yields today is real and consistent, but rarely makes a difference worth worrying about (the only times it's mattered for me were in extremely well-tested CPU-bound programs that I intended to run for days straight, and then I appreciated the few hours it saved). python-the-choice-for-a-slow-generation<wink>-ly y'rs - tim

[Tim & Ken voting -1 and arguing about]
To be honest, I was -1 myself shortly after sending the message. The consequences, at least for IDLE, make this truly insane, unless, as you both pointed out, SET_LINENO's functionality is achieved without it. As a matter of fact, its functionality *can* be achieved, at least for debugging purposes. That is, for generating the callbacks from the main loop to a Python function, called on every new source line. This is basically what the patch I ended up with does. To clarify the current state of the challenge to remove SET_LINENO, here's a short summary of the idea we discussed previously and which I implemented (the patch does look scary without explanation <wink>).

Consider the following code sequence diagram (current state), assuming that each position is an opcode, arguments included:

    0         1         2         3
    0123456789012345678901234567890123
    co_code -> [#....#......#....#..#....#....]    where '.' - an opcode
                ^                                        '#' - SET_LINENO
                IP (instruction pointer)

Whenever a sys.settrace function is set, it is called on every # and this is basically how the debugger gets control for each source line. It finds out the source line by reading it from the current frame - f.lineno.

Now, here's the picture with the patch intended to obviate SET_LINENO:

    0         1         2         3
    0123456789012345678901234567890123
    co_code -> [........................]    the same co_code w/o SET_LINENO

    copy of -> [+...+.....+...+.+...+...]    '+' is a 1-byte CALL_TRACE opcode
    co_code     ^                            overriding the 1st opcode of each
    internal to IP                           new source line
    eval_code2()

Whenever a sys.settrace function is set, a copy of the original code stream is created and the IP is redirected to execute the copy before entering the main loop. In this copy, the 1st instruction for each line is set to CALL_TRACE. These locations are computed from the co_lnotab table. CALL_TRACE updates f.lineno and invokes the sys.settrace function.
On return from sys.settrace, it reads the opcode byte it has overwritten from the original co_code and jumps directly to the argument fetch before the big switch. All this works like a charm, except that, a priori, it has 2 drawbacks regarding full compatibility with the current state with SET_LINENO:

a) f.lineno is not updated on a regular basis like SET_LINENO does; only when sys.settrace is set & CALL_TRACE is called.

b) the IP is redirected to the copy of co_code before entering the main loop. In practical terms, this means that only entire code objects are traceable, that is, tracing can't start in the middle of a code object.

I am not sure whether these 2 things hurt. If so, probably a) hurts more than b), but this is where I stopped working on this at the time, being short of more ideas... If you have one, you're welcome :-) P!ng?? Florx bignal zupkin moognacious today? If not, this scheme probably loses due to a) and b) and SET_LINENO has to stay, at least, for full backward compatibility. Conclusion: the functionality is achieved, full back/compat is not :-/ -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
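The per-line callbacks that SET_LINENO exists to deliver are the 'line' events of sys.settrace. A minimal sketch of the standard trace hook (this shows the interface being preserved, not Vladimir's CALL_TRACE patch; `work` is an invented function):

```python
import sys

lines = []

def tracer(frame, event, arg):
    if event == 'line':                 # fired once per source line executed
        lines.append(frame.f_lineno)
    return tracer                       # keep tracing inside this frame

def work():
    x = 1
    y = 2
    return x + y

sys.settrace(tracer)
result = work()
sys.settrace(None)

print(result)            # → 3
print(len(lines) >= 3)   # one 'line' event per executed line of work()
```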

Hasn't backward compatibility already been broken for 2.0? So why not break it a little more? It always seemed odd to me that the current line number is always kept up to date, even though 99.999% of the time, no one will care. Why not just keep a small table that holds the offset in the bytecode at which each line starts, and look it up when it's needed? --amk

Hasn't backward compatibility already been broken for 2.0? So why not break it a little more?
Sure -- within reason.
That already exists. Search for co_lnotab, and for PyCode_Addr2Line(). --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

Guido van Rossum wrote:
Ok folks. I've uploaded an updated patch on SF against CVS if you want to play with it. IDLE, at least, seems to be happy. I'm not sure whether something else relies on f.lineno and whether it is now broken because of the lack of lineno updates. (Note that you need to delete your .pyc files so that the compiler can regenerate the code without SET_LINENO; if SET_LINENO are in the code, you wouldn't see any difference...) -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

[Andrew Kuchling]
As Guido said, we already do -- indeed, that's how Python manages to produce line numbers in tracebacks under -O mode (which inhibits generation of SET_LINENO opcodes). Your next job <wink> is to think about how, e.g., you can set a breakpoint on a specific line of code in the debugger. Take a quick peek at the implementation of SET_LINENO in ceval.c, and you'll see that's where the trace hook is implemented. No SET_LINENO, no trace callback. That's what Vladimir has been trying to solve: the trick is arranging to pay something at runtime for it only when you're actually using it. debuggers-always-need-dirty-tricks-ly y'rs - tim

    amk> It always seemed odd to me that the current line number is always
    amk> kept up to date, even though 99.999% of the time, no one will care.
    amk> Why not just keep a small table that holds the offset in the
    amk> bytecode at which each line starts, and look it up when it's
    amk> needed?

(I'm probably going to wind up seeming like a fool, responding late to this thread without having read it end-to-end, but...) Isn't that what the code object's co_lnotab is for? I thought the idea was to dispense with SET_LINENO altogether and just compute line numbers using co_lnotab on those rare occasions (debugging, tracebacks, etc) when you needed them. Skip

Skip Montanaro wrote:
Don't worry about it anymore. It's all in Postponed patch #101022 at SF. It makes the current "-O" the default (and uses co_lnotab), and reverts to the current default behaviour with "-d". I give myself a break on this. You guys need to test it now and report some feedback and impressions. If only to help Guido make up his mind and give him a chance to pronounce on it <wink>. [?!ng]
Which one? They are all the latest <wink>. See also the log msg of the latest tiny patch update at SF. -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

[Vladimir on removing SET_LINENO opcodes per default] Is there any reason why we can't just use traceback.tb_lineno() + maybe an equivalent C function instead of adding useless opcodes into the byte code stream ? The line number information is already stored in the co_lnotab table, so SET_LINENO is redundant in some respects. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
Yes - you seem to have missed half the story <wink>. How would you generate callbacks (fire "line" events) from within the mainloop when a sys.settrace function is set and the PC starts executing opcodes corresponding to a new line number? Note that we don't "add" any new opcodes, and in the scheme I presented CALL_TRACE is also internal to eval_code2(), like the copy of co_code, but I eventually decided to show it in opcodes.h. SET_LINENO is not generated anymore and is reduced to a NOOP in ceval, CALL_TRACE is introduced only for the callbacks. For b/w compatibility (of .pyc files) I think we can't just "get rid" of it. And now that the game is over with finding the solution to f.f_lineno's access, the proposal about "python -g" which preserves SET_LINENO makes sense (at least for visualizing SET_LINENO in a disassembly). -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

On Mon, Jul 31, 2000 at 12:16:17PM +0200, Vladimir Marangozov wrote:
Why not, except for the possible desirability of a -g flag that does use it? If you up the bytecode magic, which several other things in the pipeline are going to do anyway (augmented assignment, range literals, list comprehensions) old bytecode will be recompiled, and new bytecode will not be used on old interpreters. -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Vladimir Marangozov wrote:
Good question :-) You're right: I forgot about the trace function call in the SET_LINENO opcode.
With "add" I meant the SET_LINENO opcode... remove the "useless" from my comment above ;-)
BTW, we already have "python -d" which sets the debugging flag. You could simply use that flag for generating the SET_LINENO opcodes. Still, I think that the SET_LINENO issue is not really all that important: we have "python -O" which removes the opcodes. Perhaps moving the opcode a bit further down in the big switch would help CPU caches for production code (running under "python -O")... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
You're right. In case someone is not on the patches list and wants to test this, here's the log message of the latest patch update:

Date: 2000-Jul-31 07:57
By: marangoz

Comment: Another rewrite, making this whole strategy b/w compatible according to the 1st incompatibility point a) described in: http://www.python.org/pipermail/python-dev/2000-July/014364.html

Changes:

1. f.f_lineno is computed and updated on f_lineno attribute requests for f, given f.f_lasti. Correctness is ensured because f.f_lasti is updated on *all* attribute accesses (in LOAD_ATTR in the main loop).

2. The standard setup does not generate SET_LINENO, but uses co_lnotab for computing the source line number (e.g. tracebacks). This is equivalent to the actual "python -O".

3. With "python -d", we fall back to the current version of the interpreter (with SET_LINENO), thus making it easy to test whether this patch fully replaces SET_LINENO's behavior. (modulo f->f_lineno accesses from legacy C code, but this is insane)

IMO, this version is already worth the pain to be truly tested and improved. One improvement is to define a nicer public C API for breakpoints:

    PyCode_SetBreakPointAtLine(line)
    PyCode_SetBreakPointAtAddr(addr)

or similar, which would install a CALL_TRACE opcode in the appropriate location of the copy of co_code. Another idea is to avoid duplicating the entire co_code just for storing the CALL_TRACE opcodes. We can store them in the original and keep a table of breakpoints. Setting the breakpoints would occur whenever the sys.settrace hook is set.

For more info, visit: http://sourceforge.net/patch/?func=detailpatch&patch_id=101022&group_id=5470 -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
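The f_lineno-on-demand behaviour described in point 1 is how CPython ended up working: the frame derives f_lineno from f_lasti and the code object's line table only when the attribute is read. A small sketch (`where` is an invented function; sys._getframe is CPython-specific):

```python
import sys

def where():
    frame = sys._getframe()
    # No per-line bookkeeping happens while the code runs; f_lineno is
    # computed from frame.f_lasti and the line table only when read.
    return frame.f_lineno

lineno = where()
print(lineno)   # the source line of the `return` statement above
```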

Tim> The most interesting dropped project "in the middle" was Skip Tim> Montanaro's Rattlesnake, which aimed to change the VM from a stack Tim> to a register model. 'e's not dead, just restin'... ;-) Skip

On Sat, 29 Jul 2000, Ken Manheimer wrote:
Funny -- i was just about to throw in a hearty +1.
I see the ability to debug at whim, even when you didn't expect having to do so, as being a major win, part of the incremental development process.
But you *can't* debug at whim, right? You have to have started the debugger, so it can hook sys.settrace, before you run your program. If you have to start the debugger anyway, then you can surely say -g. If you *didn't* start the debugger beforehand, the best you can get is a detailed traceback with function names and line numbers -- and you don't lose any of that when SET_LINENO goes away. Let's suppose:

1. -g is turned on by default in interactive mode, since we don't know if you're going to feel like tracing today.

2. Programs (like debuggers and IDEs) that know they're going to use sys.settrace, or which contain embedded interactive interpreters, begin with

   #!/usr/bin/env python -g

3. Other programs begin with the usual

   #!/usr/bin/env python

Given this -- are there any debugging opportunities you're missing out on? I don't see any... -- ?!ng

On Sun, 30 Jul 2000, Ka-Ping Yee wrote:
Nope. More often than not, if i'm developing something substantial, then my interpreter sessions will involve developing context, running something then having it fail, doing a pdb.pm() post mortem, and then doing a second run with pdb.run(). My life would be worse if i happened to forget to start python with the '-g' in the first place. This scenario happens even when i don't intend to be "developing", ie when i encounter bugs in random code i'm using. As it stands, i (and anyone) have recourse to poke around at will. This is the kind of thing that makes an interpreted language valuable to me in the first place, and why i'm willing to trade off a lot more than a few percent performance. (This all may be moot if the co_lnotab-based alternative pans out - then everyone can be happy.)
This would address my above concerns.
One important one would be connecting with a long-running process, eg something running under medusa via the medusa monitor client. Unfortunately, it looks like the monitor client currently doesn't use a pty, or uses it wrong, or something, so the debugger doesn't work right anyway - which has been a royal pain on many occasions! I could see it being worthwhile to run the server process in an economy mode, if the economy were big. For some definition of big, of course.-) Ken klm@digicool.com

I tried this and found about three percent speed increase on pystone, for what that's worth. This is with python -OO on Linux x86. Note that removing the (now redundant) case from the switch seemed to make a small difference too. Alas, I have no time to play with optimizing the main loop in a more rigorous way... :-( Here's the patch I came up with:

Index: ceval.c
===================================================================
RCS file: /cvsroot/python/python/dist/src/Python/ceval.c,v
retrieving revision 2.187
diff -c -r2.187 ceval.c
*** ceval.c	2000/07/25 12:56:38	2.187
--- ceval.c	2000/07/30 16:13:23
***************
*** 608,616 ****
  		f->f_lasti = INSTR_OFFSET();
  #endif
  		opcode = NEXTOP();
! 		if (HAS_ARG(opcode))
  			oparg = NEXTARG();
  
  #ifdef DYNAMIC_EXECUTION_PROFILE
  #ifdef DXPAIRS
  		dxpairs[lastopcode][opcode]++;
--- 608,631 ----
  		f->f_lasti = INSTR_OFFSET();
  #endif
+ 	get_opcode:
  		opcode = NEXTOP();
! 		if (HAS_ARG(opcode)) {
  			oparg = NEXTARG();
+ 			if (opcode == LOAD_FAST) {
+ 				x = GETLOCAL(oparg);
+ 				if (x != NULL) {
+ 					Py_INCREF(x);
+ 					PUSH(x);
+ 					goto get_opcode;
+ 				}
+ 				PyErr_SetObject(PyExc_UnboundLocalError,
+ 						PyTuple_GetItem(co->co_varnames,
+ 								oparg));
+ 				goto on_error;
+ 			}
+ 		}
+ 
  #ifdef DYNAMIC_EXECUTION_PROFILE
  #ifdef DXPAIRS
  		dxpairs[lastopcode][opcode]++;
***************
*** 1282,1300 ****
  		}
  		Py_INCREF(x);
  		PUSH(x);
- 		break;
- 
- 		case LOAD_FAST:
- 			x = GETLOCAL(oparg);
- 			if (x == NULL) {
- 				PyErr_SetObject(PyExc_UnboundLocalError,
- 						PyTuple_GetItem(co->co_varnames,
- 								oparg));
- 				break;
- 			}
- 			Py_INCREF(x);
- 			PUSH(x);
- 			if (x != NULL) continue;
  		break;
  
  		case STORE_FAST:
--- 1297,1302 ----
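The effect of the patch is easier to see in miniature. Below is a toy dispatch loop, not CPython code (the opcode names and `run` helper are purely illustrative), that peels the hottest opcode off right at instruction fetch, before the big switch -- which is essentially what the C patch does for LOAD_FAST:

```python
# Toy bytecode interpreter: handle LOAD_FAST on the fast path at fetch
# time, skipping the general dispatch below. Illustrative names only.
LOAD_FAST, LOAD_CONST, BINARY_ADD, RETURN = range(4)

def run(code, consts, locals_):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        arg = code[pc]; pc += 1
        if op == LOAD_FAST:              # fast path: no dispatch needed
            stack.append(locals_[arg])
            continue
        if op == LOAD_CONST:
            stack.append(consts[arg])
        elif op == BINARY_ADD:
            b = stack.pop()
            stack[-1] = stack[-1] + b
        elif op == RETURN:
            return stack.pop()

# Compute x + 1, with x in local slot 0 (every opcode carries an arg here).
code = [LOAD_FAST, 0, LOAD_CONST, 0, BINARY_ADD, 0, RETURN, 0]
print(run(code, consts=[1], locals_=[41]))  # 42
```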

Guido van Rossum wrote:
Some speedup confirmed on my Linux PIII too. But you saved a round-trip around the ticker test. If this is acceptable, then let's do it for the opcodes involving only stack operations (probably not the JUMPs) _and_ which do not contain DECREFs which may trigger an external call. Here's the picture on my machine (I get the same picture with -OO):

cvs            - the original ceval.c in CVS
load_fast      - ceval.c with your patch
top5           - ceval.c with my patch at SF moving 5 opcodes to the top
top5-load_fast - my patch and your patch

~/python/dev>./python-cvs Lib/test/pystone.py
Pystone(1.1) time for 120000 passes = 19.85
This machine benchmarks at 6045.34 pystones/second
~/python/dev>./python-load_fast Lib/test/pystone.py
Pystone(1.1) time for 120000 passes = 19.61
This machine benchmarks at 6119.33 pystones/second
~/python/dev>./python-top5 Lib/test/pystone.py
Pystone(1.1) time for 120000 passes = 18.87
This machine benchmarks at 6359.3 pystones/second
~/python/dev>./python-top5-load_fast Lib/test/pystone.py
Pystone(1.1) time for 120000 passes = 19.08
This machine benchmarks at 6289.31 pystones/second

Which shows, among other things, that important cache effects are still here, because "python-top5-load_fast" is slower than "python-top5" alone... no-more-time-for-it-from-me-either'ly y'rs -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

[Vladimir Marangozov]
I would rather call them "second-order effects" (SOE), as we can actually have no idea what causes them without a whale of a lot of work. They *could* be unfortunate accidents in the data or instruction caches, or *could* be that rearranging the code caused the compiler to spill a variable to memory that would have been better off in a register, or *could* be that the compiler optimizer runs out of steam in a different place than before and so generates sub-optimal code in any number of other respects, or *could* be that the sequence of jumps just happened to change in a way that causes one to get mis-predicted by the HW, or causes the HW prediction cache to flush at a bad moment, or etc etc etc. A gloss: almost all interesting optimizations a compiler performs, from register assignment to instruction scheduling, take exponential time to optimize "for real". No compiler can afford that much time (although there are some platform-specific "superoptimizers" out there that you can run for days over small pieces of code), so in practice compilers string together a long chain of fast but more-or-less fragile heuristic approaches, or even resort to general "mystery schemes" (like simulated annealing). Predicting what pops out the other end generally can't be done easier than by running the optimizer and *seeing* what pops out! So tiny changes can have wonderfully <wink> unexpected effects, and in general you cannot out-think them in advance. And as CPU HW has gotten ever-more complex, SOE has spread to cover a much larger territory than it used to (note that Intel doesn't even pretend to document the actual P3 timing rules anymore -- they've gotten too complicated to explain! compilers only work from an *approximation* to HW truth now). That said, there's good cross-platform reason to believe that Vladimir's "top 5" rearrangement is a helpful change, and more helpful than sucking out one popular opcode. 
It also has less potential to trigger harmful SOE, because it doesn't change the basic-block structure of the code (optimizers crawl over a graph of the basic blocks, and changes to the *topology* of the graph can cause heuristics to "get stuck" in new & exciting places <0.9 wink>; this is important in ceval because the loop is soooooo big -- optimizers are likely to fall into a state of unstable equilibrium when chewing over this much code). +1 on top5 because it's an easy change that's likely to help and unlikely to hurt. -0 on peeing away more time on micro-optimization right now. then-again-if-it's-the-only-way-guido-can-take-time-to-actually- work-on-python...<wink>-ly y'rs - tim

M.-A. Lemburg wrote:
If you need help, I can dig up those old tools and patches...
Yes, please do. I think I'll come up with a patch posted to SF for your collective review. [Eric S. Raymond, on opcode stats by MAL]
Well, it's difficult to say what this means without looking at the tools that were used to generate these stats.
2. Microtuning in the implementations of the top 3 opcodes looks indicated, as they seem to constitute more than 50% of all calls.
Imagine what will happen if SET_LINENO disappears <wink> But this is very tricky business which is more complicated than it looks like...
3. On the other hand, what do you get when you weight these by average time per opcode?
I haven't run 100M opcodes, but you may want to have a look at some old micro-profiling I did long time ago: http://starship.python.net/~vlad/tprof/ The relevant file for the main loop is: http://starship.python.net/~vlad/tprof/pybench-0.6/python-151-orig-thr/__t.e... I am not sure this makes any sense by now, though. -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

Imagine what will happen if SET_LINENO disappears <wink> But this is very tricky business which is more complicated than it looks like...
Can someone summarise what the issues are with SET_LINENO? I had the impression that they're not needed for finding the line numbers in tracebacks. So what would we lose if they disappeared? Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

The debugger (and quite possibly the profiler - I haven't checked!) As Tim said only a few hours ago: --- Your next job <wink> is to think about how, e.g., you can set a breakpoint on a specific line of code in the debugger. Take a quick peek at the implementation of SET_LINENO in ceval.c, and you'll see that's where the trace hook is implemented. No SET_LINENO, no trace callback. That's what Vladimir has been trying to solve: the trick is arranging to pay something at runtime for it only when you're actually using it. debuggers-always-need-dirty-tricks-ly y'rs - tim --- Mark.

the trick is arranging to pay something at runtime for it only when you're actually using it.
When tracing is in effect, switch to another copy of ceval2() which contains extra code to do whatever grubby things are needed to detect the start of a line using the line starts table. (Search it before executing each opcode, expand it into a parallel array of flags, or whatever.) The two versions of ceval2() could be generated from the same source file with some #defines. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
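Greg's two-copies idea can be sketched in Python (in C, the two versions of ceval2() would be generated from one source file with #defines, as he says; everything below is an illustrative toy, not CPython code):

```python
def make_loop(tracing):
    # Two specializations of the same loop body: the production copy
    # never pays for the per-instruction trace test.
    def loop(steps, hook=None):
        out = []
        for pc, step in enumerate(steps):
            if tracing and hook is not None:
                hook(pc)          # "start of line" detection would go here
            out.append(step * 2)
        return out
    return loop

eval2_fast = make_loop(False)    # what normal / -O runs would use
eval2_traced = make_loop(True)   # swapped in while tracing is in effect

seen = []
print(eval2_traced([1, 2, 3], hook=seen.append))  # [2, 4, 6]
print(seen)                                       # [0, 1, 2]
```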

Greg Ewing wrote:
Huh? See my Addendum: to Mark's message that I just posted, which explains the whole story in detail. I forgot to change the subject line which still reads "Opcode reordering". Ah wait... from the archives: http://www.python.org/pipermail/python-dev/2000-July/014395.html If you think your suggestion still holds, then could you elaborate a bit more? -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

[me, on explaining the deal with SET_LINENO]
Which makes me think about the tstate->ticker. This is the only variable updated for every opcode. If we can manage to involve the ticker in the computation of f.lineno, having f.lasti and interp->checkinterval and whatever variables are logged somewhere, we're about to win. But it's likely to be messy, because thread states and frames would be involved. If this is possible, we'll trap f.lineno's access in frameobject.c and through this magic formula we could get an updated f.lineno on every Python request. Could someone think of a "magic formula" to compute the PC (the index of the current opcode), which is enough for computing f.lineno with PyCode_Addr2Line()? Guido? Tim? Anyone? -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

Mark Hammond wrote:
Addendum: the debugger in its current state. And the whole point of the solution we were thinking of is to make sure that the debugger (and other objects, like tracebacks) read the line number only when the main loop fires a "line" event via a hook. Given the state of the art, however, nobody can guarantee that the line number is not fetched in legacy code from the frame directly, through f.lineno (for whatever purpose).

Note that we can trap the access to the f.lineno attribute in frameobject.c, but the problem boils down to computing the line number from the current instruction, i.e. from 'f.lasti', because the only thing we have at our disposal is a table (co_lnotab) which maps line number offsets and instruction offsets (through delta increments). Note that the f.lasti attr is not updated as it should be in the main loop, because it would incur a lot of overhead (to be up to date, it needs to be updated for every opcode). It is updated only for 'important' events: a trace hook, an exception, a Python function call.

So, at the heart of the difficulty here is the fact that we can't figure out precisely where we are in the opcode stream, if we ask that from Python (because the program counter is not stored anywhere regularly). We need to get notified by the main loop. And whenever you ask 'f.lineno' from Python, well, you're asking this information from the Python side (without notification from the C side). Do you follow me up to this point? :-)

Now, with SET_LINENO, we are sure that f.lineno is always up to date, because its purpose is to update f.lineno (on one hand) *and* to fire a "line" event from the main loop if a hook is in place. The compiler makes sure to insert SET_LINENO opcodes for each new source line (this is why, BTW, you can end up with 10 SET_LINENO in a row if you have 10 doc strings in a row that describe some Python function).
Without SET_LINENO, we have to deal with these two problems: firing "line" events whenever a tracing function is set and updating f.lineno. We can manage to fire a "line" event (through the scheme I've explained previously, or similar), but we still can't (well, I don't see a reliable way so far) update f.lineno permanently. that's-all-<wink>'ly y'rs -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
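The addr-to-line computation being discussed can be sketched in present-day Python: dis.findlinestarts reads the code object's line table (historically co_lnotab), and recomputing the line from f_lasti agrees with what the frame itself reports. This is a sketch of the PyCode_Addr2Line idea, not the C implementation:

```python
import dis
import sys

def addr2line(code, lasti):
    # Map an instruction offset back to a source line by walking the
    # (offset, lineno) pairs of the code object's line table.
    line = code.co_firstlineno
    for offset, lineno in dis.findlinestarts(code):
        if offset > lasti:
            break
        if lineno is not None:
            line = lineno
    return line

def probe():
    f = sys._getframe()
    # Both reads happen on this one source line, so they must agree.
    return addr2line(f.f_code, f.f_lasti), f.f_lineno

computed, real = probe()
print(computed == real)  # True
```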

[Vladimir Marangozov]
I really don't think we care about that. Guido, do we <wink>? f_lineno is conceptually an internal implementation detail of frame objects, and Guido has often warned that if you access the internals you're on your own (I don't think even Michael's bytecodehacks cares about this one <wink>). So long as all uses of f_lineno in the std distribution can be faked, I think it's fine to just get rid of that member. WRT your later msg, strongly doubt tstate->ticker will help: it's an integer in range(tstate->interp->checkinterval + 1), and wraps around over & over. Besides, the modular base (checkinterval) can be changed by the user at any time. tstate->ticker is thus but the circular shadow of a legion of ghosts. It isn't worth all these brain cells just to preserve an internal detail! OK, it would be worth it if it were easy <wink> -- but it's not.

Tim Peters wrote:
It is! <wink>
OK, it would be worth it if it were easy <wink> -- but it's not.
It is! I think I found the solution :-) Whenever f.f_lineno is accessed, the compiler generates a LOAD_ATTR relative to f. Therefore, updating f->f_lasti = INSTR_OFFSET() in LOAD_ATTR + trapping "lineno" access in frameobject.c with PyCode_Addr2Line(f->f_lasti) ensures that we'll always get the right line number. Of course, this makes the update of f->f_lasti for every LOAD_ATTR, but given the ranking of SET_LINENO compared to LOAD_ATTR in the DXP and MAL's frequency charts, I think this is a pure win! Et voila <wink>. I'll update my patch at SF later. now-back-to-work'ly y'rs -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

Mark Hammond wrote:
Yes, traceback info comes from c_lnotab and doesn't depend on SET_LINENO. Only tracing and f.lineno depend on SET_LINENO.
So what would we lose if they disappeared?
The debugger (and quite possibly the profiler - I havent checked!)
I'm still a little puzzled on this one. The only way you can view a running Python program in the debugger is to have started it within the debugger, right? I don't think it's possible to attach a debugger to an already running program. Hence -- if you had to know beforehand that you were going to start the debugger anyway, then you clearly have the opportunity to say -g whenever you want to do debugging. It seems to me that any debugger (or IDLE et al.) can just start with #!/usr/bin/env python -g and we could then drop SET_LINENO by default. Am i missing something? On Mon, 31 Jul 2000, Vladimir Marangozov wrote:
This is okay. The *only* time this is going to make any difference is when a function is trying to inspect the location of the PC within itself. (When the frame is being inspected from anywhere else, f.lasti will be up to date because it will have been set on function call.) I think we don't need to care so much about a function looking at its own f.lineno. Just use PyCode_Addr2Line on f.lasti whenever someone asks for f.lineno, and perhaps document the caveat for anyone weird enough to write navel-gazing functions. :) -- ?!ng

Not really.
I don't think it's possible to attach a debugger to an already running program.
IDLE and Pythonwin are able to debug arbitrary programs once they have started - and they are both written in Python. In addition, most debuggers have a technique for breaking into the debugger from code - i.e., a hard break-point. pdb.set_trace() is how pdb does it. Pythonwin has one. IDLE either does, or will grow one.

I can see a number of other problems:

* You do not want to debug the IDE itself, just a tiny bit of code running under the IDE. Making the IDE take the full hit simply because it wants to run a debugger occasionally isn't fair. Also, people tend to start these IDEs once, and keep them running during a number of discrete tasks. Only a few of these tasks may involve firing up the debugger. Asking them to restart the IDE simply to enable the debugger would be a pain for the user. Worse, these IDEs are often started from a GUI, rather than from the command line. This would mean we need 2 (in Windows parlance) shortcuts - "IDLE" and "IDLE with debugging". The end result is that all IDEs will run with debugging enabled.

* Python often is embedded, for example, in a web server, or used for CGI. It should be possible to debug these programs directly. It shouldn't be necessary to (e.g.) change the CGI settings so Python is invoked with a special flag whenever you want to use a debugger. It can be hard enough to set up CGI as it is, let alone trying to make its behaviour depend on whether a script needs to be debugged or not.

* We should not force Python embedders to handle this themselves. For example, asking someone who embeds Python to support a command-line switch to enable debugging Python is a PITA. It may not even be possible. With Active Scripting, for example, the host may not know it is even using Python at all! It all seems to push the problem down to those who can least afford to manage it, and are least equipped to understand how it needs to be tweaked for their particular situation.

However, I could agree that the debugger itself trigger debuggable behaviour.
I just don't believe this should be a startup-only switch. sys.set_debugging(1) would work for me. OTOH, a debugger always announces itself by calling sys.settrace(), so this is really all that is necessary (and debugging can be disabled whenever sys.settrace(None) is called, and no existing frames have their tracer set.) Mark.
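Mark's closing observation -- that sys.settrace() is the debugger's real on-switch -- is easy to demonstrate with a minimal trace hook:

```python
import sys

# Tracing turns on with sys.settrace() and off with sys.settrace(None);
# no startup flag is involved. The hook records one "line" event per
# executed source line of the traced function.
lines = []

def tracer(frame, event, arg):
    if event == "line":
        lines.append(frame.f_lineno)
    return tracer  # keep tracing inside this frame

def f():
    x = 1
    y = x + 1
    return x + y

sys.settrace(tracer)
f()
sys.settrace(None)
print(len(lines))  # 3 -- one event per executed line of f()
```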

On Mon, 31 Jul 2000, Mark Hammond wrote:
IDLE and Pythonwin are able to debug arbitrary programs once they have started - and they are both written in Python.
But only if you start them *in* IDLE or Pythonwin, right?
Well, running with trace hooks in place is no different from the way things run now.
The end result is that all IDEs will run with debugging enabled.
Right -- that's what currently happens. I don't see anything wrong with that.
* Python often is embedded, for example, in a Web Server, or used for CGI. It should be possible to debug these programs directly.
But we don't even have a way to do this now. Attaching to an external running process is highly system-dependent trickery. If printing out tracebacks and other information isn't enough and you absolutely have to step the program under a debugger, the customary way of doing this now is to run a non-forking server under the debugger. In that case, you just start a non-forking server under IDLE which sets -g, and you're fine. Anyway, i suppose this is all rather moot now that Vladimir has a clever scheme for tracing even without SET_LINENO. Go Vladimir! Your last proposal sounded great. -- ?!ng

Ka-Ping Yee writes:
This #! isn't portable; the program is only guaranteed to get one parameter from the #! line ("python" in this example). Some platforms are more forgiving and just use the rest of the command line (or a maximum of 32 characters), but that means you lose the "-g" on too many platforms. The other concerns that have been brought up are fairly important as well. -Fred -- Fred L. Drake, Jr. <fdrake at beopen.com> BeOpen PythonLabs Team Member

"M" == M <mal@lemburg.com> writes:
M> These stats were created using an instrumented Python M> interpreter running real applications (and recording all M> executed opcodes to a file), so they can be considered close M> enough to what Python typically does in its inner loop. M> If you need help, I can dig up those old tools and patches... If you can update them for the current CVS, I'd suggest uploading them to SF. I'd then be willing to run Mailman for a while on an instrumented Python to see what it does. -Barry

"Barry A. Warsaw" wrote:
I'll see what I can do... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Yes, and we won't be able to conclude anything. This is micro-optimization which doesn't give meaningful results.
That's why I said "if what you say is true", i.e. if the claimed slowdown from the extra argument decoding step actually exists. There are two possibilities:

(1) It really does slow down the execution of all bytecodes. In that case, move it into the handler code for the relevant bytecodes, where it won't affect the speed of anything else.

(2) Otherwise, leave it where it is.

Either way, I don't see any justification to reject the idea of having 2-arg opcodes because of speed, which is what the original post seemed to be saying. Or maybe I misunderstood. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

Greg Ewing wrote:
The latter. I was objecting to Guido's suggestion to time it and see... I'm not opposed to the idea, although it is likely to slow things down. But assuming we don't talk about performance, I don't see a reason for not thinking about a generalized scheme with variable-size arguments, (cf. classical CPU instruction sets) that is, instructions which operate on 1-, 2- or 4-byte arguments. if-you-wanna-fix-it-do-it-like-a-man-<wink>'ly y'rs -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

NO!!! If my __getslice__ method has a bug which results in a type error, I want to get a traceback about it, not have it silently swallowed. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

On Thu, 27 Jul 2000, Guido van Rossum wrote:
Second, what should happen to a slice assignment?
My vote is: nothing! (Read on.)
The basic slice form is:
a[i:j] += b
What does this mean? I don't see how it could be any different from: a[j:j] = b If people want to insert sequences into other sequences, they should use this form; if they want to insert an element into a sequence, they should be encouraged to use an .insert() method. In other words, we already have perfectly good ways of doing this (more importantly, there are already clear and simple ways to implement such behaviour in your own objects).
a[:, ..., ::, 0:10:2, :10:, 1, 2:, ::-1] += 1
What on earth could this mean? And why would anyone want it? I can't even imagine what the simpler case a[i:j:k] += 1 would do... unless you wish to propose element-wise operators (e.g. add 1 to each of the elements that would participate in the slice) and i think that might be a whole new can of worms. It wasn't clear to me that this was your intent, though. It looks to me like going through various contortions to support augmented assignment to slices is not going to be worth the trouble. May i suggest >>> a[i:j] += b SyntaxError: augmented assignment to a slice is not allowed just like >>> (a, b, c) += 1,2,3 SyntaxError: augmented assignment to a tuple is not allowed ? -- ?!ng

On Thu, Jul 27, 2000 at 10:10:51AM -0700, Ka-Ping Yee wrote:
On Thu, 27 Jul 2000, Guido van Rossum wrote:
The basic slice form is:
a[i:j] += b
What does this mean? I don't see how it could be any different from:
a[j:j] = b
a[i:j] += b means exactly the same as: a[i:j] = a[i:j] + b Whether it does anything other than raising an exception depends entirely on the types of a and b !
I can't even imagine what the simpler case
a[i:j:k] += 1
would do... unless you wish to propose element-wise operators (e.g. add 1 to each of the elements that would participate in the slice)
That's entirely user-defined. Augmented assignment simply extends Python's extremely liberal (my cultural heritage tempts me to say 'communist' ;) semantics in these cases. a[i:j:k] += 1 is exactly a[i:j:k] = a[i:j:k] + 1. If 'a' is a Python class, this would turn into (forgetting about order of evaluation, for a second): a.__setitem__(slice(i, j, k), a[slice(i, j, k)].__add_ab__(1))
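That expansion can be watched from Python today: the proposal's __add_ab__ ended up as __iadd__, and slicing goes through __getitem__/__setitem__ with a slice object. The class below is a minimal illustration written for this note, not anything from the thread:

```python
events = []

class Seq:
    # Log every fetch and store so we can see what a[i:j] += b expands to.
    def __init__(self, data):
        self.data = list(data)
    def __getitem__(self, key):
        events.append(("get", key))
        return self.data[key]          # a plain list for a slice key
    def __setitem__(self, key, value):
        events.append(("set", key))
        self.data[key] = value

a = Seq([1, 2, 3, 4])
a[1:3] += [9]          # i.e. a[1:3] = a[1:3] + [9] (get, add, set)
print(events)          # [('get', slice(1, 3, None)), ('set', slice(1, 3, None))]
print(a.data)          # [1, 2, 3, 9, 4]
```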
It looks to me like going through various contortions to support augmented assignment to slices is not going to be worth the trouble.
May i suggest
>>> a[i:j] += b SyntaxError: augmented assignment to a slice is not allowed
Sure, but it doesn't make sense unless 'a[i:j] = a[i:j] + b' raises similar problems. Why this arbitrary border ? Because you can't imagine people wanting it ? -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

[Guido]
Second, what should happen to a slice assignment?
[Ping]
Apparently you haven't used NumPy arrays (matrices). I haven't either, but I hear they overload + so that A+B is elementwise addition, and you can write A+1 to add 1 to each element of A. Thus, for a NumPy programmer, A += 1 would certainly look like A = A+1, and similar for A += B.
That's what I thought too until I realized that for NumPy arrays slice+= makes sense. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

[Don't be scared, I'm revisiting this thread for a purpose -- this isn't a time jump ;-)] On Thu, Jul 27, 2000 at 12:59:15AM -0500, Guido van Rossum wrote:
[ And later you gave an example of slicing using slice objects, rather than the *SLICE+x opcodes ] I have two tiny problems with making augmented assignment use the current LOAD/STORE opcodes in the way Guido pointed out, above. One has to do with the order of the arguments, and the other with ROT_FOUR. And they're closely related, too :P The question is in what order the expression x += y is evaluated. x = y evaluates 'y' first, then 'x', but x + y evaluates 'x' first, and then 'y'. x = x + y Would thus evaluate 'x', then 'y', and then 'x' (for storing the result.) (The problem isn't with single-variable expressions like these examples, of course, but with expressions with sideeffects.) I think it makes sense to make '+=' like '+', in that it evaluates the lhs first. However, '+=' is as much '=' as it is '+', so it also makes sense to evaluate the rhs first. There are plenty of arguments both ways, and both sides of my brain have been beating eachother with spiked clubs for the better part of a day now ;) On the other hand, how important is this issue ? Does Python say anything about the order of argument evaluation ? Does it need to ? After making up your mind about the above issue, there's another problem, and that's the generated bytecode. If '+=' should be as '=' and evaluate the rhs first, here's what the bytecode would have to look like for the most complicated case (two-argument slicing.) a[b:c] += i LOAD i [i] LOAD a [i, a] DUP_TOP [i, a, a] LOAD b [i, a, a, b] DUP_TOP [i, a, a, b, b] ROT_THREE [i, a, b, a, b] LOAD c [i, a, b, a, b, c] DUP_TOP [i, a, b, a, b, c, c] ROT_FOUR [i, a, b, c, a, b, c] SLICE+3 [i, a, b, c, a[b:c]] ROT_FIVE [a[b:c], i, a, b, c] ROT_FIVE [c, a[b:c], i, a, b] ROT_FIVE [b, c, a[b:c], i, a] ROT_FIVE [a, b, c, a[b:c], i] INPLACE_ADD [a, b, c, result] STORE_SLICE+3 [] So, *two* new bytecodes, 'ROT_FOUR' and 'ROT_FIVE', just to get the right operands in the right place. 
On the other hand, if the *left* hand side is evaluated first, it would look like this:

a[b:c] += i

LOAD a          [a]
DUP_TOP         [a, a]
LOAD b          [a, a, b]
DUP_TOP         [a, a, b, b]
ROT_THREE       [a, b, a, b]
LOAD c          [a, b, a, b, c]
DUP_TOP         [a, b, a, b, c, c]
ROT_FOUR        [a, b, c, a, b, c]
SLICE+3         [a, b, c, a[b:c]]
LOAD i          [a, b, c, a[b:c], i]
INPLACE_ADD     [a, b, c, result]
STORE_SLICE+3   []

A lot shorter, and it only needs ROT_FOUR, not ROT_FIVE. An alternative solution is to drop ROT_FOUR too, and instead use a DUP_TOPX argument-opcode that duplicates the top 'x' items:

LOAD a          [a]
LOAD b          [a, b]
LOAD c          [a, b, c]
DUP_TOPX 3      [a, b, c, a, b, c]
SLICE+3         [a, b, c, a[b:c]]
LOAD i          [a, b, c, a[b:c], i]
INPLACE_ADD     [a, b, c, result]
STORE_SLICE+3   []

I think 'DUP_TOPX' makes more sense than ROT_FOUR, as DUP_TOPX could be used in the bytecode for 'a[b] += i' and 'a.b += i' as well. (Guido's example would become something like this:

a[b] += i

LOAD a          [a]
LOAD b          [a, b]
DUP_TOPX 2      [a, b, a, b]
BINARY_SUBSCR   [a, b, a[b]]
LOAD i          [a, b, a[b], i]
INPLACE_ADD     [a, b, result]
STORE_SUBSCR    []

)

So, *bytecode* wise, evaluating the lhs of '+=' first is easiest. It requires a lot more hacking of compile.c, but I think I can manage that. However, one part of me is still yelling that '+=' should evaluate its arguments like '=', not '+'. Which part should I lobotomize ? :) -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
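For what it's worth, DUP_TOPX is the variant that landed (in Python 2.0); it was later replaced by DUP_TOP_TWO and then COPY. You can check what a current CPython emits for the very statement under discussion; the exact opcode names vary by interpreter version, so no fixed output is shown:

```python
import dis

# Compile the statement from the thread and list the opcodes a modern
# CPython generates for it (names differ across versions).
code = compile("a[b:c] += i", "<demo>", "exec")
ops = [ins.opname for ins in dis.get_instructions(code)]
print(ops)
```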

[Thomas]
Yes. And note that the Python reference manual specifies the execution order (or at least tries to) -- I figured that in a user-friendly interpreted language, predictability is more important than some optimizer being able to speed your code up a tiny bit by rearranging evaluation order.
I say that in x += y, x should be evaluated before y.
Sure.
However, one part of me is still yelling that '+=' should evaluate its arguments like '=', not '+'. Which part should I lobotomize ? :)
That part. If you see x+=y as shorthand for x=x+y, x gets evaluated before y anyway! We're saving the second evaluation of x, not the first one! --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
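Guido's ruling matches what CPython ended up doing, and it is observable: in a[i] += v, the index and the old value are fetched before the right-hand side is evaluated. A small instrumented example (the helper names are made up for this note):

```python
events = []

class LoggingList(list):
    # Record every subscript fetch and store.
    def __getitem__(self, idx):
        events.append(("get", idx))
        return super().__getitem__(idx)
    def __setitem__(self, idx, value):
        events.append(("set", idx))
        super().__setitem__(idx, value)

def index():
    events.append("index")
    return 0

def value():
    events.append("value")
    return 5

a = LoggingList([10])
a[index()] += value()   # lhs (index and old value) first, then rhs
print(events)           # ['index', ('get', 0), 'value', ('set', 0)]
print(list(a))          # [15]
```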
participants (19)

- Andrew Kuchling
- Andrew Kuchling
- bwarsaw@beopen.com
- Eric S. Raymond
- Fred L. Drake, Jr.
- Fredrik Lundh
- Greg Ewing
- Guido van Rossum
- Ka-Ping Yee
- Ken Manheimer
- M.-A. Lemburg
- Mark Hammond
- Mark Hammond
- Michael Hudson
- Moshe Zadka
- Skip Montanaro
- Thomas Wouters
- Tim Peters
- Vladimir.Marangozov@inrialpes.fr