Neil and I have been working on the AST branch for the last week. We're nearly ready to merge the changes to the head. I imagine we'll do it this weekend, barring last-minute glitches. There are a few open issues that remain; I'd like to merge the branch before resolving them. Please let me know if you disagree.

The current status of the AST branch is that only two tests fail: test_trace and test_symtable. The causes for these failures are described below. We did not merge the current head to the branch again, but I diffed the test suite between head and branch and did not see any substantive changes since the last merge.

Some of the finer points of generating the line number table (lnotab) are wrong. There is some very delicate code to support single stepping with the debugger. We'll get that fixed soon, but we'd like to temporarily disable the failing tests in test_trace.

The symtable module exposed parts of the internal representation of the old symbol table. The representation changed, and the module is going to need to change. The old module was poorly documented and tested, so I'd like to start over. Again, I'd like to disable a couple of failing tests until after the merge occurs.

I don't think the current test suite covers all of the possible syntax errors that can be raised. I'd like to add a new test suite that covers all of the remaining cases, perhaps moving some existing tests into this module as well. I'd like to do that after the merge, which means there may be some loose ends where syntax errors aren't handled gracefully.

For those of you familiar with the ast work, I'll summarize the recent changes:

We added line numbers to expressions in the AST. There are many cases where a statement spans multiple lines, and we couldn't generate a correct lnotab without knowing the lines that expressions occur on.

We merged the peephole optimizer into the new compiler and restored PyNode_Compile() so that the parser module works again.
The parser module will still expose the old parse trees (just what its users want). We should probably provide a similar AST module, but I'm not sure if we'll get to that.

We fixed some flawed logic in the symbol table for handling nested scopes. Luckily, the test cases for nested scopes are pretty thorough; they all pass now.

Jeremy
I'm excited to see work continuing (resuming?) on the AST tree. I don't know how many machines you've been able to test the AST branch on. I have a linux/amd64 machine handy and I've tried to run the test suite with a fresh copy of the ast-branch.

test_trace segfaults consistently, even when run alone. You didn't give me the impression that the failure was a segfault, so I'll include more information about it below.

With '-x test_trace -x test_codecencodings_kr', I get through the testsuite run. Compared to a build of HEAD, also from today, I get additional failures in:
    test_genexps test_grp test_pwd test_symtable
and additional unexpected skips of:
    test_email test_email_codecs

The 'pwd' and 'grp' failures look like they're due to a change not merged from HEAD. I'm not sure what to make of the 'genexps' failure. Is it just a harmless output difference? I didn't see you mention that in your message. Here is some of the relevant-looking output:

$ ./python -E -tt ./Lib/test/regrtest.py
[...]
**********************************************************************
File "/usr/src/python-ast/Lib/test/test_genexps.py", line ?, in test.test_genexps.__test__.doctests
Failed example:
    (y for y in (1,2)) = 10
Expected:
    Traceback (most recent call last):
      ...
    SyntaxError: assign to generator expression not possible
Got:
    Traceback (most recent call last):
      File "/usr/src/python-ast/Lib/doctest.py", line 1243, in __run
        compileflags, 1) in test.globs
      File "<doctest test.test_genexps.__test__.doctests[38]>", line 1
    SyntaxError: assignment to generator expression not possible (<doctest test.test_genexps.__test__.doctests[38]>, line 1)
**********************************************************************
File "/usr/src/python-ast/Lib/test/test_genexps.py", line ?, in test.test_genexps.__test__.doctests
Failed example:
    (y for y in (1,2)) += 10
Expected:
    Traceback (most recent call last):
      ...
    SyntaxError: augmented assign to tuple literal or generator expression not possible
Got:
    Traceback (most recent call last):
      File "/usr/src/python-ast/Lib/doctest.py", line 1243, in __run
        compileflags, 1) in test.globs
      File "<doctest test.test_genexps.__test__.doctests[39]>", line 1
    SyntaxError: augmented assignment to generator expression not possible (<doctest test.test_genexps.__test__.doctests[39]>, line 1)
**********************************************************************
[...]
test test_grp failed -- Traceback (most recent call last):
  File "/usr/src/python-ast/Lib/test/test_grp.py", line 29, in test_values
    e2 = grp.getgrgid(e.gr_gid)
OverflowError: signed integer is greater than maximum
[...]
test test_pwd failed -- Traceback (most recent call last):
  File "/usr/src/python-ast/Lib/test/test_pwd.py", line 42, in test_values
    self.assert_(pwd.getpwuid(e.pw_uid) in entriesbyuid[e.pw_uid])
OverflowError: signed integer is greater than maximum

The segfault in test_trace looks like this:

$ gdb ./python
(gdb) source Misc/gdbinit
(gdb) run Lib/test/test_trace.py
[...]
test_10_no_jump_to_except_1 (__main__.JumpTestCase) ... FAIL

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 46912496260768 (LWP 11945)]
PyEval_EvalFrame (f=0x652c30) at Python/ceval.c:1994
1994                    Py_DECREF(v);
    [1967: case COMPARE_OP:]
(gdb) print oparg
$1 = 10    [PyCmp_EXC_MATCH?]
(gdb) pyo v
NULL
$2 = void
#0  PyEval_EvalFrame (f=0x652c30) at Python/ceval.c:1994
#1  0x0000000000475800 in PyEval_EvalFrame (f=0x697390) at Python/ceval.c:3618
#2  0x0000000000475800 in PyEval_EvalFrame (f=0x694f10) at Python/ceval.c:3618
#3  0x0000000000475800 in PyEval_EvalFrame (f=0x649fa0) at Python/ceval.c:3618
[...]
#50 0x00000000004113bb in Py_Main (argc=Variable "argc" is not available.)
    at Modules/main.c:484
(gdb) pystack
Lib/test/test_trace.py (447): no_jump_to_except_2
Lib/test/test_trace.py (447): run_test
Lib/test/test_trace.py (557): test_11_no_jump_to_except_2
/usr/src/python-ast/Lib/unittest.py (581): run
/usr/src/python-ast/Lib/unittest.py (280): __call__
/usr/src/python-ast/Lib/unittest.py (420): run
/usr/src/python-ast/Lib/unittest.py (427): __call__
/usr/src/python-ast/Lib/unittest.py (420): run
/usr/src/python-ast/Lib/unittest.py (427): __call__
/usr/src/python-ast/Lib/unittest.py (692): run
/usr/src/python-ast/Lib/test/test_support.py (692): run_suite
/usr/src/python-ast/Lib/test/test_support.py (278): run_unittest
Lib/test/test_trace.py (600): test_main
Lib/test/test_trace.py (600): <module>

I'm not sure what other information from gdb to furnish.

Jeff
On Thu, Oct 13, 2005 at 05:08:41PM -0500, jepler@unpythonic.net wrote:
test_trace segfaults consistently, even when run alone.
That's a bug in frame_setlineno(), IMO. It's failing to detect an invalid jump because the lnotab generated by the new compiler is slightly different (DUP_TOP opcode corresponds to a different line).
I'm not sure what to make of the 'genexps' failure. Is it just a harmless output difference? I didn't see you mention that in your message.
It's a bug in the traceback.py module, IMO. See bug 1326077. Neil
Jeremy Hylton wrote:
Some of the finer points of generating the line number table (lnotab) are wrong. There is some very delicate code to support single stepping with the debugger.
With disk and memory sizes being what they are nowadays, is it still worth making heroic efforts to compress the lnotab table? How about getting rid of all the delicate code and replacing it with something much simpler?

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg.ewing@canterbury.ac.nz        +--------------------------------------+
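For reference, the "delicate" format under discussion is the existing co_lnotab: a string of unsigned byte pairs, each giving an increment to the bytecode offset followed by an increment to the line number, starting from co_firstlineno. A rough decoding sketch (it glosses over the historical handling of deltas larger than 255, and the FakeCode object is purely illustrative):

```python
def addr2line(code, addrq):
    """Return the source line for bytecode offset `addrq`.

    Mirrors the traversal CPython's PyCode_Addr2Line does over
    co_lnotab: walk the (offset delta, line delta) byte pairs,
    stopping once the accumulated offset passes the query.
    """
    line = code.co_firstlineno
    addr = 0
    lnotab = code.co_lnotab
    for i in range(0, len(lnotab), 2):
        addr += lnotab[i]
        if addr > addrq:
            break
        line += lnotab[i + 1]
    return line


class FakeCode:
    # Illustrative stand-in for a real code object:
    # 6 bytes of code on line 10, then 8 bytes on line 11,
    # then the line number jumps by 2.
    co_firstlineno = 10
    co_lnotab = bytes([6, 1, 8, 2])
```

With this table, offsets 0-5 map to line 10 and offsets 6-13 map to line 11, which is the kind of mapping the debugger's single-stepping support relies on.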
At 01:43 PM 10/14/2005 +1300, Greg Ewing wrote:
Jeremy Hylton wrote:
Some of the finer points of generating the line number table (lnotab) are wrong. There is some very delicate code to support single stepping with the debugger.
With disk and memory sizes being what they are nowadays, is it still worth making heroic efforts to compress the lnotab table? How about getting rid of all the delicate code and replacing it with something much simpler?
+1. I'd be especially interested in lifting the current requirement that line ranges and byte ranges both increase monotonically. Even better if the lines for a particular piece of code don't have to all come from the same file. It'd be nice to be able to do the equivalent of '#line' directives for Python code that's generated by other tools, such as parser generators and the like.
Phillip J. Eby wrote:
+1. I'd be especially interested in lifting the current requirement that line ranges and byte ranges both increase monotonically. Even better if the lines for a particular piece of code don't have to all come from the same file.
How about an array of:

    +----------------+----------------+----------------+
    | bytecode index |    file no.    |    line no.    |
    +----------------+----------------+----------------+

Entries are sorted by bytecode index, with each entry applying from that bytecode position up to the position of the next entry. The file no. indexes a tuple of file names attached to the code object. All entries are 32-bit integers.

Easy to generate, easy to look up with a binary search, should be big enough for everyone except those generating obscenely huge code objects on 64-bit platforms.

Greg
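In Python terms, Greg's flat table amounts to something like the sketch below (the table contents, file names, and the `co_files` attribute are all hypothetical):

```python
import bisect

# Hypothetical flat location table: one (bytecode index, file no.,
# line no.) entry per range, sorted by bytecode index. `co_files`
# stands in for the proposed tuple of file names on the code object.
co_files = ("main.py", "included.py")
linetable = [
    (0, 0, 1),    # offsets 0..11  -> main.py, line 1
    (12, 0, 2),   # offsets 12..29 -> main.py, line 2
    (30, 1, 7),   # offsets 30..43 -> included.py, line 7 (inlined code)
    (44, 0, 3),   # offsets 44..   -> main.py, line 3
]

def location(addr):
    # Binary search: find the last entry at or before `addr`;
    # it applies up to the start of the next entry.
    starts = [entry[0] for entry in linetable]
    i = bisect.bisect_right(starts, addr) - 1
    _, fileno, lineno = linetable[i]
    return co_files[fileno], lineno
```

Note that nothing forces the line numbers to increase with the bytecode index, which is exactly the monotonicity requirement Phillip wants lifted.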
At 02:25 PM 10/14/2005 +1300, Greg Ewing wrote:
Phillip J. Eby wrote:
+1. I'd be especially interested in lifting the current requirement that line ranges and byte ranges both increase monotonically. Even better if the lines for a particular piece of code don't have to all come from the same file.
How about an array of:
+----------------+----------------+----------------+
| bytecode index |    file no.    |    line no.    |
+----------------+----------------+----------------+
Entries are sorted by bytecode index, with each entry applying from that bytecode position up to the position of the next entry. The file no. indexes a tuple of file names attached to the code object. All entries are 32-bit integers.
The file number could be 16-bit - I don't see a use case for referring to 65,000 different filenames. ;) But that doesn't save much space. Anyway, in the common case, this scheme will use 10 more bytes per line of Python code, which translates to a megabyte or so for the standard library. I definitely like the simplicity, but a meg's a meg. A more compact scheme is possible, by using two tables - a bytecode->line number table, and a line number-> file table. In the single-file case, you can omit the second table, and the first table then only uses 6 more bytes per line than we're currently using. Not fantastic, but probably more acceptable. If you have to encode multiple files, you just offset their line numbers by the size of the other files, and put entries in the line->file table to match. When computing the line number, you subtract the matching entry in the line->file table to get the actual line number within that file.
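A sketch of that two-table lookup, with hypothetical data: the first table maps bytecode offsets into one combined "virtual" line range, and the second records, per file, its first virtual line and the offset to subtract to recover the real line number:

```python
import bisect

# bytecode offset -> virtual line number (hypothetical values).
addr_to_vline = [(0, 1), (12, 2), (30, 1007), (44, 3)]

# first virtual line -> (file name, offset to subtract). Here main.py
# owns virtual lines 1..1000, and included.py's lines are offset
# by 1000, so its virtual line 1007 is really line 7.
vline_to_file = [(1, "main.py", 0), (1001, "included.py", 1000)]

def location(addr):
    # First table: last entry at or before `addr`.
    i = bisect.bisect_right([a for a, _ in addr_to_vline], addr) - 1
    vline = addr_to_vline[i][1]
    # Second table: which file's virtual-line range contains `vline`.
    j = bisect.bisect_right([v for v, _, _ in vline_to_file], vline) - 1
    _, fname, offset = vline_to_file[j]
    return fname, vline - offset
```

In the single-file case the second table collapses to one entry (or is omitted entirely), which is where the claimed space saving over the flat three-column table comes from.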
Phillip J. Eby wrote:
A more compact scheme is possible, by using two tables - a bytecode->line number table, and a line number-> file table.
If you have to encode multiple files, you just offset their line numbers by the size of the other files,
More straightforwardly, the second table could just be a bytecode -> file number mapping. The filename is likely to change much less often than the line number, so this table would contain far fewer entries than the line number table.

In the case of only one file, it would contain just a single entry, so it probably wouldn't even be worth the bother of special-casing that.

You could save a bit more by having two kinds of line number table, "small" (16-bit entries) and "large" (32-bit entries) depending on the size of the code object and range of line numbers. The small one would be sufficient for almost all code objects, so the most common case would use only about 4 bytes per line of code. That's only twice as much as the current scheme uses.

Greg
At 08:07 PM 10/14/2005 +1300, Greg Ewing wrote:
Phillip J. Eby wrote:
A more compact scheme is possible, by using two tables - a bytecode->line number table, and a line number-> file table.
If you have to encode multiple files, you just offset their line numbers by the size of the other files,
More straightforwardly, the second table could just be a bytecode -> file number mapping.
That would use more space in any case involving multiple files.
In the case of only one file, it would contain just a single entry, so it probably wouldn't even be worth the bother of special-casing that.
A line->file mapping would also have only one entry in that case.
You could save a bit more by having two kinds of line number table, "small" (16-bit entries) and "large" (32-bit entries) depending on the size of the code object and range of line numbers. The small one would be sufficient for almost all code objects, so the most common case would use only about 4 bytes per line of code. That's only twice as much as the current scheme uses.
That'd probably work.
Phillip J. Eby wrote:
At 08:07 PM 10/14/2005 +1300, Greg Ewing wrote:
More straightforwardly, the second table could just be a bytecode -> file number mapping.
That would use more space in any case involving multiple files.
Are you sure? Most of the time you're going to have chunks of contiguous lines coming from the same file, and the bytecode->filename table will only have an entry for the first bytecode of the first line of each chunk. I don't see how that works out differently from mapping bytecodes->lines and then lines->files.
That'd probably work.
Greg
[Phillip J. Eby]
It'd be nice to be able to do the equivalent of '#line' directives for Python code that's generated by other tools, such as parser generators and the like.
I had such a need a few times in the past, and it was tedious having to do indirections through generated Python for finding the real source as referenced by comments. Yet, granted also that the need has not been frequent, for me. -- François Pinard http://pinard.progiciels-bpi.ca
"Phillip J. Eby" <pje@telecommunity.com> writes:
Even better if the lines for a particular piece of code don't have to all come from the same file.
This seems _fairly_ esoteric to me. Why do you need it?

I can think of two uses for lnotab information: printing source lines and locating source lines on the filesystem. For both, I think I'd rather see some kind of defined protocol (methods on the code object, maybe?) rather than inventing some kind of insane too-general-for-the-common-case data structure.

Cheers,
mwh

--
42. You can measure a programmer's perspective by noting his attitude on
    the continuing vitality of FORTRAN.
      -- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html
At 09:23 AM 10/14/2005 +0100, Michael Hudson wrote:
"Phillip J. Eby" <pje@telecommunity.com> writes:
Even better if the lines for a particular piece of code don't have to all come from the same file.
This seems _fairly_ esoteric to me. Why do you need it?
Compilers that inline function calls, but want the code to still be debuggable. AOP tools that weave bytecode. Overloaded functions implemented by combining bytecode. Okay, those are fairly esoteric use cases, I admit. :) However, PyPy already has some inlining capability in its optimizer, so it's not all that crazy of an idea that Python in general will need it.
I can think of two uses for lnotab information: printing source lines and locating source lines on the filesystem. For both, I think I'd rather see some kind of defined protocol (methods on the code object, maybe?) rather than inventing some kind of insane too-general-for-the-common-case data structure.
Certainly a protocol would be nice; right now one is forced to interpret the data structure directly. Being able to say, "give me the file and line number for a given byte offset" would be handy in any case. However, since you can't subclass code objects, the capability would have to be part of the core.
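The kind of protocol being floated might look something like the wrapper below (the class and method names are entirely hypothetical; as Phillip notes, a real version would have to live on the code object itself, since code objects can't be subclassed):

```python
class CodeLocationView:
    """Hypothetical query interface over a code object's line info."""

    def __init__(self, code):
        self._code = code

    def position(self, byte_offset):
        """Return (filename, lineno) for a bytecode offset, hiding
        the co_lnotab byte-pair encoding from callers."""
        line = self._code.co_firstlineno
        addr = 0
        lnotab = self._code.co_lnotab
        for i in range(0, len(lnotab), 2):
            addr += lnotab[i]      # bytecode offset increment
            if addr > byte_offset:
                break
            line += lnotab[i + 1]  # line number increment
        return self._code.co_filename, line
```

Consumers such as traceback printing would then ask `position(offset)` instead of interpreting the data structure directly, and the underlying encoding could change freely.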
"Phillip J. Eby" <pje@telecommunity.com> writes:
At 09:23 AM 10/14/2005 +0100, Michael Hudson wrote:
"Phillip J. Eby" <pje@telecommunity.com> writes:
Even better if the lines for a particular piece of code don't have to all come from the same file.
This seems _fairly_ esoteric to me. Why do you need it?
Compilers that inline function calls, but want the code to still be debuggable. AOP tools that weave bytecode. Overloaded functions implemented by combining bytecode.
Err...
Okay, those are fairly esoteric use cases, I admit. :) However, PyPy already has some inlining capability in its optimizer, so it's not all that crazy of an idea that Python in general will need it.
Um. Well, _I_ still think it's pretty crazy.
I can think of two uses for lnotab information: printing source lines and locating source lines on the filesystem. For both, I think I'd rather see some kind of defined protocol (methods on the code object, maybe?) rather than inventing some kind of insane too-general-for-the-common-case data structure.
Certainly a protocol would be nice; right now one is forced to interpret the data structure directly. Being able to say, "give me the file and line number for a given byte offset" would be handy in any case.
However, since you can't subclass code objects, the capability would have to be part of the core.
Clearly, but any changes to co_lnotab would have to be part of the core too. Let's not make a complicated situation _worse_.

Something I didn't say was that a protocol like this would also let us remove the horrors of functions like inspect.getsourcelines() (see SF bugs passim).

Cheers,
mwh

--
There's an aura of unholy black magic about CLISP. It works, but I have
no idea how it does it. I suspect there's a goat involved somewhere.
  -- Johann Hibschman, comp.lang.scheme
Even better if the lines for a particular piece of code don't have to all come from the same file.
This seems _fairly_ esoteric to me. Why do you need it?
Compilers that inline function calls, but want the code to still be debuggable. AOP tools that weave bytecode. Overloaded functions implemented by combining bytecode.
Err...
Okay, those are fairly esoteric use cases, I admit. :) However, PyPy already has some inlining capability in its optimizer, so it's not all that crazy of an idea that Python in general will need it.
Um. Well, _I_ still think it's pretty crazy.
YAGNI Raymond
[Raymond Hettinger]
Even better if the lines for a particular piece of code don't have to all come from the same file.
YAGNI
I surely needed it, more than once. Don't be so assertive. :-) -- François Pinard http://pinard.progiciels-bpi.ca
At 02:41 PM 10/14/2005 -0400, Raymond Hettinger wrote:
YAGNI
If the feature were there, I'd have used it already, so I wouldn't consider it YAGNI. In the cases where I would've used it, I instead split generated code into separate functions so I could compile() each one with a different filename. Also, I notice that the peephole optimizer contains stuff to avoid making co_lnotab "too complex", although I haven't looked at it to be sure it'd actually benefit from an expanded lnotab format.
Hi! Phillip J. Eby wrote: [snip]
Okay, those are fairly esoteric use cases, I admit. :) However, PyPy already has some inlining capability in its optimizer, so it's not all that crazy of an idea that Python in general will need it.
It's kind of strange to argue with PyPy's inlining capabilities, since inlining in PyPy happens on a completely different level, that has nothing at all to do with Python code objects any more. So your proposed changes would not make a difference for PyPy (not even to speak about benefits). [snip] cheers, Carl Friedrich Bolz
[Michael Hudson]
"Phillip J. Eby" <pje@telecommunity.com> writes:
Even better if the lines for a particular piece of code don't have to all come from the same file.
This seems _fairly_ esoteric to me. Why do you need it?
For when Python code is generated from more than one original source file (referring to the `#line' directive message, a little earlier this week). For example, think include files. -- François Pinard http://pinard.progiciels-bpi.ca
Neil and I have been working on the AST branch for the last week. We're nearly ready to merge the changes to the head.
Nice work.
I don't think the current test suite covers all of the possible syntax errors that can be raised.
Does the AST branch generate a syntax error for:

    foo(a = i for i in range(10))

?

Raymond
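For what it's worth, present-day CPython does reject that construct at compile time (the exact error message varies by version); a quick check:

```python
# A bare generator expression mixed with a keyword argument must be
# rejected by the compiler: the genexp would need its own parentheses.
try:
    compile("foo(a = i for i in range(10))", "<test>", "eval")
except SyntaxError as exc:
    print("rejected:", exc.msg)
else:
    print("accepted")
```

The same one-liner makes a handy smoke test for any new compiler's syntax-error coverage.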
[Jeremy]
Neil and I have been working on the AST branch for the last week. We're nearly ready to merge the changes to the head.
[Raymond]
Nice work.
Indeed. I should've threatened to kill the AST branch long ago! :) -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 10/13/05, Guido van Rossum <guido@python.org> wrote:
Indeed. I should've threatened to kill the AST branch long ago! :)
:-)

I decreased a lot of the memory leaks. Here are some more to work on. I doubt this list is complete, but it's a start:

PyObject_Malloc (obmalloc.c:717)
_PyObject_DebugMalloc (obmalloc.c:1014)
compiler_enter_scope (newcompile.c:1204)
compiler_mod (newcompile.c:1894)
PyAST_Compile (newcompile.c:471)
Py_CompileStringFlags (pythonrun.c:1240)
builtin_compile (bltinmodule.c:391)

Tuple (Python-ast.c:907)
ast_for_testlist (ast.c:1782)
ast_for_classdef (ast.c:2677)
ast_for_stmt (ast.c:2758)
PyAST_FromNode (ast.c:233)
PyParser_ASTFromFile (pythonrun.c:1291)
parse_source_module (import.c:762)
load_source_module (import.c:886)

new_arena (obmalloc.c:500)
PyObject_Malloc (obmalloc.c:699)
PyObject_Realloc (obmalloc.c:837)
_PyObject_DebugRealloc (obmalloc.c:1077)
PyNode_AddChild (node.c:95)
shift (parser.c:112)
PyParser_AddToken (parser.c:244)
parsetok (parsetok.c:165)
PyParser_ParseFileFlags (parsetok.c:89)
PyParser_ASTFromFile (pythonrun.c:1288)
parse_source_module (import.c:762)
load_source_module (import.c:886)

Lambda (Python-ast.c:610)
ast_for_lambdef (ast.c:859)
ast_for_expr (ast.c:1443)
ast_for_testlist (ast.c:1776)
ast_for_expr_stmt (ast.c:1845)
ast_for_stmt (ast.c:2716)
PyAST_FromNode (ast.c:233)
PyParser_ASTFromString (pythonrun.c:1271)
Py_CompileStringFlags (pythonrun.c:1237)
builtin_compile (bltinmodule.c:391)

BinOp (Python-ast.c:557)
ast_for_binop (ast.c:1389)
ast_for_expr (ast.c:1531)
ast_for_testlist (ast.c:1776)
ast_for_expr_stmt (ast.c:1845)
ast_for_stmt (ast.c:2716)
PyAST_FromNode (ast.c:233)
PyParser_ASTFromString (pythonrun.c:1271)
Py_CompileStringFlags (pythonrun.c:1237)
builtin_compile (bltinmodule.c:391)

Name (Python-ast.c:865)
ast_for_atom (ast.c:1201)
ast_for_expr (ast.c:1555)
ast_for_testlist (ast.c:1776)
ast_for_expr_stmt (ast.c:1798)
ast_for_stmt (ast.c:2716)
PyAST_FromNode (ast.c:233)
PyParser_ASTFromString (pythonrun.c:1271)
Py_CompileStringFlags (pythonrun.c:1237)
builtin_compile (bltinmodule.c:391)
On 10/14/05, Neal Norwitz <nnorwitz@gmail.com> wrote:
I decreased a lot of the memory leaks. Here are some more to work on. I doubt this list is complete, but it's a start:
Oh, and since I fixed the memory leaks in a generated file (Python/Python-ast.c), the changes still need to be implemented in the right place (i.e., Parser/asdl_c.py).

Valgrind didn't report any invalid uses of memory, though there is also a lot of potentially leaked memory. It seemed a lot higher than what I remembered, so I'm not sure if it's an issue or not. I'll look into that after we get the definite memory leaks plugged.

n
Hi Jeremy, On Thu, Oct 13, 2005 at 04:52:14PM -0400, Jeremy Hylton wrote:
I don't think the current test suite covers all of the possible syntax errors that can be raised. I'd like to add a new test suite that covers all of the remaining cases, perhaps moving some existing tests into this module as well.
You might be interested in PyPy's test suite here. In particular, http://codespeak.net/svn/pypy/dist/pypy/interpreter/test/test_syntax.py contains a list of syntactically valid and invalid corner cases. If you are willing to check out the whole of PyPy (i.e. http://codespeak.net/svn/pypy/dist) you should also be able to run the whole test suite, or at least the following tests: python test_all.py pypy/interpreter/test/test_compiler.py python test_all.py pypy/interpreter/pyparser/ which compare CPython's builtin compiler with our own compilers; as of PyPy revision 18722 these tests pass on all CPython versions (2.3.5, 2.4.2, HEAD). A bientot, Armin.
participants (12)
- Armin Rigo
- Carl Friedrich Bolz
- François Pinard
- Greg Ewing
- Guido van Rossum
- jepler@unpythonic.net
- Jeremy Hylton
- Michael Hudson
- Neal Norwitz
- Neil Schemenauer
- Phillip J. Eby
- Raymond Hettinger