[Python-checkins] python/dist/src/Python compile.txt, 1.1.2.9,
1.1.2.10
bcannon at users.sourceforge.net
Wed Mar 23 16:51:35 CET 2005
Update of /cvsroot/python/python/dist/src/Python
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23056/Python
Modified Files:
Tag: ast-branch
compile.txt
Log Message:
Heavy rewrite, fleshing out high-level ideas with low-level details of
implementation.
Thanks to all who helped edit this doc at the PyCon 2005 sprint.
Index: compile.txt
===================================================================
RCS file: /cvsroot/python/python/dist/src/Python/Attic/compile.txt,v
retrieving revision 1.1.2.9
retrieving revision 1.1.2.10
diff -u -d -r1.1.2.9 -r1.1.2.10
--- compile.txt 16 Mar 2005 20:20:18 -0000 1.1.2.9
+++ compile.txt 23 Mar 2005 15:51:25 -0000 1.1.2.10
@@ -1,42 +1,125 @@
Developer Notes for Python Compiler
===================================
-Parsing
--------
-XXX Fill in
+Table of Contents
+-----------------
-Abstract Syntax Tree (AST)
---------------------------
+- Scope
+ Defines the limits of the change
+- Parse Trees
+ Describes the local (Python) concept
+- Abstract Syntax Trees (AST)
+ Describes the AST technology used
+- Parse Tree to AST
+ Defines the transform approach
+- Control Flow Graphs
+ Defines the creation of "basic blocks"
+- AST to CFG to Bytecode
+ Tracks the flow from AST to bytecode
+- Code Objects
+ Pointer to making bytecode "executable"
+- Modified Files
+ Files added/modified/removed from CPython compiler
+- ToDo
+ Work yet remaining (before complete)
+- References
+ Academic and technical references to technology used.
-The abstract syntax tree (AST) is a high-level description of the
-program structure with the syntactic details of the source text
-removed. It is specified using the Zephyr Abstract Syntax Definition
-Language (ASDL) [Wang97]_.
-The Python definition is found in the file ``Parser/Python.asdl``.
+Scope
+-----
-The definition describes the structure of statements, expressions, and
-several specialized types, like list comprehensions and exception
-handlers. Most definitions in the AST correspond to a particular
-source construct, like an 'if' statement or an attribute lookup. The
-definition is independent of its realization in any particular
-programming language.
+Historically (through 2.4), compilation from source code to bytecode
+involved two steps:
-XXX Is byte stream format what marshal_* fxns create?
-XXX no AST->python yet, right?
+1. Parse the source code into a parse tree (Parser/pgen.c)
+2. Emit bytecode based on the parse tree (Python/compile.c)
-The AST has concrete representations in Python and C. There is also
-representation as a byte stream, so that AST objects can be passed
-between Python and C. (ASDL calls this format the pickle format, but
-I avoid that term to avoid confusion with Python pickles.) Each
-programming language has a generic representation for ASDL and a tool
-to generate a code for a specific abstract syntax.
+This is not how a standard compiler works. The usual
+steps for compilation are:
-The following fragment of the Python abstract syntax demonstrates the
-approach.
+1. Parse source code into a parse tree (Parser/pgen.c)
+2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
+3. Transform AST into a Control Flow Graph (Python/newcompile.c)
+4. Emit bytecode based on the Control Flow Graph (Python/newcompile.c)
-XXX update example once decorators in
-::
+Starting with Python 2.5, the above steps are now used. This change
+was done to simplify compilation by breaking bytecode emission into
+three smaller steps. The purpose of this document is to outline how
+the latter three steps of the process work.
+
+This document does not touch on how parsing works beyond what is needed
+to explain compilation. It is also not exhaustive in terms of how the
+entire system works. You will most likely need to read some source to
+have an exact understanding of all details.
+
+
+Parse Trees
+-----------
+
+Python's parser is an LL(1) parser based mostly on the
+implementation laid out in the Dragon Book [Aho86]_.
+
+The grammar file for Python can be found in Grammar/Grammar, with the
+numeric values of grammar rules stored in Include/graminit.h. The
+numeric values for types of tokens (literal tokens, such as ``:``,
+numbers, etc.) are kept in Include/token.h. The parse tree is made up
+of ``node *`` structs (as defined in Include/node.h).
+
+Querying data from the node structs can be done with the following
+macros (which are all defined in Include/node.h unless noted otherwise):
+
+- ``CHILD(node *, int)``
+ Returns the nth child of the node using zero-offset indexing
+- ``RCHILD(node *, int)``
+ Returns the nth child of the node from the right side; use
+ negative numbers!
+- ``NCH(node *)``
+ Number of children the node has
+- ``STR(node *)``
+ String representation of the node; e.g., will return ``:`` for a
+ COLON token
+- ``TYPE(node *)``
+ The type of node as specified in ``Include/graminit.h``
+- ``REQ(node *, TYPE)``
+ Assert that the node is the type that is expected
+- ``LINENO(node *)``
+ Retrieves the line number of the source code that led to the
+ creation of the parse rule; defined in Python/ast.c
+
+To tie all of this together with an example, consider the rule for 'while'::
+
+ while_stmt: 'while' test ':' suite ['else' ':' suite]
+
+The node representing this will have ``TYPE(node) == while_stmt`` and
+the number of children can be 4 or 7 depending on whether there is an
+'else' clause. To access what should be the first ':' and require that
+it be an actual ':' token, use ``REQ(CHILD(node, 2), COLON)``.
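
The macro usage above can be sketched with a simplified stand-in for the
node struct. This is only an illustration: the struct layout, the numeric
codes, and the helpers ``while_has_else()`` and ``make_simple_while()`` are
assumptions made for the example and are not taken from Include/node.h,
Include/token.h, or Include/graminit.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical numeric codes; the real values live in
   Include/token.h and Include/graminit.h. */
enum { NAME = 1, COLON = 11, test_rule = 300, suite_rule = 301,
       while_stmt = 302 };

/* Simplified stand-in for the parse tree node struct. */
typedef struct _node {
    int type;               /* token or grammar rule number */
    const char *str;        /* token text, NULL for non-terminals */
    int nchildren;
    struct _node *children;
} node;

#define NCH(n)       ((n)->nchildren)
#define CHILD(n, i)  (&(n)->children[i])
#define RCHILD(n, i) (CHILD(n, NCH(n) + (i)))   /* i is negative */
#define TYPE(n)      ((n)->type)
#define STR(n)       ((n)->str)
#define REQ(n, t)    assert(TYPE(n) == (t))

/* Return 1 if a while_stmt node carries an 'else' clause, else 0. */
static int
while_has_else(const node *n)
{
    REQ(n, while_stmt);
    REQ(CHILD(n, 2), COLON);    /* the first ':' */
    assert(NCH(n) == 4 || NCH(n) == 7);
    return NCH(n) == 7;
}

/* Build the node for "while <test>: <suite>" by hand; the caller
   supplies storage for the four children. */
static node
make_simple_while(node kids[4])
{
    node n = { while_stmt, NULL, 4, kids };
    kids[0] = (node){ NAME, "while", 0, NULL };
    kids[1] = (node){ test_rule, NULL, 0, NULL };
    kids[2] = (node){ COLON, ":", 0, NULL };
    kids[3] = (node){ suite_rule, NULL, 0, NULL };
    return n;
}
```

Because ``REQ`` expands to ``assert``, a mismatch aborts a debug build
instead of returning an error code, which suits a check that should never
fail when the grammar and the tree walker agree.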
+
+
+Abstract Syntax Trees (AST)
+---------------------------
+
+The abstract syntax tree (AST) is a high-level representation of the
+program structure without the necessity of containing the source code;
+it can be thought of as an abstract representation of the source code.
+The AST nodes are specified using the Zephyr Abstract Syntax Definition
+Language (ASDL) [Wang97]_.
+
+The definition of the AST nodes for Python is found in the file
+Parser/Python.asdl .
+
+Each AST node (representing statements, expressions, and several
+specialized types, like list comprehensions and exception handlers) is
+defined by the ASDL. Most definitions in the AST correspond to a
+particular source construct, such as an 'if' statement or an attribute
+lookup. The definition is independent of its realization in any
+particular programming language.
+
+The following fragment of the Python ASDL construct demonstrates the
+approach and syntax::
+
+ XXX update example once decorators in
module Python
{
@@ -45,17 +128,25 @@
attributes (int lineno)
}
-The preceding example describes three different kinds of statements --
-a function definition and return and yield statement. The function
-definition has three arguments -- its name, its argument list, and
-zero or more statements that make up its body. The return statement
-has an optional expression that is the return value. The yield
-statement requires an expression.
+The preceding example describes three different kinds of statements:
+function definitions, return statements, and yield statements. All
+three kinds are of type stmt, as shown by the '|' separating the
+various kinds. They all take arguments of various kinds and amounts.
-The statement definitions above generate the following C structure
-type.
+Modifiers on the argument type specify the number of values needed; '?'
+means it is optional, '*' means 0 or more, no modifier means only one
+value for the argument and it is required. FunctionDef, for instance,
+takes an identifier for the name, 'arguments' for args, and zero or more
+stmt arguments for 'body'.
-::
+Do notice that something like 'arguments', which is a node type, is
+represented as a single AST node and not as a sequence of nodes as with
+stmt as one might expect.
+
+All three kinds also have an 'attributes' argument; this is shown by the
+fact that 'attributes' lacks a '|' before it.
+
+The statement definitions above generate the following C structure type::
typedef struct _stmt *stmt_ty;
@@ -79,140 +170,225 @@
int lineno;
}
-It also generates a series of constructor functions that generate a
-``stmt_ty`` struct with the appropriate initialization. The ``kind`` field
-specifies which component of the union is initialized. The
-``FunctionDef`` C function sets ``kind`` to ``FunctionDef_kind`` and
-initializes the ``name``, ``args``, and ``body`` fields.
+Also generated are a series of constructor functions that allocate (in
+this case) a stmt_ty struct with the appropriate initialization. The
+'kind' field specifies which component of the union is initialized. The
+FunctionDef() constructor function sets 'kind' to FunctionDef_kind and
+initializes the 'name', 'args', 'body', and 'attributes' fields.
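
The constructor pattern described above can be sketched as follows. This
is a reduced illustration, not the generated Python-ast.c: only the Return
and Yield kinds are modeled, ``expr_ty`` is left opaque, and the use of
plain ``malloc`` is an assumption of the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _expr *expr_ty;    /* left opaque in this sketch */

enum _stmt_kind { Return_kind = 1, Yield_kind = 2 };

typedef struct _stmt *stmt_ty;
struct _stmt {
    enum _stmt_kind kind;         /* says which union member is valid */
    union {
        struct { expr_ty value; } Return;   /* expr? : may be NULL */
        struct { expr_ty value; } Yield;    /* expr  : required */
    } v;
    int lineno;                   /* the 'attributes' field */
};

/* Allocate a Return statement node; returns NULL if allocation fails. */
static stmt_ty
Return(expr_ty value, int lineno)
{
    stmt_ty p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->kind = Return_kind;
    p->v.Return.value = value;
    p->lineno = lineno;
    return p;
}

/* Allocate a Yield statement node; the expression is mandatory. */
static stmt_ty
Yield(expr_ty value, int lineno)
{
    stmt_ty p;
    assert(value != NULL);
    p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->kind = Yield_kind;
    p->v.Yield.value = value;
    p->lineno = lineno;
    return p;
}
```

Code that consumes such a node would switch on ``kind`` before touching
``v``, since only the union member matching ``kind`` has been initialized.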
-CST to AST
-----------
-XXX Make sure basic flow of execution is covered
+Parse Tree to AST
+-----------------
-The parser generates a concrete syntax tree represented by a ``node
-*`` as defined in ``Include/node.h``. Node indexing starts at 0. Every
-token that can have whitespace surrounding it is its own token. This
-means that something like "else:" is actually two tokens: 'else' and
-':'.
+The AST is generated from the parse tree (see Python/ast.c) using the
+function::
-The abstract syntax is generated
-from the concrete syntax in ``Python/ast.c`` using the function::
+ mod_ty PyAST_FromNode(const node *n);
- mod_ty PyAST_FromNode(const node *n);
+The function begins a tree walk of the parse tree, creating various AST
+nodes as it goes along. It does this by allocating all the new nodes it
+needs, calling the proper AST node creation functions along with any
+required supporting functions, and connecting the nodes as needed.
-It does this by calling various functions in the file that all have the
-name ast_for_xx where xx is what the rule of the grammar (as defined
-in ``Grammar/Grammar``) that the function handles (alias_for_import_name
-is the exception to this). These in turn call the constructor functions
-as defined by the ASDL grammar to create the nodes of the AST.
+Do realize that there is neither an automated nor a symbolic connection
+between the grammar specification and the nodes in the parse tree.
-Common macros used to manipulate ``node *`` structs as defined in
-``Include/node.h``:
- - CHILD(node, n) -- Returns the nth child of node using zero-offset
- indexing
- - NCH(node) -- Number of children node has
- - STR(node) -- String representation of node
- - TYPE(node) -- The type of node as listed in ``Include/graminit.h``
- - REQ(node, type) -- Assert that the node is the type that is expected
+For instance, one must keep track of which node in the parse tree one
+is working with (e.g., if you are working with an 'if' statement you
+need to watch out for the ':' token to find the end of the conditional).
+No help is directly provided by the parse tree as in yacc.
-Function and macros for creating and using ``asdl_seq *`` types
-as found in Python/asdl.c and Include/asdl.h:
- - asdl_seq_new(int) -- Allocate memory for an asdl_seq for length
- 'size'
- - asdl_seq_free(asdl_seq *) -- Free asdl_seq struct
- - asdl_seq_GET(seq, pos) -- Get item held at 'pos'
- - asdl_seq_SET(seq, pos, val) -- Set 'seq' at 'pos' to 'val'
- - asdl_seq_APPEND(seq, val) -- Set the end of 'seq' to 'val'
- - asdl_seq_LEN(seq) -- Return the length of 'seq'
+The functions called to generate AST nodes from the parse tree all have
+the name ast_for_xx where xx is the grammar rule that the function
+handles (alias_for_import_name is the exception to this). These in turn
+call the constructor functions as defined by the ASDL grammar and
+contained in Python/Python-ast.c (which was generated by
+Parser/asdl_c.py) to create the nodes of the AST. This all leads to a
+sequence of AST nodes stored in asdl_seq structs.
-Code Generation and Basic Blocks
---------------------------------
+Functions and macros for creating and using ``asdl_seq *`` types as found
+in Python/asdl.c and Include/asdl.h:
-XXX Reformat: general discussion of basic blocks and compiler ideas (namespace
-generation, etc.), then discuss code structure and helper functions
-XXX Describe the structure of the code generator, the types involved,
-and the helper functions and macros.
-XXX Make sure flow of execution (namespace resolution, etc.) is covered after
-explanation of macros/functions
+- ``asdl_seq_new(int size)``
+ Allocate memory for an asdl_seq of length 'size'
+- ``asdl_seq_free(asdl_seq *)``
+ Free asdl_seq struct
+- ``asdl_seq_GET(asdl_seq *seq, int pos)``
+ Get item held at 'pos'
+- ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)``
+ Set 'pos' in 'seq' to 'val'
+- ``asdl_seq_APPEND(asdl_seq *seq, void *val)``
+ Set the end of 'seq' to 'val'
+- ``asdl_seq_LEN(asdl_seq *)``
+ Return the length of 'seq'
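
To show how a fixed-length sequence with this kind of interface behaves,
here is a minimal sketch. The ``toy_seq`` names and the struct layout are
hypothetical and chosen for the example; the real definitions are in
Python/asdl.c and Include/asdl.h and may differ in every detail.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int size;           /* number of slots, fixed at creation */
    int used;           /* next free slot, advanced by append */
    void **elements;
} toy_seq;

/* Allocate a sequence with 'size' empty slots; NULL on failure. */
static toy_seq *
toy_seq_new(int size)
{
    toy_seq *seq;
    assert(size >= 0);
    seq = malloc(sizeof(*seq));
    if (seq == NULL)
        return NULL;
    /* Allocate at least one slot so the pointer is always valid. */
    seq->elements = calloc(size > 0 ? (size_t)size : 1, sizeof(void *));
    if (seq->elements == NULL) {
        free(seq);
        return NULL;
    }
    seq->size = size;
    seq->used = 0;
    return seq;
}

/* Release the sequence itself; the stored items are not owned by it. */
static void
toy_seq_free(toy_seq *seq)
{
    if (seq != NULL) {
        free(seq->elements);
        free(seq);
    }
}

#define TOY_SEQ_LEN(seq)           ((seq)->size)
#define TOY_SEQ_GET(seq, pos)      ((seq)->elements[pos])
#define TOY_SEQ_SET(seq, pos, val) ((seq)->elements[pos] = (val))

/* Store 'val' in the next unused slot. */
static void
toy_seq_append(toy_seq *seq, void *val)
{
    assert(seq->used < seq->size);
    seq->elements[seq->used++] = val;
}
```

Items are stored as ``void *``, so the caller is responsible for knowing
which node type a given sequence holds.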
-- for each ast type (mod, stmt, expr, ...), define a function with a
- switch statement. inline code generation for simple things,
- call the function compiler_xx where xx is the kind of type in question for
- others.
+If you are working with statements, you must also worry about keeping
+track of what line number generated the statement. Currently the line
+number is passed as the last parameter to each stmt_ty function.
-The macros used to emit specific opcodes and to generate code for
-generic node types use string concatenation to produce calls to the
-appropriate C function for the type of the object.
-The VISIT macro generates code for an arbitrary node type, by calling
-an appropriate compiler_visit_TYPE function. The VISIT_SEQ macro
-calls the visit function for each element of a sequence. The VISIT
-macros take three arguments:
+Control Flow Graphs
+-------------------
- - the current struct compiler
- - the name of the node's type (expr, stmt, ...)
- - the actual node reference
+A control flow graph (often referenced by its acronym, CFG) is a
+directed graph that models the flow of a program using basic blocks that
+contain the intermediate representation (abbreviated "IR"; in this case,
+Python bytecode). A basic block is a block of IR that has a single
+entry point but possibly multiple exit points. The single entry point
+is the key to basic blocks; it all has to do with jumps. An entry point
+is the target of something that changes control flow (such as a function
+call or a jump) while exit points are instructions that would change the
+flow of the program (such as jumps and 'return' statements). What this
+means is that a basic block is a chunk of code that starts at the entry
+point and runs to an exit point or the end of the block.
+
+As an example, consider an 'if' statement with an 'else' block. The
+guard on the 'if' is a basic block which is pointed to by the basic
+block containing the code leading to the 'if' statement. The 'if'
+statement block contains jumps (which are exit points) to the true body
+of the 'if' and the 'else' body (which may be NULL), each of which is
+its own basic block. Both of those blocks in turn point to the
+basic block representing the code following the entire 'if' statement.
-The name of the node's type correspond to the names in the asdl
-definition. The string concatenation is used to allow a single VISIT
-macro to generate calls with the correct type.
+CFGs are usually one step away from final code output. Code is directly
+generated from the basic blocks (with jump targets adjusted based on the
+output order) by doing a post-order depth-first search on the CFG
+following the edges.
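
A post-order depth-first search over basic blocks can be sketched as
below. The ``block`` and ``postorder`` types and the two-successor
limit are assumptions of this example, not the structures used in
Python/newcompile.c.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 8

typedef struct block {
    int id;
    int seen;               /* set once the block has been visited */
    struct block *next;     /* fall-through successor, or NULL */
    struct block *target;   /* jump successor, or NULL */
} block;

typedef struct {
    block *order[MAX_BLOCKS];
    int count;
} postorder;

/* Visit every block reachable from 'b', recording each block only
   after all of its successors have been recorded. */
static void
dfs(block *b, postorder *out)
{
    if (b == NULL || b->seen)
        return;
    b->seen = 1;
    dfs(b->next, out);
    dfs(b->target, out);
    assert(out->count < MAX_BLOCKS);
    out->order[out->count++] = b;
}
```

For the 'if'/'else' shape described earlier (a guard block, two bodies,
and a join block), the join block is recorded first and the guard last.
Reading the recorded order backwards therefore places each block ahead
of its successors whenever the graph has no cycles.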
-- all functions return true on success, false on failure
-Code is generated using a simple, basic block interface.
+AST to CFG to Bytecode
+----------------------
- - each block has a single entry point
- * means code in basic block always starts executing at a single place
- * does not exclude multiple blocks pointing to the same entry point
- - possibly multiple exit points
- - when generating jumps, always jump to a block
- - for a code unit, blocks are identified by its int id
+With the AST created, the next step is to create the CFG. The first step
+is to convert the AST to Python bytecode without having jump targets
+resolved to specific offsets (these are calculated when the CFG goes to
+final bytecode). Essentially, this transforms the AST into Python
+bytecode, but with control flow represented by the edges of the CFG.
-Thus the basic blocks are used to model control flow through an application.
-This is often called a CFG (control flow graph). It is directed and can
-contain cycles in subgraphs since modeling loops does require it.
+Conversion is done in two passes. The first creates the namespace
+(variables can be classified as local, free/cell for closures, or
+global) and then creates the CFG using the namespace info. With that
+done, the second pass essentially flattens the CFG into a list and
+calculates jump offsets for final output of bytecode.
-Below are are macros and functions used for managing basic blocks:
+The conversion process is initiated by a call to the function in
+Python/newcompile.c::
-- NEW_BLOCK() -- create block and set it as current
-- NEXT_BLOCK() -- NEW_BLOCK() plus jump from current block
-- compiler_new_block() -- create a block but don't use it
- (used for generating jumps)
+ PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags);
-- There are five macros for generating opcodes in the current basic
- block. Each takes the current struct compiler * as the first
- argument and an opcode name as the second argument.
+This function does both the conversion of the AST to a CFG and
+outputting final bytecode from the CFG. The AST to CFG step is handled
+mostly by the two functions called by PyAST_Compile()::
- ADDOP(c, opcode) -- opcode with no arguments
- ADDOP_I(c, opcode, oparg) -- oparg is a C int
- ADDOP_O(c, opcode, oparg, namespace) -- oparg is a PyObject * ,
- namespace is the name of a code object member that contains
- the set of objects. For example,
- ``ADDOP_O(c, LOAD_CONST, obj, consts)``
- will make sure that obj is in co_consts and that the opcode
- argument will be an index into co_consts. The valid names
- are consts, names, varnames, ...
- ADDOP_NAME(c, op, o, type) -- XXX
+ struct symtable * PySymtable_Build(mod_ty, const char *,
+ PyFutureFeatures);
+ PyCodeObject * compiler_mod(struct compiler *, mod_ty);
- XXX Explain what each namespace is for.
+The former is in Python/symtable.c while the latter is in
+Python/newcompile.c .
- ADDOP_JABS(XXX) -- oparg is an absolute jump to block id
- ADDOP_JREL(XXX) -- oparg is a relative jump to block id
- XXX no need for JABS() and JREL(), always computed at the
- end from block id
+PySymtable_Build() begins by entering the starting code block for the
+AST (passed-in) and then calling the proper symtable_visit_xx function
+(with xx being the AST node type). Next, the AST is walked, with the
+various code blocks that delineate the reach of a local variable being
+entered and exited using::
+
+ static int symtable_enter_block(struct symtable *, identifier,
+ block_ty, void *, int);
+ static int symtable_exit_block(struct symtable *, void *);
+
+Once the symbol table is created, it is time for CFG creation, whose
+code is in Python/newcompile.c . This is handled by several functions
+that break the task down by various AST node types. The functions are
+all named compiler_visit_xx where xx is the name of the node type (such
+as stmt, expr, etc.). Each function receives a ``struct compiler *``
+and xx_ty where xx is the AST node type. Typically these functions
+consist of a large 'switch' statement, branching based on the kind of
+node type passed to it. Simple things are handled inline in the
+'switch' statement with more complex transformations farmed out to other
+functions named compiler_xx with xx being a descriptive name of what is
+being handled.
+
+When transforming an arbitrary AST node, use the VISIT macro::
+
+ VISIT(struct compiler *, <node type>, <AST node>);
+
+The appropriate compiler_visit_xx function is called, based on the value
+passed in for <node type> (so ``VISIT(c, expr, node)`` calls
+``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar,
+but is called on AST node sequences (those values that were created as
+arguments to a node that used the '*' modifier). There is also
+VISIT_SLICE just for handling slices::
+
+ VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty);
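
The dispatch performed by VISIT relies on preprocessor token pasting.
A minimal sketch of the idea follows; the macro body, the reduced
``struct compiler``, and the two stand-in node structs are assumptions
made for the example rather than the definitions in Python/newcompile.c.

```c
#include <assert.h>
#include <stddef.h>

struct compiler { int nvisited; };      /* hypothetical, much reduced */

struct _expr { int value; };            /* stand-in AST node types */
typedef struct _expr *expr_ty;
struct _stmt { expr_ty value; };
typedef struct _stmt *stmt_ty;

static int compiler_visit_expr(struct compiler *c, expr_ty e);
static int compiler_visit_stmt(struct compiler *c, stmt_ty s);

/* Paste TYPE onto "compiler_visit_" and make the calling function
   return 0 if the visit fails. */
#define VISIT(C, TYPE, V) do { \
    if (!compiler_visit_ ## TYPE((C), (V))) \
        return 0; \
} while (0)

static int
compiler_visit_expr(struct compiler *c, expr_ty e)
{
    if (e == NULL)
        return 0;               /* report failure to the caller */
    c->nvisited++;
    return 1;
}

static int
compiler_visit_stmt(struct compiler *c, stmt_ty s)
{
    assert(s != NULL);
    VISIT(c, expr, s->value);   /* expands to compiler_visit_expr(...) */
    c->nvisited++;
    return 1;
}
```

Because the macro contains a ``return``, it can only be used inside
functions that report failure by returning 0, which keeps error
propagation out of the body of every visitor.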
+
+Emission of bytecode is handled by the following macros:
+
+- ``ADDOP(struct compiler *c, int op)``
+ add 'op' as an opcode
+- ``ADDOP_I(struct compiler *c, int op, int oparg)``
+ add 'op' with an 'oparg' argument
+- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)``
+ add 'op' with the proper argument based on the position of obj in
+ 'type', but with no handling of mangled names; used for when you
+ need to do named lookups of objects such as globals, consts, or
+ parameters where name mangling is not possible and the scope of the
+ name is known
+- ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)``
+ just like ADDOP_O, but name mangling is also handled; used for
+ attribute loading or importing based on name
+- ``ADDOP_JABS(struct compiler *c, int op, basicblock b)``
+ create an absolute jump to the basic block 'b'
+- ``ADDOP_JREL(struct compiler *c, int op, basicblock b)``
+ create a relative jump to the basic block 'b'
+
+Several helper functions that emit bytecode are named compiler_xx(),
+where xx is what the function helps with (list, boolop, etc.). A rather
+useful one is::
+
+ static int compiler_nameop(struct compiler *, identifier,
+ expr_context_ty);
+
+This function looks up the scope of a variable and, based on the
+expression context, emits the proper opcode to load, store, or delete
+the variable.
+
+Handling of the line number on which a statement is defined is done by
+compiler_visit_stmt() and thus is not a worry.
+
+In addition to emitting bytecode based on the AST node, the creation of
+basic blocks must also be handled. Below are the macros and functions
+used for managing basic blocks:
+
+- ``NEW_BLOCK(struct compiler *)``
+ create block and set it as current
+- ``NEXT_BLOCK(struct compiler *)``
+ basically NEW_BLOCK() plus jump from current block
+- ``compiler_new_block(struct compiler *)``
+ create a block but don't use it (used for generating jumps)
+
+Once the CFG is created, it must be flattened and then final emission of
+bytecode occurs. Flattening is handled using a post-order depth-first
+search. Once flattened, jump offsets are backpatched based on the
+flattening and then a PyCodeObject is created. All of this is
+handled by calling::
+
+ PyCodeObject * assemble(struct compiler *, int);
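
Once blocks are in their output order, resolving jumps comes down to
simple arithmetic on block offsets. The sketch below illustrates that
step only; ``flat_block`` and the three helpers are hypothetical, and
measuring a relative jump from the end of the jumping block assumes the
jump is that block's last instruction.

```c
#include <assert.h>

typedef struct {
    int size;       /* bytes of code in this block */
    int target;     /* index of the block jumped to, or -1 for none */
    int offset;     /* filled in: start offset in the final output */
} flat_block;

/* Give each block its start offset; return the total code size. */
static int
assign_offsets(flat_block *blocks, int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++) {
        blocks[i].offset = total;
        total += blocks[i].size;
    }
    return total;
}

/* Argument for an absolute jump: the target block's offset. */
static int
jump_abs(const flat_block *blocks, int from)
{
    assert(blocks[from].target >= 0);
    return blocks[blocks[from].target].offset;
}

/* Argument for a relative jump: the distance from the end of the
   jumping block to the start of the target block. */
static int
jump_rel(const flat_block *blocks, int from)
{
    int end = blocks[from].offset + blocks[from].size;
    return jump_abs(blocks, from) - end;
}
```

Offsets must be assigned for every block before any jump argument is
computed, since a forward jump refers to a block that comes later in the
output.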
-- symbol table pass and compiler_nameop()
-XXX
Code Objects
------------
-XXX Describe Python code objects: fields, etc.
+In the end, one ends up with a PyCodeObject which is defined in
+Include/code.h . And with that you now have executable Python bytecode!
-Files
------
+
+Modified Files
+--------------
+ Parser/
@@ -224,7 +400,8 @@
Language." Uses SPARK_ to parse the ASDL files.
- asdl_c.py
- "Generate C code from an ASDL description."
+ "Generate C code from an ASDL description." Generates
+ ../Python/Python-ast.c and ../Include/Python-ast.h .
- spark.py
SPARK_ parser generator
@@ -232,10 +409,10 @@
+ Python/
- Python-ast.c
- Creates C structs corresponding to the ASDL types. Also contains code
- for marshaling AST nodes (core ASDL types have marshaling code in
- asdl.c).
- "File automatically generated by ../Parser/asdl_c.py".
+ Creates C structs corresponding to the ASDL types. Also
+ contains code for marshaling AST nodes (core ASDL types have
+ marshaling code in asdl.c). "File automatically generated by
+ ../Parser/asdl_c.py".
- asdl.c
Contains code to handle the ASDL sequence type. Also has code
@@ -243,7 +420,7 @@
identifier. used by Python-ast.c for marshaling AST nodes.
- ast.c
- Converts Python's concrete syntax tree into the abstract syntax tree.
+ Converts Python's parse tree into the abstract syntax tree.
- compile.txt
This file.
@@ -264,10 +441,6 @@
- ast.h
Declares PyAST_FromNode() external (from ../Python/ast.c).
-Known Bugs/Issues
------------------
-
-XXX
ToDo
----
@@ -275,8 +448,7 @@
+ Grammar support (Parser/Python.asdl, Parser/asdl_c.py)
- decorators
- empty base class list (``class Class(): pass``)
- - AST->Python object support
-+ CST->AST support (Python/ast.c)
++ parse tree->AST support (Python/ast.c)
- decorators
- generator expressions
+ AST->bytecode support (Python/newcompile.c)
@@ -287,23 +459,27 @@
- rewrite compiler package to mirror AST structure?
+ Documentation
- flesh out this doc
- * compiler concepts covered
- * structure and flow of all steps clearly explained
- * break up into more sections/subsections
+ * byte stream output
+ Universal
- make sure entire test suite passes
- fix memory leaks
- make sure return types are properly checked for errors
+ - no gcc warnings
References
----------
+.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
+ `Compilers: Principles, Techniques, and Tools`,
+ http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108
+
.. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
S. Serra. `The Zephyr Abstract Syntax Description Language.`_
In Proceedings of the Conference on Domain-Specific Languages, pp.
213--227, 1997.
.. _The Zephyr Abstract Syntax Description Language.:
- http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97-abstract.html.
+ http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html
.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/
+