SeeGramWrap, a C parser

Fri Mar 5 08:43:26 EST 2004

I have uploaded "SeeGramWrap-03.02.2004.tgz" to my webpage at 
"http://members.tripod.com/~edcjones/pycode.html". SeeGramWrap parses a 
piece of C code and the resulting parse tree is output in man and 
machine readable form. The result can be used for program 
transformations. Since a particular trnsformation algorithm may not 
require all the information present in the tree, the user can select 
what to output.

This program has been written and tested only under linux.

Thanks to John Mitchell and Monty Zukowski for "cgram.tgz". Every parser 
generator need to have a good C grammar. Also thanks to Terrence Parr 
for ANTLR (http://www.antlr.org/).

Note: I have also uploaded
   pydocs.tar.gz which searchs the Python documentation.
   linenum.py which prints a linenumber and a message. Useful for debugging.

========================================================================
                                CONTEXT

A number of large C libraries have been wrapped so they can be called by 
Python. The wrapping code is repetitive and there may be a lot of it so 
methods have been developed for automated wrapping.

The best-known approach is SWIG (http://www.swig.org/). For complex 
wrappings, SWIG requires the writing of "typemaps", an unintuitive 
process where pieces of C code you write are spliced into the wrapper 
code generated by SWIG.

Another wrapper related approach is Pyrex which is found at
         http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
Pyrex has its own repetitive boilerplate that has to be written. But the 
Pyrex boilerplate is so straightforward that it can be taught 
algorithmically. See "Michael's Quick Guide to Pyrex" at 
"http://ldots.org/pyrex-guide/".

I think that the Pyrex boilerplate is _so_ straightforward that it can 
be machine generated. Therefore I have been sporatically developing 
software to do this. A thoroughly buggy version of this is on my web 
page, "http://members.tripod.com/~edcjones". It is called "cgram.tar.gz" 
(The name will be changed). Look at it but don't use it. "SeeGramWrap" 
is a major revision of the front end of "cgram.tr.gz".

I think the automatic-wrapper program can be made to work. It might be 
easier to use than SWIG. It is still a lot of work to prepare complex C 
header files. What we have is really a "program transformation" or "tree 
transformation" problem.

I think some of the issues are:

1. Since parser generators have a long and steep learning curve, I 
prefer to use them as black boxes which generate parsers which output 
results that I can analyze using Python. The parser created by a parser 
generator should output trees in two formats: one easy to look at and 
another that a program can easily read. For examples, see below.

2. I find trees very easy to work with. I want the trees to be front and 
center and highly visible. I prefer to "manipulate a tree" rather than 
"fire a rule".

3. The most common type of C macro has a type as one of its arguments:

     #define CAST(x, type) (type *) x

How can these be automatically wrapped for Python which is a dynamically 
typed language?

888888888888888888888888888888888888888888888888888888888888888888888888
                          TECHNICAL OVERVIEW

I use some C grammars associated with ANTLR. The grammar package is 
called "cgram". See "http://www.antlr.org/resources.html".

In "cgram" there is a java program "TestThrough.java" which parses C 
code into an AST then runs a tree grammar on the AST and outputs the 
original code. The tree grammar is named "GnuCEmitter.g". I work with 
this grammar because the terminal tokens are printed in the correct 
order. I modified the grammar turning it into a template. A piece of the 
original "GnuCEmitter.g" is:

----
typeQualifier
         :       a:"const"                       { print( a ); }
         |       b:"volatile"                    { print( b ); }
         ;
----

The modified version is:

----
typeQualifier
         :       a:"const"                       { <@ a @> }
         |       b:"volatile"                    { <@ b @> }
         ;
----

In this template, strings of the form "<@ ... @>" will each be replaced 
by a set of print statements. Moreover the entire rule will be wrapped 
by prints.  The template is used in "emitter/insert_prints.py". If 
"insert_prints.py" is run the result is:

----
typeQualifier
   { if ( inputState.guessing==0 ) {
           print(Open);
           print("typeQualifier");
        }
     }
         :  (
                 a:"const"        {  print(Open); 
print("typeQualifier.0"); print( a ); print(Close); }
         |       b:"volatile"     {  print(Open); 
print("typeQualifier.1"); print( b ); print(Close); }
            )
   { currentOutput.print(Close + MyTokenSep); }
         ;
----

If the original C program , "temp2.c", is

     char* s = "ab";

The output of the modified emitter grammar is "temp2.c.data":

----
     <<OPEN>>                 <<OPEN>>                 <<OPEN>>
     externalList             declarator               expr
     <<OPEN>>                 <<OPEN>>                 <<OPEN>>
     externalDef              pointerGroup             primaryExpr
     <<OPEN>>                 <<OPEN>>                 <<OPEN>>
     declaration              pointerGroup.0           stringConst
     <<OPEN>>                 *                        <<OPEN>>
     declSpecifiers           <<CLOSE>>                stringConst.0
     <<OPEN>>                 <<CLOSE>>                "ab"
     typeSpecifier            <<OPEN>>                 <<CLOSE>>
     <<OPEN>>                 declarator.0             <<CLOSE>>
     typeSpecifier.1          s                        <<CLOSE>>
     char                     <<CLOSE>>                <<CLOSE>>
     <<CLOSE>>                <<CLOSE>>                <<CLOSE>>
     <<CLOSE>>                <<OPEN>>                 <<CLOSE>>
     <<CLOSE>>                initDecl.0               <<CLOSE>>
     <<OPEN>>                 =                        ;
     initDeclList             <<CLOSE>>                <<CLOSE>>
     <<OPEN>>                 <<OPEN>>                 <<CLOSE>>
     initDecl                 initializer              <<CLOSE>>
----

This output can be processed by "tree.py" to produce "temp2.c.nest"

----
(externalList,
   (externalDef,
     (declaration,
       (declSpecifiers,
         (typeSpecifier,
           (typeSpecifier.1, |char|))),
       (initDeclList,
         (initDecl,
           (declarator,
             (pointerGroup,
               (pointerGroup.0, |*|)),
             (declarator.0, |s|)),
           (initDecl.0, |=|),
           (initializer,
             (expr,
               (primaryExpr,
                 (stringConst,
                   (stringConst.0, |"ab"|))))))), |;|)))
----

or "temp2.c.src":

     char * s = "ab" ;

If "temp2.c.src" is put through the entire process itself we get 
"temp2.c.src.src" which is identical to "temp2.c.src". This test is done 
by "docheck.py".

In the ".data" or ".nest" files the tokens from the original C code are 
in the correct order. It is easy to recover

     ('char', '*', 's', '=', '"ab"', ';')