[pypy-dev] Language Parser Ontology

anatoly techtonik techtonik at gmail.com
Mon Mar 9 19:43:15 CET 2015


I'll start from afar, so that it will be easier to understand what I
am thinking about.

CFFI uses pycparser, which parses C files but relies on a C compiler
to strip comments and process defines first. Almost all .c files
contain comments, so on its own pycparser is basically useless
as a parser, though maybe it has a good API for working with the AST.

Anyway, I tried to see if I could teach pycparser to strip
comments itself, and in c_lexer.py I found a list of tokens,
among which there was no token representing the comment
start. Abridged list:

    ##
    ## All the tokens recognized by the lexer
    ##
    tokens = keywords + (
        # Identifiers
        'ID',

        # Type identifiers (identifiers previously defined as
        # types with typedef)
        'TYPEID',

        # constants
        'INT_CONST_DEC', 'INT_CONST_OCT', 'INT_CONST_HEX',
        'FLOAT_CONST', 'HEX_FLOAT_CONST',
        'CHAR_CONST',
        'WCHAR_CONST',
        ...

So I thought that I need to add names for the tokens
corresponding to the comment start (// and /*) and end (*/),
and that it would be better if the token names were somewhat
common among parsers, so that people looking at a token
could immediately recognize that it is comment related.
Apparently, proper naming is a little bit ambiguous for
automated processing. Editors like Spyder could also
benefit from information about tokens and their meaning in
different programming languages. The processing of text
comments caught from the parsing stream is the
same for any language and could be IDE independent.
Right now you can't just reuse language definitions
(such as ASDL) to feed the IDE so that it can
automatically figure out which parts of the text it can attach
its functions to.
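
To illustrate what comment tokens could look like in a lexer, here is a
minimal sketch using only Python's re module. It is not pycparser code and
does not use its API. The token names COMMENT_LINE and COMMENT_BLOCK are
hypothetical, as are the helpers find_comments and strip_comments. String
and character literals are matched first, so that a // inside a literal is
not taken for a comment. It emits whole comments rather than separate
start and end tokens, which is a simplification.

```python
import re

# Hypothetical token names; pycparser's c_lexer.py does not define
# COMMENT_LINE or COMMENT_BLOCK. Literals come first so that comment
# markers inside them are skipped.
TOKEN_SPEC = [
    ("STRING_LITERAL", r'"(?:\\.|[^"\\\n])*"'),
    ("CHAR_CONST", r"'(?:\\.|[^'\\\n])*'"),
    ("COMMENT_BLOCK", r"/\*.*?\*/"),
    ("COMMENT_LINE", r"//[^\n]*"),
]
MASTER = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # let a block comment span several lines
)


def find_comments(source):
    """Yield (token_name, text) for every comment found in C source."""
    for match in MASTER.finditer(source):
        if match.lastgroup.startswith("COMMENT_"):
            yield match.lastgroup, match.group()


def strip_comments(source):
    """Replace every comment with one space; leave literals untouched."""
    def replace(match):
        if match.lastgroup.startswith("COMMENT_"):
            return " "
        return match.group()
    return MASTER.sub(replace, source)
```

Calling list(find_comments(text)) gives the comment tokens in order, and
strip_comments(text) gives the text a parser could consume. This sketch
ignores preprocessor details such as line continuations and trigraphs.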

I read that ontologies are a way to express relations between
objects in this automatic way, as triples. Like:

  COMMENTSTART is a TOKEN
  COMMENTSTART starts a COMMENT

And I wonder, has anybody tried to apply this ontology
stuff to designing and analysing computer languages?
If yes, maybe there are some databases with such
information about parsers. I would like to query the names of
all tokens that represent a program comment.
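
The kind of query I mean can be sketched in plain Python, with triples
stored as (subject, predicate, object) tuples. Everything here is a made-up
illustration: the COMMENTEND and ID entries, the predicate strings and the
function names are assumptions, not part of any existing ontology or
parser database.

```python
# Hypothetical triples in (subject, predicate, object) form.
TRIPLES = {
    ("COMMENTSTART", "is a", "TOKEN"),
    ("COMMENTSTART", "starts a", "COMMENT"),
    ("COMMENTEND", "is a", "TOKEN"),
    ("COMMENTEND", "ends a", "COMMENT"),
    ("ID", "is a", "TOKEN"),
}


def query(triples, subject=None, predicate=None, obj=None):
    """Return the sorted triples matching every field that is not None."""
    return sorted(
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


def comment_tokens(triples):
    """Names of all tokens that have some relation to COMMENT."""
    tokens = {s for s, _, _ in query(triples, predicate="is a", obj="TOKEN")}
    related = {s for s, _, _ in query(triples, obj="COMMENT")}
    return sorted(tokens & related)
```

With the triples above, comment_tokens(TRIPLES) returns the names
COMMENTEND and COMMENTSTART and leaves out ID. A real ontology store would
use a proper triple format and query language. This only shows the shape
of the question.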

-- 
anatoly t.

