[Cython] Towards a formal Cython grammar

Robert Bradshaw robertwb at gmail.com
Sun Aug 24 07:46:37 CEST 2014


I've started playing around with writing a grammar for Cython. As well
as formally defining the language (in particular, with respect to
Python), this should allow us to eventually move to using
parser generators rather than ad-hoc hand-written code, and be useful
to external tools (e.g. IDEs, linters, syntax highlighters).

I've posted what I have at
https://github.com/robertwb/cython/tree/grammar , in particular

https://github.com/robertwb/cython/blob/grammar/Cython/Parser/Grammar
https://github.com/robertwb/cython/compare/robertwb:3aa9056943f83a68cc9d9335f8a9c81e9a6f3f91...363bc162fd626203f832a33b6c736ff8b10f6086#diff-8b69afcfc588fde3d763f2ec670e42c2L1

Nothing beyond generating the raw parse tree is hooked up yet, and
even that requires an extra directive (formal_grammar=True). Building
and using the parser requires a source-built Python (it looks for the
pgen artifact Python uses to compile the grammar). We may or may not
want to stick with this approach long term (though if we do we might
ship the generated .c files). This parse tree is still pretty
low-level; we might also want to create something like Python.asdl to
give us a (closer) AST.

The grammar isn't complete, but should cover most of the language
(over 3/4 of the test suite passes, and that explores a lot of the
corner cases). The most notable omissions stem from using Python's
lexer, which has no token for '?' and doesn't know the additional
string literal prefixes or int literal suffixes. Also, as the lexer
clearly doesn't understand includes, these are handled by inlining in a
preparsing step (which messes with line numbers).
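
As a rough illustration of that preparsing step (a sketch only, not the
code in the branch), includes can be inlined while recording where each
output line came from, so that line numbers could be mapped back later.
The helper name inline_includes and the dict-of-sources input are
assumptions made to keep the example self-contained; it ignores
indentation of included text and does not detect include cycles.

```python
import re

# Hypothetical helper, not taken from the grammar branch.
# Matches a whole-line statement such as:  include "defs.pxi"
INCLUDE_RE = re.compile(r'^\s*include\s+(["\'])(?P<path>.+?)\1\s*$')


def inline_includes(name, sources):
    """Inline include statements found in sources[name].

    sources maps a file name to its text (an assumption for this sketch).
    Returns (lines, origins) where origins[i] is the (file, line number)
    that output line i originally came from.
    """
    lines, origins = [], []
    for lineno, line in enumerate(sources[name].splitlines(), start=1):
        match = INCLUDE_RE.match(line)
        if match:
            # Splice in the included file, keeping its own origins.
            sub_lines, sub_origins = inline_includes(match.group("path"), sources)
            lines.extend(sub_lines)
            origins.extend(sub_origins)
        else:
            lines.append(line)
            origins.append((name, lineno))
    return lines, origins


sources = {
    "main.pyx": 'x = 1\ninclude "defs.pxi"\ny = 2\n',
    "defs.pxi": "cdef int a\ncdef int b\n",
}
lines, origins = inline_includes("main.pyx", sources)
print(origins[3])  # ('main.pyx', 3): output line 4 was line 3 of main.pyx
```

Keeping such a table next to the inlined text is one way an error
reported against the merged source could be pointed back at the right
file and line.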

The grammar could be tightened up as well. For example, this grammar
doesn't distinguish between valid pxd vs. pyx constructs, and allows
cdef statements within if statements (or even normal class
declarations).

I tried to restrict the existing grammar as little as possible; in
particular, the only new illegal identifiers are 'cdef', 'ctypedef' and
(for ambiguity reasons) 'new' and 'sizeof'. Also, the "cython" keywords
may not be used for identifiers that might be typed (e.g. "def
foo(int): ..." is not allowed).
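
For external tools, the first restriction amounts to a small check. The
sketch below is an assumption about how a linter might encode it (the
names NEW_RESERVED and is_legal_identifier are made up here); it does
not cover the second restriction on type names in typed positions.

```python
import keyword

# The four names listed above as newly illegal identifiers.
NEW_RESERVED = frozenset({"cdef", "ctypedef", "new", "sizeof"})


def is_legal_identifier(name):
    """True if name is a Python identifier that is neither a Python
    keyword nor one of the four newly reserved names."""
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and name not in NEW_RESERVED
    )


print(is_legal_identifier("sizeof"))  # False
print(is_legal_identifier("size"))    # True
```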

The most significant departure from the existing "grammar" is that
rather than using C-style declarators, cdef declarations are of the
form "cdef [type] name". Thus one writes the (already legal)

    cdef double[3] loc

and

    cdef double* a, b

declares two pointers. How to handle function pointers is still up in
the air, but I wouldn't be opposed to moving to a new syntax (e.g.
"(double, void*) -> double", inspired by Py3) for those. The new grammar
also disallows empty "declarators" for parameters of function
declarations (though we could consider adding this back).
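
To make the declarator change concrete, here is a minimal sketch of a
translator for the simplest cases. It is not the sed script from the
branch; the function name to_type_first and both regular expressions
are assumptions for illustration. It only rewrites single-word base
types whose names all carry the same pointer or array declarator, and
leaves every other line untouched.

```python
import re

# Hypothetical translator, written for this example only.
DECL_RE = re.compile(r"^(?P<indent>\s*)cdef\s+(?P<base>\w+)\s+(?P<rest>.+)$")
ITEM_RE = re.compile(r"^(?P<stars>\**)\s*(?P<name>\w+)\s*(?P<dims>(?:\[\w*\])*)$")


def to_type_first(line):
    """Rewrite 'cdef double *a, *b' as 'cdef double* a, b' and
    'cdef double loc[3]' as 'cdef double[3] loc'."""
    decl = DECL_RE.match(line)
    if decl is None:
        return line
    items = [ITEM_RE.match(part.strip()) for part in decl.group("rest").split(",")]
    if any(item is None for item in items):
        return line  # initialisers, multi-word types, functions, classes, ...
    if any(item.group("stars") and item.group("dims") for item in items):
        return line  # pointer plus array on one name: leave for a human
    suffixes = {item.group("stars") + item.group("dims") for item in items}
    if len(suffixes) != 1:
        return line  # mixed declarators such as 'double *a, b' need splitting
    names = ", ".join(item.group("name") for item in items)
    return f"{decl.group('indent')}cdef {decl.group('base')}{suffixes.pop()} {names}"


print(to_type_first("cdef double *a, *b"))  # cdef double* a, b
print(to_type_first("cdef double loc[3]"))  # cdef double[3] loc
print(to_type_first("cdef double *a, b"))   # unchanged: cdef double *a, b
```

Refusing to touch anything ambiguous is deliberate: a mixed line like
the last one would change meaning under the new rules, so it is better
flagged than silently rewritten.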

I think this would also be a nice chance to simplify the grammar, so
there are some intentional omissions. Most notably, there are several
modifiers (e.g. the __cdecl, __stdcall, __fastcall callspecs, maybe
inline, maybe even "with nogil", and the "cdef class Foo [object
object_struct_name, type type_object_name ]" spec for external
classes) that would make more sense as decorators. This would be
backwards incompatible, but these are not commonly used features, and
we could give fair warning (or even provide translators; I included a
sed script to deal with the common case of declarators mentioned
above). I think it could be worth the simplification.

Thoughts?

- Robert

