Assembler Parser/Lexer in Python

Fri Nov 7 00:27:28 EST 2003

Simon Foster wrote:

> Anyone have any experience or pointers to how to go about creating
> a parser/lexer for assembler in Python.  I was thinking of using PLY
> but wonder whether it's too heavyweight for what I want.  Anyone have 
> any thoughts?

There are, of course, lots of tools available to help you out with
this, but as you probably realize, most of them have heavyweight
features which help out for higher-level languages, but not really
so much for assembler.  Also, most of them will probably not give you
great help for assembly language macros, which are typically more
full-featured than C macros, in that they know something about
the actual program being built.  Please note that I am _not_
cross-posting this to comp.lang.lisp, and also that if you want
to parse pre-existing assembly language, it will look _nothing_
like Lisp, so Lisp's built-in parser wouldn't help you out in any
case.  Also note that YMMV, but while I use macros _extensively_
in assembly language, I have personally never felt the need for
any sort of macro processor in Python :)

I had a similar problem, in maintaining a system with over 10MB of
crufty ancient assembly language.  I had conflicting goals of wanting
to use Python so I could easily and correctly do different things with
the source code (code rewriting, automatic HTML generation, some
lint-like operations, etc.) and wanting operations to complete rapidly
so that I could do some of it in the typical Python experimental mode.

I wrote a lexer using a tiny bit of C and a Pyrex wrapper.  The
partitioning was such that the C code knows nothing about Python,
and the Pyrex interface handles the higher layers of the tokenization.

The lexer performs a single tokenization pass over an entire file
(with the Pyrex code calling the C code once per line) and returns a
list of token tuples (one tuple per line).  Macro lines which invoke
text-pasting operations are flagged, and the lexer is re-run on these
lines when they are encountered at parse time.
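
To make the shape of that concrete, here is a rough pure-Python sketch
of the same idea.  It is not the C/Pyrex code.  The token pattern, the
';' comment character, the backslash paste marker and the names
tokenize_line, tokenize_file and parse are all assumptions made up for
illustration, since assembler dialects differ a lot.

```python
import re

# Hypothetical token pattern: identifiers, numbers, or any single
# non-blank character.  A real assembler dialect would need more.
TOKEN_RE = re.compile(r"[A-Za-z_.$][\w.$]*|\d\w*|\S")

# Assumed marker for a text-pasting macro operator; dialects differ.
PASTE_MARKER = "\\"


def tokenize_line(line):
    """Return (needs_relex, tokens) for one source line."""
    # Assumes ';' starts a comment and never appears inside a string.
    code = line.split(";", 1)[0]
    needs_relex = PASTE_MARKER in code
    return needs_relex, tuple(TOKEN_RE.findall(code))


def tokenize_file(text):
    """Single pass over the whole file: one tuple per line."""
    return [tokenize_line(line) for line in text.splitlines()]


def parse(text, expand=lambda line: line):
    """Toy parse loop: flagged lines are expanded, then lexed again."""
    lines = text.splitlines()
    result = []
    for line, (needs_relex, tokens) in zip(lines, tokenize_file(text)):
        if needs_relex:
            # The first tokenization of a flagged line is discarded.
            _, tokens = tokenize_line(expand(line))
        result.append(tokens)
    return result
```

The point of the flag is that the up-front pass stays cheap and
context-free, and only the (hopefully few) lines whose text changes
under macro expansion pay for a second trip through the lexer.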

A separate (and very simple!) Python script generates a .h file
which contains the lowest-level lexer tables.
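
A table generator of that kind might look something like the sketch
below.  The character classes, the char_class array name and the
LEXTABLE_H guard are invented for the example; the real tables are
not shown here, so treat this only as an illustration of the approach.

```python
import string

# Hypothetical character classes; the real lexer's classes may differ.
CLASSES = ["CC_OTHER", "CC_SPACE", "CC_ALPHA", "CC_DIGIT"]


def classify(ch):
    """Map one character to the name of its (assumed) class."""
    if ch in " \t":
        return "CC_SPACE"
    if ch in string.ascii_letters or ch in "_.$":
        return "CC_ALPHA"
    if ch in string.digits:
        return "CC_DIGIT"
    return "CC_OTHER"


def make_header(guard="LEXTABLE_H"):
    """Return the text of a C header holding a 256-entry class table."""
    out = ["#ifndef %s" % guard, "#define %s" % guard, ""]
    out.append("enum { %s };" % ", ".join(CLASSES))
    out.append("")
    out.append("static const unsigned char char_class[256] = {")
    for row in range(0, 256, 8):
        # Eight entries per output line keeps the header readable.
        names = [classify(chr(c)) for c in range(row, row + 8)]
        out.append("    " + ", ".join(names) + ",")
    out.append("};")
    out.append("")
    out.append("#endif")
    return "\n".join(out) + "\n"
```

Writing the result of make_header() to a file before compiling the C
code keeps the table definitions in Python, where they are easy to
change, while the C side only ever sees a constant array.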

I did an earlier version of this in mxTextTools, which is not too
bad for such a thing if a) for whatever reason, you don't want to
write your own C extensions, and b) you're not doing too much
maintenance on the actual lexer.

I also played around with re, but if you do that, you will quickly
come to realize why lexer generators are popular :)
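
For reference, the usual way to build a lexer on re is one big
alternation of named groups, roughly as sketched below.  The token
classes here are made up for an imaginary assembler dialect, and the
function name lex is just for the example.

```python
import re

# Hypothetical token classes; order matters, since the first
# alternative that matches at a position wins.
TOKEN_SPEC = [
    ("COMMENT", r";.*"),
    ("STRING",  r"'[^']*'"),
    ("NUMBER",  r"0[xX][0-9A-Fa-f]+|\d+"),
    ("IDENT",   r"[A-Za-z_.$][\w.$]*"),
    ("PUNCT",   r"[,:()\[\]+\-*/]"),
    ("SPACE",   r"[ \t]+"),
    ("ERROR",   r"."),          # catch-all so nothing is skipped silently
]
MASTER_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def lex(line):
    """Return a list of (kind, text) pairs for one line (no newline)."""
    tokens = []
    for m in MASTER_RE.finditer(line):
        kind = m.lastgroup
        if kind in ("SPACE", "COMMENT"):
            continue
        if kind == "ERROR":
            raise ValueError("unexpected %r at column %d"
                             % (m.group(), m.start()))
        tokens.append((kind, m.group()))
    return tokens
```

This works fine at this size.  The trouble starts when the pattern
list grows: every new token class has to be slotted into the right
place in the alternation by hand, and a mistake in the ordering just
quietly produces the wrong tokens.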

More recently, I played a little bit with psyco.  If I didn't care
quite as much about speed and didn't already have the C code, I
might consider one of the existing parser/lexer generators in
conjunction with psyco.  Unfortunately, I've only dabbled in
a very minor way with some of these packages, so I couldn't begin
to compare their strengths and weaknesses.

Hope this helps.

Pat