Confusion about dictionaries - keys use value or identity?

Tim Peters tim.one at home.com
Sun Jul 8 23:24:06 EDT 2001


[Roy Smith]
> ...
> I'm writing a parser for a kind of text file we use a lot.  The main
> loop  of the parser is a state machine.  In each state, the current
> input line is  tested to see if it matches one or more regular
> expressions, and based on  which (if any) it matches, you do some
> processing and advance to the next state.  Something like:
>
>    state = "cvsid"
>    for line in inputfile.readlines():
>       if state == "cvsid":
>          if re.match ("^\s*$\s*Id:([^$]*)$\s*$", line):

Always use r-strings for regexps!  You're going to get "mysterious failures"
if you don't, as time goes on and you change the regexps.  For example, add
a \b to the above to match at a word boundary, and it's actually going to
insist on matching a backspace character instead.  It's just an accident
that "\s" yields a two-character string and leaves the backslash intact,
while "\b" yields chr(8).  r-strings are WYSIWYG in all cases.

>             state = "header"
>       elif state == "header":
>          if re.match ("big ugly pattern", line):
>             do some stuff
>             state = "what comes after the header"
>
> and so on.  I'll end up with about a dozen different states, with
> perhaps 2 dozen different regex's.  Here's the dilemma.  If I write
> it as I did above, the regex's will get compiled each time they're
> evaluated, which is clearly inefficient in the extreme.

The good news is that this part isn't true:  all versions of Python have
always maintained an internal compiled-regexp cache, mapping regexp strings
to their compiled forms, much like the one you sketched building "by hand"
later.  Current Python maintains a cache of 100 pairs.
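
You can see the cache at work from the outside with a rough timing sketch
(the line and pattern below are made up, and exact numbers will vary, but
the string-passing form doesn't pay a full compile on every call):

    import re
    import timeit

    LINE = "  $Id: parser.py,v 1.3 2001/07/08 10:00:00 roy Exp $  "
    PATTERN = r"^\s*\$\s*Id:([^$]*)\$\s*$"
    COMPILED = re.compile(PATTERN)

    def via_module_function():
        # re.match() finds PATTERN in re's internal cache of compiled
        # regexps, so the pattern is not recompiled on each call.
        return re.match(PATTERN, LINE)

    def via_compiled_object():
        # Reusing an explicitly compiled pattern skips even the cache lookup.
        return COMPILED.match(LINE)

    print(timeit.timeit(via_module_function, number=100000))
    print(timeit.timeit(via_compiled_object, number=100000))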

The bad news is that I recommend the alternative anyway:

> The alternative would be to re.compile() all  the regex's once, at the
> top, then use the stored programs,

The great *advantage* to this is that you wouldn't be so reluctant to
exploit re.VERBOSE mode if the regexps were compiled at module level.  The
clarity added by using whitespace and inline comments to document the
*intent* of regexp gibberish is of overwhelming maintenance value.

> something like this:
>
>    cvsPattern = re.compile ("^\s*$\s*Id:([^$]*)$\s*$")
>          [...]
>          if cvsPattern.match (line):
>             state = "header"
>
> and so on.  This would certainly work, but it moves the regex's away
> from where they are used, making (IMHO) the program more difficult to
> read and understand.

re.VERBOSE mode regexps can be much easier to read and understand; so long
as you leave them inline, you're going to tend (as your examples *do*) to
squash them into as little space as possible with no explanation at all.
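
For instance, a module-level verbose version of a CVS-id matcher might look
like this (a sketch, not a drop-in for your exact format; note the dollar
signs are escaped here so they match literal "$" characters):

    import re

    # Matches a CVS "$Id: ...$" line; group 1 captures the keyword text.
    cvs_id_pattern = re.compile(r"""
        ^ \s*
        \$ \s* Id:       # the opening "$Id:" keyword marker
        ( [^$]* )        # everything up to the closing dollar sign
        \$ \s* $         # the closing dollar sign, then end of line
        """, re.VERBOSE)

Compiled once, given a descriptive name, and the comments travel with the
pattern instead of getting squashed out of the code that uses it.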

> It reminds me of the bad old days when we would collect all
> our FORTRAN format statements at the back of the deck.  See
> http://www.python.org/doc/Humor.html#habits for why I don't want
> to do that any more :-)

There's one huge difference:  FORMAT statements were identified by
meaningless integers, but you can give compiled regexp objects descriptive
names instead.

BTW, abstract away a little more, and you can likely build an entirely
table-driven state machine for this task, with simple and uniform code that
merely walks the table.  Nested dicts make great state-machine tables.
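
Just as a sketch of the shape (the states and patterns here are made up, and
I've used a list of rules per state rather than a literal nested dict, since
the regexps for a state have to be tried in a definite order):

    import re

    def handle_cvs_id(match):
        print("CVS id: " + match.group(1).strip())

    def handle_header(match):
        print("header: " + match.group(0))

    # state -> list of (compiled pattern, action, next state) rules
    TRANSITIONS = {
        "cvsid": [
            (re.compile(r"^\s*\$\s*Id:([^$]*)\$\s*$"), handle_cvs_id, "header"),
        ],
        "header": [
            (re.compile(r"big ugly pattern"), handle_header, "body"),
        ],
        "body": [],
    }

    def run(lines):
        state = "cvsid"
        for line in lines:
            for pattern, action, next_state in TRANSITIONS[state]:
                m = pattern.match(line)
                if m:
                    action(m)
                    state = next_state
                    break

The driver loop never changes as the format grows; only the table does.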
