Plex 0.1 - A Lexical Analysis Module

Wed Feb 9 23:22:03 EST 2000

[posted & mailed]

[on Greg Ewing's nascent "traditional DFA" lexer, at
 http://www.cosc.canterbury.ac.nz/~greg/python/Plex
]

[Tim, suggests "direct computation of regular language intersection,
 complement, difference, and reversal, as well as regexpy unions
 and catenations"
]

[Greg]
> Tell me more about how to implement these, and I'll
> see what I can do. If it's just a matter of gluing
> NFAs together, it should be quite easy.

Easy and (honest!) fun (btw, while it would be awfully nice for the DFA
match function to get coded in C, the manipulations going into building a
DFA are probably better left in Python -- "DFA compile time" isn't at all
crucial for the most pressing uses).

Here are implementation sketches.  A core notion is the concept of a
"complete" DFA.  A DFA is complete iff for every state S and every symbol s,
there's an explicit transition from S on s.  If a DFA is not complete, it
can be made complete by introducing a "sink state" K, which is just a
non-accepting state that transitions to itself on every symbol.  Then direct
all the "missing" symbol+state transitions to K.

A complete DFA is wholly explicit about what happens in every case, which is
crucial in getting complementation to work correctly.  I'm going to assume
that all inputs are complete DFAs below (it greatly simplifies reasoning --
in practice, you can cheat; in particular, it's *usually* fine if an input
is an NFA, so I'll mix terminology a little).

Complement:  For each state, if it's non-accepting make it accepting, and
vice versa.  Done!  The transitions and set of start states don't change.

Intersection of DFAs A and B:  The usual construction is to simulate running
A & B in parallel, building a new NFA whose states are the Cartesian product
of A's states and B's states, and where a new state is accepting iff its
constituent states were both accepting.  The whole bit is obvious after a
little thought.  A much slicker way is to compute

    complement(union(complement(A), complement(B)))

!  Not only pretty, but much harder to screw up <wink>.

Union of DFAs A and B:  Introduce a unique new start state S.  Introduce
epsilon transitions from S to all the start states in A and in B.  Done.

Difference of DFAs A and B (recognize everything recognized by A but not
recognized by B):  intersection(A, complement(B)).

Reversal (accept strings that are the reversals of the strings accepted by
the input DFA):  (1) Reverse the arrows (that is, if S transitioned to T on
symbol s, afterwards T transitions to S on s).  (2) Set of start states
becomes the old set of accepting states.  (3) Set of accepting states
becomes the old set of start states.  Note that the result is very likely to
be an NFA.

Reversal is especially interesting because of a beautiful and unexpected
result due to Brzozowski:  modulo some fiddling to get rid of sink states
and unreachable states, a *minimal* DFA can be constructed from an NFA A via

   make_dfa(reverse(make_dfa(reverse(A))))

where make_dfa is the usual NFA->DFA "subset" construction.  Very pretty,
impossible to get wrong <wink>, but generally (in my experience) slower than
the usual minimization algorithms.

A note on completeness:  making an FA explicitly complete-- even if it's
only for intermediate constructions --becomes extremely unattractive in a
Unicode world.  I didn't find any literature on this, so (probably re-)
invented a notion of "NOT-set" transitions:  in my "overly general" FSM
classes, each NFA state has a symbol set N associated with it, and a target
NOT-set T of states, such that any symbol *not* in N transitions to the
states in T.  The details proved delicate in some cases, but it all worked
out quite prettily.  Unfortunately, it complicates the runtime matcher
too -- it's really an extension of the DFA concept.  In partial
compensation, the alphabet doesn't need to be specified at DFA construction
time (indeed, can be specified at runtime -- or even left implicit
forever!).  Overall, it probably adds more complications than it's worth.

computers-were-built-for-english-speakers-anyway<wink>-ly y'rs  - tim