dictionary comparison

Mon Aug 14 07:47:55 EDT 2000

<pzs at dcs.gla.ac.uk> wrote in message news:8n8hqc$l24$1 at nnrp1.deja.com...
    [snip]
> > keys=FSA.keys()
> > keepdict={}
> > removedict={}
> >
> > for istate1 in range(len(keys)-1):
> >     state1=keys[istate1]
> >     if removedict.has_key(state1): continue
> >     for istate2 in range(istate1+1, len(keys)):
> >         state2=keys[istate2]
> >         if FSA[state1]==FSA[state2]:
> >             keepdict[state1]=1
> >             removedict[state2]=1
> >
> > keeplist = keepdict.keys()
> > removelist = removedict.keys()
> >
> > If state1 is in the removedict, so also are any other
> > 'following' states whose transitionlists might be
> > identical with it (they've been placed there on the
> > same outerloop as state1 was); so we need check
> > no more.

...and also, any state that goes on the keeplist cannot be
on the removelist too -- that is easy to see.

> This is indeed a much more elegant solution than my current one, but I
> am now faced with the problem that in fact simply a remove list is not
> enough information to trim the FSA. The items from the remove list

Right, you need a removedict -- where each state on the removelist
corresponds to the state (on the keeplist) that caused it to be removed
and to which, therefore, it is equivalent (==with which it must be
replaced).  So, in the above snipped, change the single line:

        removedict[state2]=1

into:

        removedict[state2]=state1

and voila, you have the extra info you need.
I don't think you need keeplist and keepdict here, btw, so you might
just take those 2 lines away.

> cannot simply be excised, since any other state that might happen to be
> pointing at them must be rerouted to one of the states in the keep list
> which is equivalent to it.

There is only one state on the keeplist that's equivalent to any given state
on the removelist in this approach, by definition of an equivalence
relationship.  We take only one sample (the first in the random ordering
incorporated in the keys list) from each equivalence class.

> For example if states a and b are equivalent,
> a will be added to the keep list and b to the remove list. However, when
> b comes to be removed a further state d, which contains a transition to
> b, must have this entry changed to instead indicate a transition to the
> equivalent a.

So, with these mods, you will change d's entry to indicate removedict[b],
which will be a.

> At the moment, after every match I iterate through the entire structure
> rerouting the transitions, but the FSAs often contain a large number of
> identical states, and to remove them all and reroute in one go would be
> preferable. If I store with the removal list a state to which
> transitions to the now redundant state should be rerouted, how can I be
> sure the rerouting point will not itself be a member of the list to
> remove without excessive book-keeping?

By formalizing the above "easy to see" proof that we place in keeplist
(and therefore as value in the dictionary for keys in removelist) only
one state out of each equivalence class.  Per absurdum: suppose two
equivalent states A, B were there and (without loss of generality) that
A comes before B in the keys list.  But then, when the outerloop came
to examine A, B must have been examined by the innerloop, found
equivalent, hence placed as a key in the removedict.  So when at a
successive outerloop iteration B was examined, the remove_dict.has_key
must have been true for B, so all of the innerloop was skipped for B
and it can never have ended up as the value for any key in remove_dict.

We can get much more formal than this if you want, extracting and
expressing the loop-invariants, but I feel pretty confident already.

> If I can make this approach efficient, a number of iterations of it will
> implement a DFA reduction algorithm without the more involved process
> described by Hopcroft and Ullman - I have already implemented a subset
> construction routine for NFA-DFA conversion.
>
> Thanks for your help so far,

You're welcome, and I think you'll find that the problem is close to
being solved at this point (without the need for a keeplist, surely not
one made up of all and only the values that are also values for 1+
keys in removedict; if you need a keeplist, it's one of all states that
are not keys in removedict, and we can have that implicitly).

The whole removal-of-duplicates might be something like
(assuming FSA[state] is just a list of other states, for
definiteness in translating; adjust to taste if it's a more
elaborate structure such as an input->state mapping):

while 1:
    keys = FSA.keys()

    # find out which states are equivalent and
    # can thus be simplified down to one
    removedict = {}
    for istate1 in range(len(keys)-1):
        state1 = keys[istate1]
        if removedict.has_key(state1): continue
        for istate2 in range(istate1+1, len(keys)):
            state2 = keys[istate2]
            if FSA[state1] == FSA[state2]:
                removedict[state2] = state1

    # exit when no more simplifications can
    # be performed (no equivalent states)
    if not removedict.keys(): break

    # apply the simplifications
    newFSA = {}
    for state in keys:
        # don't reproduce states to be simplified
        if removedict.has_key(state): continue
        transitions = []
        for dest in FSA[state]:
            newDest = removedict.get(dest,dest)
            transitions.append(newDest)
        newFSA[state] = transitions

    FSA = newFSA

Only noticeable detail here is the handy Python idiom:

    newDest = removedict.get(dest,dest)

which puts in newDest: either dest itself, if it's not a
key in removedict; or, if dest IS a key in removedict,
then whatever corresponds to it in that mapping.

Of course, this is just one style of presentation.  One
might be more verbose, e.g. changing this loop into:

        for dest in FSA[state]:
            if removedict.has_key(dest):
                dest = removedict[dest]
            transitions.append(dest)

or more compact/idiomatic, with map/filter/etc. But
the basic algorithm, it seems to me, is in place.

The largest cost is the O(N*N) in the first outerloop.
You might reduce that by sorting, O(N logN), so that
equivalent states end up adjacent to each other, as
then you can identify the equivalence with a single
pass, O(N).

Alex