[Tutor] is this a state machine? [designing a program to detect WikiWords]

Thu Aug 7 19:17:42 EDT 2003

On Thu, 7 Aug 2003, Kirk Bailey wrote:

> def iswikiword(word):				# test to see is a wikiword
> 	caps='ABCDEFGHIJKLMNOPQRSTUVWXYZ'	# define capitalletters
> 	prior=0					# set inital state for flags
> 	count=0					# wikiwords have 2+transitions
> 	for letter in word:			# examine each letter
> 		if letter in caps:		# if it IS a capital letter,
> 			prior=1			# store that state
> 		else:					# and if not,
> 			if prior==1:		# but the previous one WAS,
> 				prior=0		# reset the flag,
> 				count=count+1	# anc count the transition!
> 			else:				# otherwise,
> 				pass 			# chill out.
> 	if count > 1:				# ok, the results are in;
> 		return 1				# if there are 2+, itis.
> 	else:						# and if not,
> 		return 0				# say so.

[Warning: case study of using finite state machines to recognize
WikiWords.]

Hi Kirk,

Let's define what a "wiki word" is in English, before defining it in code.
According to:

    """It is a concatenation of words (of two or more letters each)
    without white space between them, and with the first letter (only) of
    each component word capitalized."""

    http://c2.com/cgi/wiki?WikiWords

The code above will work for a lot of cases, but will break on something
like this:

###
>>> iswikiword("ThisIsNOTAWikiWord")
1
###

A finite state machine is also not allowed to do counting, so the 'count'
variable in the function will disqualify it from being a finite state
machine.

We can imagine a finite state machine to recognize WikiWords that might
look like this:

                   capital?
          START --------------> On1stCapitalLetter
                                      |
                                      |
                                      | lowercase?
                                      |
                                      |
                                      | <--------+
                                      |          |
                                      |          | lowercase?
                                      |          |
                                      V          |
                                  FirstWord -----+
                                      |
                                      |
                                      | capital?
                                      |
                                      |
                                      V
                        +---> OnAnotherCapitalLetter
                        |             |
                        |             |
                        |             | lowercase?
                        |             |
                        |             |
                        |             | <---------+
              capital?  |             |           |
                        |             |           | lowercase?
                        |             |           |
                        |             V           |
                        |       **ANOTHERWORD ----+
                        |             |
                        |             |
                        |             |
                        +-------------+

This looks suspiciously like a flow diagram, for which I apologize: it's
hard drawing circles in ASCII, so I've omitted them.  *grin*

The diagram above has 5 states:

    [START, On1stCapitalLetter, FirstWord, OnAnotherCapitalLetter,
     ANOTHERWORD]

Whew.  That's a mouthful.  Let's shorten the names slightly:

    [START, On1st, FirstWord, OnAnother, ANOTHERWORD]

I've labeled the state that we start on as "START".  We jump from state to
state, depending if the next letter is uppercase or lowercase --- and if
we ever "fall off" the machine, then the word definitely isn't a Wiki
Word.  Think of it as a board game.

If it's a board game, how do we win?  How do we know if we are recognizing
a WikiWord?  Well, if we're on the ANOTHERWORD state by the time we're at
the end of our word, we'll consider that to be a WikiWord: that's the goal
of any word that wants to have the Wiki nature.

Before we put down any Python code, let's try it out on something like
"WikiWord":

------

time   state   next letter   comments
----   -----   -----------   --------

 1     START       W          'W' is capitalized, so we jump to
                              On1stCapitalLetter

 2     On1st       i          'i' is lowercased, so we jump to Firstword

 3     FirstWord   k          'k' is lowercased.  There's an "lowercase?"
                              arrow that points back to FirstWord, so we
                              jump back to FirstWord.

 4     FirstWord   i          Ditto.

 5     FirstWord   W          Ok, we've seen another capital.

 6     OnAnother   o          Yeah, now we can jump onto ANOTHERWORD!

 7     ANOTHERWORD r          ... but we can't celebrate until we finish
                              reading the rest of our input!

 8     ANOTHERWORD d          Almost there.

 9     ANOTHERWORD            Done!  And yes, 'WikiWord' is a WikiWord

------

Once we try a few words against our game board, we'll probably feel more
comfortable that it's doing the right thing.  Now to implement it in
Python!  *grin*

The diagram above is easy for us humans to trace, but not so convenient
for computers.  Let's make it slightly easier by "redrawing" that diagram
as a table:

                    capital?      lowercase?
-----------------------------------------------
START        |      On1st          FAIL
On1st        |      FAIL           FirstWord
FirstWord    |      OnAnother      FirstWord
OnAnother    |      FAIL           ANOTHERWORD
ANOTHERWORD  |      OnAnother      ANOTHERWORD
FAIL         |      FAIL           FAIL

I've sneaked in one more state, a "FAIL" state that catches the situation
when we "fall" off the diagram.  Think of it as a black hole for failed
WikiWords.

If we want to be concise, we can do another renaming of the states of our
machine, from words to numbers.  Let's do that.

      capital?   lowercase?
--------------------------
0   |    1           5
1   |    5           2
2   |    3           2
3   |    5           4
4   |    3           4
5   |    5           5

And something like this is easy to represent as a list:

###
>>> transitions = [[1, 5],
...                [5, 2],
...                [3, 2],
...                [5, 4],
...                [3, 4],
...                [5, 5]]
###

And now, we can write something that uses this table to jump around our
state diagram:

###
def isWikiWord(word):
    """Code to demonstrate a Finite State Machine that recognizes
       WikiWords."""
    SUCCESS_STATE = 4
    TRANSITIONS = [[1, 5],
                   [5, 2],
                   [3, 2],
                   [5, 4],
                   [3, 4],
                   [5, 5]]
    current_state = 0
    for letter in word:
        if letter.upper() == letter:
            current_state = TRANSITIONS[current_state][0]
        else:
            current_state = TRANSITIONS[current_state][1]
    return current_state == SUCCESS_STATE
###

The code itself is simple --- if completely nonobvious.  *grin* The real
complexity of the code has been hidden by our transition state table, but
the actual running of the code is pretty tame.

But does it work?

###
>>> isWikiWord("WikiWord")
1
>>> isWikiWord("WikiWordS")
0
>>> isWikiWord("The")
0
>>> isWikiWord("TheS")
0
>>> isWikiWord("SeeMe")
1
>>> [(word, isWikiWord(word))
...  for word in '''WikiWords is a MixedCase "word"
...                 which is useful in identifying
...                 and linking WikiPages'''.split()]
[('WikiWords', 1), ('is', 0), ('a', 0),
 ('MixedCase', 1), ('"word"', 0), ('which', 0),
 ('is', 0), ('useful', 0), ('in', 0),
 ('identifying', 0), ('and', 0), ('linking', 0),
 ('WikiPages', 1)]
###

One very cute thing about finite state machines is that they can be
transformed into regular expressions.  That is, we can first draw a
diagram, and then calculate an equivalent regular expression that does the
same work as the finite state machine.

The process is slightly involved, and this post is too way long as it is.
*grin* But here's the end result of the transformation:

###
>>> regex = re.compile('''
...     [A-Z][a-z]+
...     [A-Z][a-z]
...     ([A-Z][a-z] | [a-z])*
...     ''', re.VERBOSE)
>>> regex.match('Wiki')
>>> regex.match('WikiWord')
<_sre.SRE_Match object at 0x8168128>
>>> regex.match('WikiWordsAreNeat')
<_sre.SRE_Match object at 0x8126060>
>>> regex.match('FooBAR')
>>> regex.match('FooB')
>>> regex.match('FooBar')
<_sre.SRE_Match object at 0x8168128>
###

If you're interested in this sort of stuff, you may want to look through
"Introduction to the Theory of Computation":

    http://www-math.mit.edu/~sipser/book.html

It's a pricey book, but awesome in its presentation of these ideas.

I hope this helps!