[Tutor] Regular expression
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Mon Oct 11 03:39:27 CEST 2004
> >Hmmm! Ok, it sounds like you're making some kind of mini-language to
> >make it easier to type out Graphviz dotty files. Each one of these
> >out-edge descriptions appears to have some kind of structure:
> >
> > out_edge_description ::=
> > open_brace
> > start_vertex_name
> > pipe_symbol
> > a_bunch_of_colon_separated_names
> > close_brace
> >
> >We can parse this pretty informally, by using regular expressions.
> >But there's also a fairly systematic way we can attack this: we can go
> >all out and use a token/parser approach. Would you like to hear about
> >that?
> >
> I don't know about kumar but I would love to hear about this because
> I've been reading about it but it has not sunk in yet.
[Warning: really long message ahead. I have to learn how to explain this
better... *sigh*]
Ok, let's walk through a demonstration of the tokenizer/parser
technique on this problem. Just to make the problem slightly
different, let's try to write a program that takes strings like:

    "[3,1,4,1,5]"

and translates them into an equivalent Python list. Yes, I know that
we can use eval() to do this, but since we say that eval() is evil,
we'd better show an alternative way to do it. *grin*
The tokenizer/parser technique is a two-step process:

    tokenization: take the string, and break it down into chunks.
    parsing:      apply some structure to those chunks.
Concretely, a "tokenization" of a string like "[3,1,4,1,5]" will puree
it into something like this:
[("LEFT-BRACKET",), ("NUMBER", 3), ("COMMA",), ("NUMBER", 4),
("COMMA",), ("NUMBER", 4), ("COMMA",), ("NUMBER", 1), ("COMMA",),
("NUMBER", 5), ("RIGHT-BRACKET")]
This step is actually not too bad: we can write such a tokenizer by using
some string manipulation.
Here's a tokenize() that does the grunt-work:
###
import re

def tokenize(s):
    """Breaks the string 's' into a list of tokens.  A token can be of
    the following types:

        (LEFT-BRACKET,)
        (RIGHT-BRACKET,)
        (COMMA,)
        (NUMBER, n)       where 'n' is an integer
    """
    tokens = []
    patterns = [
        (re.compile(r'\['), "LEFT-BRACKET"),
        (re.compile(r'\]'), "RIGHT-BRACKET"),
        (re.compile(r','), "COMMA"),
        (re.compile(r'\d+'), "NUMBER"),
    ]
    while s:
        for regex, token_type in patterns:
            match = regex.match(s)
            if match:
                ## Treat numbers slightly differently, since we want
                ## to "int" the result.
                if token_type == 'NUMBER':
                    tokens.append(('NUMBER', int(match.group())))
                else:
                    tokens.append((token_type,))
                s = s[match.end():]
                break
        else:
            raise ValueError("I dunno how to parse %r" % s)
    return tokens
###
Hmm... that's slightly messy; my apologies!
Before I go on, I'd better mention that there are good automated tools for
Python that we can use to handle tokenization and parsing; I'm just
showing how we'd do this all from scratch. If we were to use something
like the Ply package:
http://systems.cs.uchicago.edu/ply/
then all of this would be much much shorter. I want to show how this
stuff works, by hand, so that the automated tools will make more sense.
But if I were doing this for real, I'd definitely use a tool like Ply
instead. *grin*
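Just to give a taste of it, here's roughly what the tokenizer half
might look like in Ply. This is only a sketch from memory (the token
names, which have to be plain identifiers in Ply, and the exact API
details are my best recollection), so check the Ply documentation
before trusting it:

###
import ply.lex as lex

## Ply token names must be valid Python identifiers, so no hyphens.
tokens = ('LBRACKET', 'RBRACKET', 'COMMA', 'NUMBER')

t_LBRACKET = r'\['
t_RBRACKET = r'\]'
t_COMMA    = r','

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)    ## "int" the matched text, as before
    return t

def t_error(t):
    raise ValueError("I dunno how to parse %r" % t.value)

lexer = lex.lex()
lexer.input("[3,1,4,1,5]")
while True:
    tok = lexer.token()       ## returns None when the input runs out
    if tok is None:
        break
    print(tok)
###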
Anyway, let's see how this tokenize() works:
###
>>> tokenize("[1,2,3]")
[('LEFT-BRACKET',), ('NUMBER', 1), ('COMMA',), ('NUMBER', 2), ('COMMA',),
('NUMBER', 3), ('RIGHT-BRACKET',)]
>>>
>>>
>>> tokenize("[3,1,4,1,5]")
[('LEFT-BRACKET',), ('NUMBER', 3), ('COMMA',), ('NUMBER', 1), ('COMMA',),
('NUMBER', 4), ('COMMA',), ('NUMBER', 1), ('COMMA',), ('NUMBER', 5),
('RIGHT-BRACKET',)]
>>>
>>> tokenize("I should break")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/tmp/python-16929BsN", line 34, in tokenize
ValueError: I dunno how to parse 'I should break'
###
Ok, we have a tokenizer, but we still don't have a list of numbers.
That's where the "parser" part comes in. A parser is the thing that'll
reconstruct the number list from these tokens. A parser is actually a
bunch of grammar rules. In English, for example, we know that a sentence
can be made up of a "subject", a "verb" and an "object":
sentence ::= subject verb object
For example:
I love bees
is a sentence; "I" is the subject, "love" is the verb, and "bees" is the
object.
    sentence ::= subject verb object
             --> "I" verb object
             --> "I" "love" object
             --> "I" "love" "bees"
Our job is to figure out the grammatical rules that number lists follow;
once we've done this, then writing a parser to recognize those rules will
be a surprisingly direct translation.
Let's look at a few examples of number lists.
[1,2,3]
[3,1,4,1,5]
[7]
[]
We can say that a number list is made up of a left bracket, a bunch of
comma-separated numbers, and a right bracket. More formally, we'd say:
number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
But what do comma-separated numbers look like? That can either be empty
(like the last example, when we want to handle "[]"):
nums ::=
Or they can have a single number:
nums ::= NUMBER
And of course, they can have more than one number: let's handle that too:
nums ::= NUMBER comma-nums
comma-nums ::=
comma-nums ::= COMMA NUMBER comma-nums
That's actually all the rules we need. Let me bring all these "grammar"
rules together in one place:
###
number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
nums ::=
nums ::= NUMBER
nums ::= NUMBER comma-nums
comma-nums ::=
comma-nums ::= COMMA NUMBER comma-nums
###
Hmm... Actually, one of those rules is redundant; let me take that
redundant rule out:
###
1. number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
2. nums ::=
3. nums ::= NUMBER comma-nums
4. comma-nums ::=
5. comma-nums ::= COMMA NUMBER comma-nums
###
Ok, here is our grammar for number lists. I'm just numbering our rules to
make life a little easier.
But this is all very weird and inscrutable! *grin* So let's take a quick
look at how these rules work. Let's say that we have something like
"[1,2]". How can we get our grammar rules to fit our tokens? We should
start off with the main rule, the 'number-list' rule:
number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
When we're trying to figure out what 'nums' stands for, we have two
possible rules to choose from:
###
2. nums ::=
3. nums ::= NUMBER comma-nums
###
Let's substitute nums with Rule Three:
    number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
                --> LEFT-BRACKET NUMBER comma-nums RIGHT-BRACKET

We can then substitute comma-nums with Rule Five:

    number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
                --> LEFT-BRACKET NUMBER comma-nums RIGHT-BRACKET
                --> LEFT-BRACKET NUMBER COMMA NUMBER comma-nums
                    RIGHT-BRACKET

and then finally choose Rule Four on 'comma-nums':

    number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
                --> LEFT-BRACKET NUMBER comma-nums RIGHT-BRACKET
                --> LEFT-BRACKET NUMBER COMMA NUMBER comma-nums
                    RIGHT-BRACKET
                --> LEFT-BRACKET NUMBER COMMA NUMBER RIGHT-BRACKET

And the final derived rule is something that'll fit against "[1,2]":

    --> LEFT-BRACKET NUMBER COMMA NUMBER RIGHT-BRACKET
             [         1      ,     2         ]
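By the way, just to double-check the empty case: for the string "[]",
we'd choose Rule Two for 'nums' instead, and the derivation is much
shorter:

    number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
                --> LEFT-BRACKET RIGHT-BRACKET

which fits "[]" exactly.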
Let's look at those five rules again:
###
number-list ::= LEFT-BRACKET nums RIGHT-BRACKET
nums ::= NUMBER comma-nums
nums ::=
comma-nums ::= COMMA NUMBER comma-nums
comma-nums ::=
###
It turns out that turning these rules into Python code is actually pretty
straightforward. Let's show what that looks like:
###
def peek(tokens):
    """Peeks at the next token in our token list."""
    return tokens[0]

def eat(tokens, token_type):
    """Eats the next token, and returns it."""
    assert tokens[0][0] == token_type
    return tokens.pop(0)

def parse_number_list(tokens):
    """number-list ::= LEFT-BRACKET nums RIGHT-BRACKET"""
    eat(tokens, 'LEFT-BRACKET')
    result = parse_nums(tokens)
    eat(tokens, 'RIGHT-BRACKET')
    return result

def parse_nums(tokens):
    """nums ::= NUMBER comma-nums
       nums ::=
    """
    if peek(tokens)[0] == 'NUMBER':
        number = eat(tokens, 'NUMBER')[1]
        other_numbers = parse_comma_nums(tokens)
        return [number] + other_numbers
    else:
        return []

def parse_comma_nums(tokens):
    """comma-nums ::= COMMA NUMBER comma-nums
       comma-nums ::=
    """
    if peek(tokens)[0] == 'COMMA':
        eat(tokens, 'COMMA')
        number = eat(tokens, 'NUMBER')[1]
        other_numbers = parse_comma_nums(tokens)
        return [number] + other_numbers
    else:
        return []
###
There's our parser. Each one of the 'parse_*' functions is
responsible for choosing which rule to apply. For example, the
'comma-nums' rule has two choices to deal with:

comma-nums ::= COMMA NUMBER comma-nums
comma-nums ::=

and so the corresponding parse_comma_nums() function does a quick
check to see if the tokens have a COMMA coming up next. If so, then
the first rule has to apply. Otherwise, the "empty" rule is in
effect.
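One caveat worth admitting: this parser is pretty trusting. If the
tokens run out in the middle of a rule, peek() will die with a bare
IndexError, and the assert in eat() isn't very informative either.
If we wanted friendlier error messages, we could beef up those two
helpers; here's one possible sketch. (The fake ('EOF',) marker token
is my own invention here, not part of the code above; everything
else stays the same.)

###
def peek(tokens):
    """Peeks at the next token, giving back a fake EOF marker
    if we've run out of tokens."""
    if tokens:
        return tokens[0]
    return ('EOF',)

def eat(tokens, token_type):
    """Eats the next token, complaining loudly if it isn't of
    the expected type."""
    if peek(tokens)[0] != token_type:
        raise ValueError("Expected %s, but saw %r"
                         % (token_type, peek(tokens)))
    return tokens.pop(0)
###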
Let's just check to see that all this mess does something useful.
###
>>> parse_number_list(tokenize('[1,2]'))
[1, 2]
>>>
>>> mylist = parse_number_list(tokenize('[3,1,4,1,5]'))
>>> mylist[0]
3
###
Ok, good, this actually works. *grin*
Now, all this work is complete overkill for parsing a simple list of
numbers. But it's still slightly useful: notice that we haven't
handled whitespace yet. We might want to let people put spaces after
the commas, like:

"[1, 2, 3]"

We can handle this by modifying the tokenize() function to ignore
whitespace. The nice thing to see is that the parser functions don't
need to worry about lexical issues at all.
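Here's one way to do the modification (a sketch): have the tokenizer
quietly chew up any leading whitespace at the top of its loop, before
it tries the real patterns. None of the parse_* functions need to
change.

###
import re

WHITESPACE = re.compile(r'\s+')

def tokenize(s):
    """Just like before, but quietly skips whitespace between
    tokens."""
    tokens = []
    patterns = [
        (re.compile(r'\['), "LEFT-BRACKET"),
        (re.compile(r'\]'), "RIGHT-BRACKET"),
        (re.compile(r','), "COMMA"),
        (re.compile(r'\d+'), "NUMBER"),
    ]
    while s:
        ## The new bit: eat any leading whitespace, then go around
        ## the loop again.
        ws = WHITESPACE.match(s)
        if ws:
            s = s[ws.end():]
            continue
        for regex, token_type in patterns:
            match = regex.match(s)
            if match:
                if token_type == 'NUMBER':
                    tokens.append(('NUMBER', int(match.group())))
                else:
                    tokens.append((token_type,))
                s = s[match.end():]
                break
        else:
            raise ValueError("I dunno how to parse %r" % s)
    return tokens
###

With that in place, parse_number_list(tokenize('[1, 2, 3]')) should
give us [1, 2, 3], spaces and all.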
Another nice point is that the parsing technique can handle more
complicated situations. Let's say we'd like to handle nested number
lists:
"[1,2,[3,[4]]]"
This might look a bit scary, but we can actually attack this problem in
almost the same way as the previous example. Here's a grammar to parse
nested number lists:
###
element-list ::= LEFT-BRACKET elements RIGHT-BRACKET
elements ::= element comma-elements
elements ::=
element ::= NUMBER
element ::= element-list
comma-elements ::= COMMA element comma-elements
comma-elements ::=
###
The only big change here is that we generalize the old 'nums' rule
into an 'elements' rule that can handle arbitrary elements. An
element can either be a simple number or a nested list.
Here's a translation of that grammar into Python code:
###
def parse_element_list(tokens):
    """element-list ::= LEFT-BRACKET elements RIGHT-BRACKET"""
    eat(tokens, 'LEFT-BRACKET')
    result = parse_elements(tokens)
    eat(tokens, 'RIGHT-BRACKET')
    return result

def parse_elements(tokens):
    """elements ::= element comma-elements
       elements ::=
    """
    if peek(tokens)[0] in ('NUMBER', 'LEFT-BRACKET'):
        elt = parse_element(tokens)
        other_elts = parse_comma_elements(tokens)
        return [elt] + other_elts
    else:
        return []

def parse_element(tokens):
    """element ::= NUMBER
       element ::= element-list
    """
    if peek(tokens)[0] == 'NUMBER':
        return eat(tokens, 'NUMBER')[1]
    else:
        return parse_element_list(tokens)

def parse_comma_elements(tokens):
    """comma-elements ::= COMMA element comma-elements
       comma-elements ::=
    """
    if peek(tokens)[0] == 'COMMA':
        eat(tokens, 'COMMA')
        elt = parse_element(tokens)
        other_elts = parse_comma_elements(tokens)
        return [elt] + other_elts
    else:
        return []
###
And this works as well, without having to make any changes to the
tokenizer:
###
>>> parse_element_list(tokenize('[1,[2,[3,[4]]]]'))
[1, [2, [3, [4]]]]
###
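One last touch we might add (my own embellishment; we don't strictly
need it): a little parse() function that glues the two stages
together and complains if there's leftover junk after a complete
parse:

###
def parse(s):
    """Tokenizes and parses a nested number list, making sure
    that we've consumed every last token."""
    tokens = tokenize(s)
    result = parse_element_list(tokens)
    if tokens:
        raise ValueError("Leftover tokens: %r" % tokens)
    return result
###

That way, slightly malformed input like "[1,2]]" gets caught instead
of silently ignored.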
I went through this really darn fast, without explaining much of it at
all. But I hope that some of it made some sense. *grin* The parsing
technique that we've used here is known as "recursive descent".
As a last note, recursive-descent parsing is the technique that Python
itself uses to parse its language! If we look at the Python Reference
Manual:
http://www.python.org/doc/ref/exprlists.html#tok-expression_list
we can see other examples of grammar rules that look suspiciously similar
to the list-of-numbers example that we just worked out. *grin*
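For instance, if I remember right, the rule there for expression
lists looks something like:

###
expression_list ::= expression ( "," expression )* [","]
###

The "( ... )*" star is just a more compact way of writing what our
recursive 'comma-nums' rule says: zero or more repetitions. (I'm
quoting that rule from memory, so double-check it against the
manual.)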
I hope this helps!