Regular Expressions
skip at pobox.com
Mon Feb 12 12:17:14 EST 2007
dbl> The source of HTMLParser and xmllib use regular expressions for
dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
dbl> code--sgmllib starts off with a bunch of regular expressions used
dbl> to parse data.
I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences").
If I have a simple expression:
(7 + 3.14) * CONST
that's just a stream of bytes, "(", "7", " ", "+", ... Lexical analysis
chunks that stream of bytes into the "words" of the language:
LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")
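To make that concrete, here is a rough sketch of a regex-driven lexer for the expression above. The token names follow my example; the patterns and the tokenize() helper are my own invention for illustration, not anything taken from HTMLParser, xmllib or sgmllib.

```python
import re

# Hypothetical token table: (name, pattern) pairs made up for this example.
TOKEN_SPEC = [
    ("NUMBER",   r"\d+(?:\.\d+)?"),   # integer or decimal literal
    ("IDENT",    r"[A-Za-z_]\w*"),    # identifier such as CONST
    ("LPAREN",   r"\("),
    ("RPAREN",   r"\)"),
    ("PLUS",     r"\+"),
    ("TIMES",    r"\*"),
    ("SKIP",     r"\s+"),             # whitespace, discarded
    ("MISMATCH", r"."),               # anything else is an error
]

# One alternation of named groups; the group that matched names the token.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Chunk text into a flat list of token tuples."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError("unexpected character %r" % value)
        if kind == "NUMBER":
            # Keep integers as int and decimals as float.
            tokens.append((kind, float(value) if "." in value else int(value)))
        elif kind == "IDENT":
            tokens.append((kind, value))
        else:
            tokens.append((kind,))    # punctuation carries no value
    return tokens


print(tokenize("(7 + 3.14) * CONST"))
```

Note that the output is still a flat list. Nothing in it says which operands belong to which operator.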
Parsing then constructs a higher level representation of that stream of
"words" (more commonly called tokens or lexemes). That representation is
application-dependent.
Regular expressions are ideal for lexical analysis. They are not-so-hot for
parsing unless the grammar of the language being parsed is *extremely*
simple.
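The trouble is nesting: parentheses can nest to any depth, and a single regular expression can't keep track of that. A parser usually handles it with recursion instead. Below is a minimal recursive-descent sketch (Python 3) that turns the token list for my example into a nested tuple. The grammar, the tuple-shaped tree and the parse() helper are all assumptions I made up for illustration; a real application would pick its own representation.

```python
# Tokens as a lexer might hand them over for "(7 + 3.14) * CONST".
TOKENS = [("LPAREN",), ("NUMBER", 7), ("PLUS",), ("NUMBER", 3.14),
          ("RPAREN",), ("TIMES",), ("IDENT", "CONST")]


def parse(tokens):
    """Build a nested tuple tree from a flat token list (hypothetical grammar)."""
    pos = 0

    def peek():
        # Kind of the next token, or None at end of input.
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():      # expr := term (PLUS term)*
        node = term()
        while peek() == "PLUS":
            advance()
            node = ("+", node, term())
        return node

    def term():      # term := factor (TIMES factor)*
        node = factor()
        while peek() == "TIMES":
            advance()
            node = ("*", node, factor())
        return node

    def factor():    # factor := NUMBER | IDENT | LPAREN expr RPAREN
        kind = peek()
        if kind in ("NUMBER", "IDENT"):
            return advance()[1]
        if kind == "LPAREN":
            advance()
            node = expr()            # recursion handles the nesting
            if peek() != "RPAREN":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError("unexpected token %r" % (kind,))

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return tree


print(parse(TOKENS))
```

The regular expressions stay in the lexer; the structure comes from the functions calling each other.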
Here are a couple of much better expositions on these topics:
http://en.wikipedia.org/wiki/Lexical_analysis
http://en.wikipedia.org/wiki/Parsing
Skip