[Tutor] pattern expressions

Fri Nov 7 20:34:06 CET 2008

Question 1:
format_code	:= '+' | '-' | '*' | '#'
I need to specify that a single format_code may be repeated, always the
same one.
Not that several different ones may appear in a sequence.
format		:= (format_code)+
would catch '+-', which is wrong. I want only patterns such as '--',
'+++',...

This interpretation of '+' in your BNF is a bit out of the norm.  Usually
this notation 'format_code+' would accept 1 or more of any of your
format_code symbols, so '+-+--++' would match.
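To make the usual reading concrete, here is a small sketch using the standard re module rather than pyparsing. The names any_format_code and matches_usual_reading are made up for this illustration; the character class simply lists the four format_code symbols.

```python
import re

# The usual reading of (format_code)+ : one or more of ANY of the four symbols.
# Inside a character class only '-' needs escaping here.
any_format_code = re.compile(r"[+\-*#]+")

def matches_usual_reading(text):
    """Return True if the whole text is a run of format_code symbols."""
    return any_format_code.fullmatch(text) is not None

print(matches_usual_reading("+-+--++"))  # mixed symbols are accepted
print(matches_usual_reading("---"))      # a uniform run is accepted too
print(matches_usual_reading("+a+"))      # other characters are rejected
```

So under the usual reading '+-' is perfectly valid, which is why a different construct is needed for "the same symbol, repeated".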

In pyparsing, you could match things like '----' using the Word class and
specifying a string containing the single character '-':  Word('-').  That
is, parse a word made up of '-' characters.  There is no pyparsing construct
that exactly matches your (format_code)+ repetition, but you could use Word
and MatchFirst as in:

format = MatchFirst(Word(c) for c in "+-*#")

A corresponding regular expression might be:
formatRE = '|'.join(re.escape(c)+'+' for c in "+-*#")

which you could then parse using the re module, or wrap in a pyparsing Regex
object:

format = Regex(formatRE)
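
As a quick check of the regular expression route, the following sketch uses only the re module. formatRE is built exactly as above; format_pattern and is_uniform_run are hypothetical names chosen for this example.

```python
import re

# One alternative per symbol, each repeated one or more times.
formatRE = '|'.join(re.escape(c) + '+' for c in "+-*#")
format_pattern = re.compile(formatRE)

def is_uniform_run(text):
    """Return True if text is one format_code symbol repeated."""
    return format_pattern.fullmatch(text) is not None

print(is_uniform_run("--"))    # True
print(is_uniform_run("+++"))   # True
print(is_uniform_run("+-"))    # False: two different symbols

# Scanning a longer string splits it into uniform runs.
print(format_pattern.findall("+++--#"))   # ['+++', '--', '#']
```

Because each alternative repeats only its own symbol, a mixed string such as '+-' can never match as a whole.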

Question 2:
style_code	:= '/' | '!' | '_'
Similar case, but different. I want patterns like:
styled_text	:= style plain_text style
where both style instances are identical. As the number of styles may grow
(and even be unpredictable: the style_code line will actually be written at
runtime according to a config file) I don't want, and anyway can't, specify
all possible kinds of styled_text. Even if possible, it would be ugly!

pyparsing includes two methods to help you match the same text that was
matched before - matchPreviousLiteral and matchPreviousExpr.  Here is how
your example would look:

style = oneOf("/ ! _")    # the style_code alternatives from the question
plain_text = Word(alphanums + " ")
styled_text = style + plain_text + matchPreviousLiteral(style)

(There is similar capability in regular expressions, too.)
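
For the regular expression version, a backreference does the job. This is only a sketch with the re module: style_codes stands in for whatever the config file supplies at runtime, and styled_text_re and parse_styled are names invented for the example.

```python
import re

# Assumed to be read from the config file at runtime.
style_codes = "/!_"

# (?P<style>...) captures the opening code; (?P=style) then requires exactly
# the same text again, so the closing code must equal the opening one.
styled_text_re = re.compile(
    r"(?P<style>[%s])(?P<plain>[A-Za-z0-9 ]+)(?P=style)" % re.escape(style_codes)
)

def parse_styled(text):
    """Return (style, plain_text) if text is styled text, else None."""
    m = styled_text_re.fullmatch(text)
    return (m.group("style"), m.group("plain")) if m else None

print(parse_styled("/hello world/"))   # ('/', 'hello world')
print(parse_styled("_x_"))             # ('_', 'x')
print(parse_styled("/hello!"))         # None: opening and closing codes differ
```

Adding a new style then only means adding a character to style_codes; the pattern itself does not have to list every combination.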

Question 3:
I would like to specify a "side-condition" for a pattern, meaning that it
should only match when a specific token lies beside it. For instance:
A	:= A_pattern {X}
X is not part of the pattern, thus should not be extracted. If X is just
"garbage", I can write an enlarged pattern, then drop X later:
A	:= A_pattern
A_X	:= A X

I think you might be looking for some kind of lookahead.  In pyparsing, this
is supported using the FollowedBy class.

A_pattern = Word(alphas)
X = Literal(".")
A = A_pattern + FollowedBy(X).leaveWhitespace()

print(A.searchString("alskd sldjf sldfj. slfdj . slfjd slfkj."))

prints

[['sldfj'], ['slfkj']]
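
The same idea can be sketched with a regular expression lookahead instead of FollowedBy. Note that this swaps pyparsing for the re module; A_re and find_a are hypothetical names, and [A-Za-z]+ is assumed to be an adequate stand-in for Word(alphas).

```python
import re

# [A-Za-z]+ plays the role of A_pattern; (?=\.) requires a "." immediately
# after the word but does not consume it, so it is not part of the match.
A_re = re.compile(r"[A-Za-z]+(?=\.)")

def find_a(text):
    """Return every word that is directly followed by a '.'."""
    return A_re.findall(text)

print(find_a("alskd sldjf sldfj. slfdj . slfjd slfkj."))
# ['sldfj', 'slfkj']
```

As with the pyparsing version, 'slfdj' is skipped because whitespace separates it from the following '.'.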