ANN: 'rex' 0.5, a module for easier creation and use of regular expressions.

bp foo at bar.com
Mon Jun 28 00:15:02 CEST 2004


NOTE: This is my second attempt at posting this; as far as I can tell,
the first posting never appeared on the newsgroups. If you've already
seen this, sorry for the repeat.

WHAT IS 'rex'?

rex is a module which makes the creation and use of regular expressions,
IMHO, much easier than is the case using the 're' module. It does this
by subclassing strings to provide a special regular-expression-type
string, and then defining operations on that class. This permits
building regular expressions without the use of arcane syntax, and
with the help of Python edit modes and the Python syntax checker.
It also makes regular expressions far, far easier to use.

WHERE TO GET IT?

http://homepage.mac.com/ken.mcdonald/FileSharing2.html 


WHAT ELSE?

I'd appreciate feedback and suggestions. Code contributions
are welcome too. I don't think I've mentioned the license
in the 0.5 docs, but it will change from a custom license
in the previous version to a PSF or similar standard license
in the final version.

Appended below is the (current) documentation, which could be
better, but isn't too bad. It's missing function/method reference
documentation, but at least shows you what rex can do. The next version
will have substantially more complete documentation. My editor
'helpfully' autowrapped the documentation--it's easier to read in the
copy included in the module.




rex: A Module to Provide a Better Interface to Regular Expressions.
===================================================================

'rex' provides a much better interface for creating and using regular
expressions. It is built on top of, and intended as a functional
replacement for, the Python 're' module.

Introduction
============

'rex' stands for any of (your choice):

    - Regular Expression eXtensions

    - Regular Expressions eXpanded

    - Rex, King of Regular Expressions (ha, ha ha, ha).


rex provides a completely different way of writing regular expressions
(REs). You do not use strings to write any part of the RE _except_ for
regular expression literals. No escape characters, metacharacters, etc.
Regular expression operations, such as repetition, alternation,
concatenation, etc., are done via Python operators, methods, or
functions.

The major advantages of rex are:

    - [This is a biggie.] rex permits complex REs to be built up easily
      from smaller parts. In fact, a rex definition for a complex RE is
      likely to end up looking somewhat like a mini grammar.

    - [Another biggie.] As an ancillary to the above, rex permits REs to
      be easily reused.

    - rex expressions are checked for well-formedness by the Python
      parser; this will typically provide earlier and
      easier-to-understand diagnoses of syntactically malformed regular
      expressions.

    - rex expressions are all strings! They are, in fact, a specialized
      subclass of strings, which means you can pass them to existing
      code which expects REs.

    - rex goes to some lengths to produce REs which are similar to those
      written by hand, i.e. it tries to avoid unnecessary use of
      nongrouping parentheses, uses special escape sequences where
      possible, writes 'A?' instead of 'A{0,1}', etc. In general, rex
      tries to produce concise REs, on the theory that if you really
      need to read the buggers at some point, it's easier to read
      simpler ones than more complex ones.


As an example, take a look at the definition of an RE matching a
complex number, an example included in test_rex.py. The rex Python code
to do this is:

    COMPLEX= (
      PAT.aFloat['re']
      + PAT.anyWhitespace 
      + ALT("+", "-")['op']
      + PAT.anyWhitespace
      + PAT.aFloat['im'] 
      + 'i'
    )


while the analogous RE is:

   
    (?P<re>(?:\+|\-)?\d+(?:.\d*)?)\s*(?P<op>\+|\-)\s*(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i
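
Because the generated pattern is an ordinary RE, it can be handed
straight to the standard 're' module. The following is a minimal sketch,
not taken from test_rex.py: the pattern is copied verbatim from above
(including its unescaped '.', which matches any character), while the
name COMPLEX_RE and the sample input are assumptions made for
illustration.

```python
import re

# The RE printed above, copied verbatim.
COMPLEX_RE = (
    r"(?P<re>(?:\+|\-)?\d+(?:.\d*)?)"
    r"\s*(?P<op>\+|\-)\s*"
    r"(?P<im>(?:\+|\-)?\d+(?:.\d*)?)i"
)

# Sample input chosen for illustration only.
match = re.match(COMPLEX_RE, "3.5 + -2.25i")

# The names attached with ['re'], ['op'] and ['im'] in the rex code
# show up as named groups in the match result.
parts = match.groupdict()
```

For this input, parts maps 're' to '3.5', 'op' to '+' and 'im' to
'-2.25'.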


The rex code is more verbose than the simple RE (which, by the way, was
the RE generated by the rex code, and is pretty much what you'd produce
by hand). It is also FAR easier to read, modify, and debug. And, it
illustrates how easy it is to reuse rex patterns: PAT.aFloat and
PAT.anyWhitespace are predefined patterns provided in rex which match,
respectively, a string representation of a floating point number (no
exponent), and a sequence of zero or more whitespace characters.

Using rex
=========

This is a quick overview of how to use rex. See documentation associated
with a specific method/function/name for details on that entity.

In the following, we use the abbreviation RE to refer to standard
regular expressions defined as strings, and the word 'rexp' to refer to
rex objects which denote regular expressions.

The starting point for building a rexp is either rex.PAT, which we'll
just call PAT, or rex.CHAR, which we'll just call CHAR. CHAR builds
rexps which match single character strings. PAT builds rexps which match
strings of varying lengths.

    - PAT(string) returns a rexp which will match exactly the string
      given, and nothing else.

    - PAT._someattribute_ returns (for defined attributes) a
      corresponding rexp. For example, PAT.aDigit returns a rexp
      matching a single digit.

    - CHAR(a1, a2, . . .) returns a rexp matching a single character
      from a set of characters defined by its arguments. For example,
      CHAR("-", ["0","9"], ".") matches the characters necessary to
      build basic floating point numbers. See CHAR docs for details.


Now assume that A, B, C,... are rexps. The following Python expressions
(_not_ strings) may be used to build more complex rexps:

    - A | B | C . . . : returns a rexp which matches a string if any of
      the operands match that string. Similar to "A|B|C" in normal REs,
      except of course you can't use Python code to define a normal RE.

    - A + B + C ...: returns a rexp which matches a string if all of A,
      B, C match consecutive substrings of the string in succession.
      Like "ABC" in normal REs.

    - A*n : returns a rexp which matches A a number of times as defined
      by n. This replaces '?', '+', and '*' as used in normal REs. See
      docs for details. 'rex' defines constants which allow you to say
      A*ANY, A*SOME, or A*MAYBE, indicating (0 or more matches), (1 or
      more matches), or (0 or 1 matches), respectively.

    - A**n : Like A*n, but does nongreedy matching.

    - +A : positive lookahead assertion: matches if A matches, but
      doesn't consume any of the input.

    - ~+A : negative lookahead assertion: matches if A _doesn't_ match,
      but doesn't consume any of the input.

    - -A, ~-A : positive and negative lookback assertions. Like
      lookahead assertions, but in the other direction.

    - A[name] : name must be a string: anything matched by A can be
      referred to by the given name in the match result object. (This is
      the equivalent of named groups in the re module.)

    - A.group() : A will be in an unnamed group, referable by number.
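
For readers who know the 're' syntax, the operators above correspond
roughly to the RE constructs below. This is only a sketch using the
standard 're' module: the sample sub-patterns A and B, the dictionary
name EQUIVALENTS, the group name 'name', and the exact strings are
assumptions, and rex itself may well emit different (typically more
concise) REs.

```python
import re

# Two sample REs standing in for the rexps A and B. They are fixed
# width because 're' only allows fixed-width lookbehind patterns.
A, B = r"\d", r"[a-z]"

# rex-style expression (as described above) -> a roughly equivalent RE.
EQUIVALENTS = {
    "A | B":     "(?:%s|%s)" % (A, B),  # alternation
    "A + B":     A + B,                 # concatenation
    "A*ANY":     "(?:%s)*" % A,         # 0 or more, greedy
    "A*SOME":    "(?:%s)+" % A,         # 1 or more, greedy
    "A*MAYBE":   "(?:%s)?" % A,         # 0 or 1, greedy
    "A**ANY":    "(?:%s)*?" % A,        # 0 or more, nongreedy
    "+A":        "(?=%s)" % A,          # positive lookahead
    "~+A":       "(?!%s)" % A,          # negative lookahead
    "-A":        "(?<=%s)" % A,         # positive lookbehind
    "~-A":       "(?<!%s)" % A,         # negative lookbehind
    "A['name']": "(?P<name>%s)" % A,    # named group
    "A.group()": "(%s)" % A,            # unnamed, numbered group
}

# Every entry is a syntactically valid RE.
for source in EQUIVALENTS.values():
    re.compile(source)

# Greedy versus nongreedy repetition on the same input.
greedy = re.match(EQUIVALENTS["A*ANY"], "123").group()
lazy = re.match(EQUIVALENTS["A**ANY"], "123").group()
```

Here greedy is '123' while lazy is the empty string, which is the
difference between A*n and A**n described above.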


In addition, a few other operations can be done:

    - Some of the attributes defined in PAT have "natural inverses"; for
      such attributes, the inverse may be taken. For example,
      ~PAT.aDigit is a pattern matching any character except a digit.

    - Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a
      pattern matching anything except a vowel.

    - 'ALT' gives a different way to denote alternation: ALT(A, B,
      C,...) does the same thing as A | B | C | . . ., except that none
      of the arguments to ALT need be rexps; any which are normal
      strings will be converted to a rexp using PAT.

    - 'PAT' can take multiple arguments: PAT(A, B, C,...), which gives
      the same result as PAT(A) + PAT(B) + PAT(C) + . . . .


Finally, a very convenient shortcut is that only the first object in a
sequence of operator/method calls needs to be a rexp; all others will be
automatically converted as if PAT(...) had been called on them. For
example, the sequence A | "hello" is the same as A | PAT("hello").

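One way the automatic conversion of plain strings could work is sketched
below. This is a hypothetical illustration, not rex's actual
implementation: the class MiniPat, its lit() constructor and its
_coerce() helper are invented names, and unlike rex this sketch always
wraps operands in nongrouping parentheses rather than trying to keep the
RE concise.

```python
import re


class MiniPat(str):
    """Hypothetical stand-in for a rexp: a str subclass holding RE source."""

    @staticmethod
    def lit(text):
        # Escape a plain string so that it matches literally.
        return MiniPat(re.escape(text))

    @staticmethod
    def _coerce(other):
        # MiniPat objects pass through; plain strings become literals.
        return other if isinstance(other, MiniPat) else MiniPat.lit(other)

    def __add__(self, other):
        # Concatenation: match self, then other.
        return MiniPat("(?:%s)(?:%s)" % (self, MiniPat._coerce(other)))

    def __or__(self, other):
        # Alternation: match self or other.
        return MiniPat("(?:%s|%s)" % (self, MiniPat._coerce(other)))


# Only the first operand is a MiniPat; the plain strings are converted.
greeting = MiniPat.lit("hi") | "hello"
pattern = greeting + ", world."
```

Because MiniPat is a str subclass, pattern can be passed directly to
functions such as re.fullmatch(), and the '.' in the plain string is
matched literally rather than as a metacharacter.
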
rex Character Classes
=====================

CHAR(args...) defines a character class. Arguments are any number of
strings or two-tuples/two-element lists, e.g.

    CHAR("ab-z")


is the same as the regular expression r"[ab\-z]". NOTE that there are no
'character range metacharacters'; the preceding defines a character class
containing four characters, one of which is a '-'.

This is a character class containing a backslash, hyphen, and open/close
brackets:

    CHAR(r"\-[]")


or

    CHAR("\\-[]")


Note that we still need to use raw strings (or doubled backslashes) to
turn off normal Python string escaping.

To define ranges, do this:

    CHAR(["a","z"], ["A","Z"])


To define inverse ranges, use the ~ operator, e.g. to define the class
of all non-numeric characters:

    ~CHAR(["0","9"])


Character classes cannot (yet) be doubly negated: ~~CHAR("A") is an
error.
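
The argument conventions described in this section can be mimicked with
the standard 're' module. The sketch below is an assumption-laden
illustration, not rex code: char_class() and its negate parameter are
hypothetical, and the REs rex actually generates may differ in detail.

```python
import re


def char_class(*args, negate=False):
    """Hypothetical helper: build an RE character class from literal
    strings and two-element (low, high) ranges."""
    parts = []
    for arg in args:
        if isinstance(arg, str):
            # Every character is a literal; '-' never denotes a range.
            parts.extend(re.escape(ch) for ch in arg)
        else:
            # A two-tuple or two-element list denotes an inclusive range.
            low, high = arg
            parts.append("%s-%s" % (re.escape(low), re.escape(high)))
    return "[%s%s]" % ("^" if negate else "", "".join(parts))


four_chars = char_class("ab-z")                   # a, b, '-', z
letters = char_class(["a", "z"], ["A", "Z"])      # two ranges
non_digits = char_class(["0", "9"], negate=True)  # inverted range
specials = char_class(r"\-[]")                    # backslash, '-', brackets
```

With these definitions, four_chars matches '-' but not 'c', since the
hyphen inside a string argument is never treated as a range.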

Predefined Constants
====================

rex provides a number of predefined patterns which will likely be of use
in common cases. Generally speaking, rex constant pattern names begin
with 'a' or 'an' (indicating a pattern that matches a single instance),
'any' (indicating a pattern that matches 0 or more instances), 'some'
(indicating a pattern that matches 1 or more instances), or 'optional'
(meaning the pattern matches 0 or 1 instance). Some special names are
also provided.

The 'rex' module may define other constant names, but you should only
use those below; others may change in future releases of rex.

    - Matches any character: aChar, someChars, anyChars

    - Matches digits (0-9): aDigit, someDigits, anyDigits

    - Matches whitespace characters: aWhitespace, someWhitespace,
      anyWhitespace

    - Matches letters (a-z, A-Z): aLetter, someLetters, anyLetters

    - Numeric values (signed or unsigned, no exponent): anInt, aFloat

    - Match only the start or end of the string: stringStart, stringEnd

    - Match only at a word border: wordBorder

    - Matches the empty string: emptyString

    - Any punctuation (non-whitespace) character on a standard US
      keyboard: aPunctuationMark
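
The naming convention (a/an, some, any, optional) amounts to attaching a
different quantifier to the same base pattern. The sketch below spells
that out with the standard 're' module; the DIGIT pattern, the
QUANTIFIER table and the expand() helper are assumptions for
illustration, not definitions taken from rex.

```python
import re

# Assumed RE for "a digit"; rex's own definition may differ.
DIGIT = r"\d"

# Naming prefix -> RE quantifier, following the convention above.
QUANTIFIER = {"a": "", "some": "+", "any": "*", "optional": "?"}


def expand(prefix, base=DIGIT):
    """Hypothetical helper: RE for the pattern named with this prefix."""
    return "(?:%s)%s" % (base, QUANTIFIER[prefix])


a_digit = expand("a")          # exactly one digit
some_digits = expand("some")   # one or more digits
any_digits = expand("any")     # zero or more digits
```

So any_digits matches the empty string while some_digits does not.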


