Suggestion for a new regular expression extension

Thu Nov 20 11:15:28 EST 2003

Hi,

I'm currently writing various regular expressions designed to help me parse
some real-world French postal addresses. The task is not easy due to the
vast number of abbreviations, misspellings and variations in addresses. Just
to give you a taste of what the regular expression looks like (unoptimized
and perfectible, but for now it performs well enough):

re_adresse = re.compile(r'''
    (?P<street_number>\d+(?:[ /\-]\d+)?)?
    \s*
    (?:(?P<street_number_extension>
        A
    |   B(?:IS)?
    |   C
    |   E
    |   F
    |   T(?:ER|RE)?
    |   Q(?:UATER)?
    )\b)?
    \s*
    (?P<street_type>(?:
        (?:G(?:DE?|RDE?|RANDE?)\s+)?R(?:UE)?
    ....... (snip) ....
    |   B(?:D|LD|VD|OUL(?:EVARD)?)
    ....... (snip) ....
    )\b)?
    (?:\s*(?P<street_name>.+))?
    $
''',re.X)

Note for example the many abbreviations (correct or not) of "boulevard":
BD, BLD, BVD, BOUL, BOULEVARD. For normalisation purposes, I need to
transform all those forms into the only correct abbreviation, BD.

What would be really, really neat would be a regular expression extension
notation that makes the RE engine return an arbitrary string when a
substring is matched. The standard parenthesis operator returns the matched
text, whereas this extension would return any arbitrary text on a match.

In my particular case, it would be very handy, allowing me to tell the RE
engine to return "BD" when matching B(?:D|LD|VD|OUL(?:EVARD)?). For now,
without the extension, I need a two-pass process. First I "tokenize" the
address using the big regular expression cited above, then I normalize each
token using a duplicate of the regular expression. This forces me to
maintain two separate sets of regular expressions and requires maybe twice
the processing power, whereas with an appropriate RE extension, all this
could be done in a single pass.
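
For illustration, the two-pass workaround could look roughly like the sketch
below. The pattern is cut down to a single street type, and the names
TOKENIZE, NORMALIZE_STREET_TYPE, normalize_street_type and parse are made up
for this example; they are not part of the real code.

```python
import re

# Pass 1: tokenize the address (a heavily reduced version of the big pattern).
TOKENIZE = re.compile(r'''
    (?P<street_number>\d+)?
    \s*
    (?P<street_type>(?:
        B(?:D|LD|VD|OUL(?:EVARD)?)
    )\b)?
    (?:\s*(?P<street_name>.+))?
    $
''', re.X)

# Pass 2: a second table that duplicates the alternatives above, each paired
# with its normalized form. Only "boulevard" is shown here.
NORMALIZE_STREET_TYPE = [
    (re.compile(r'B(?:D|LD|VD|OUL(?:EVARD)?)$'), 'BD'),
]


def normalize_street_type(token):
    """Return the canonical form of a street type token, or the token itself."""
    for pattern, canonical in NORMALIZE_STREET_TYPE:
        if pattern.match(token):
            return canonical
    return token


def parse(address):
    """Tokenize the address, then normalize the street type in a second pass."""
    m = TOKENIZE.match(address)
    if m is None:
        return None
    fields = m.groupdict()
    if fields['street_type'] is not None:
        fields['street_type'] = normalize_street_type(fields['street_type'])
    return fields
```

The duplication is visible in the sketch: the boulevard alternatives appear
once in TOKENIZE and once more in NORMALIZE_STREET_TYPE.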

This extension would also be quite useful for building transliterators,
especially if the returned value could include references to other captured
strings.

Let's say the extension would be written (?PR<text to return when
parenthesis matches>regular expression), with P meaning <P>ython extension
(to keep consistency within sre_parse.py) and R meaning <R>ewrite. Here is a
sample run:

>>> r = re.compile(r'^(\d+)\s+(?PR<BD>B(?:D|LD|VD|OUL(?:EVARD)?))\s+(.*)$')
>>> r.match('15 BD HAUSSMANN').groups()
('15', 'BD', 'HAUSSMANN')
>>> r.match('15 BLD HAUSSMANN').groups()
('15', 'BD', 'HAUSSMANN')
>>> r.match('15 BOULEVARD HAUSSMANN').groups()
('15', 'BD', 'HAUSSMANN')
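
Until something like this exists, the same result can be approximated in a
single pass with the current re module by naming the group and looking the
replacement up afterwards. This is only a sketch: REWRITES, address_re and
rewritten_groups are hypothetical names, and the lookup happens in Python
code after the match rather than inside the RE engine.

```python
import re

# Hypothetical table: name of a group -> text to return instead of its match.
REWRITES = {'bd': 'BD'}

address_re = re.compile(
    r'^(\d+)\s+(?P<bd>B(?:D|LD|VD|OUL(?:EVARD)?))\s+(.*)$'
)


def rewritten_groups(match):
    """Like match.groups(), but with rewritten text for the named groups
    listed in REWRITES."""
    # Pattern.groupindex maps group names to group numbers; invert it.
    names_by_index = {
        index: name for name, index in match.re.groupindex.items()
    }
    result = []
    for index, value in enumerate(match.groups(), start=1):
        name = names_by_index.get(index)
        if value is not None and name in REWRITES:
            value = REWRITES[name]
        result.append(value)
    return tuple(result)
```

With this helper, rewritten_groups(address_re.match('15 BLD HAUSSMANN'))
gives the same tuple as the sample run, at the cost of keeping the
replacement text away from the pattern it belongs to.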

Perhaps the rewriting expression could include references to other matched
groups:

>>> r = re.compile(r'(?PR<\1\1>\d+)\s+(\d+)')
>>> r.match('15 40').groups()
('1515', '40')
>>> r = re.compile(r'(?PR<\1\2>\d+)\s+(\d+)')
>>> r.match('1 4').groups()
('14', '4')

Maybe forward references would be too difficult to handle. The difficulty
would be how to handle an expression like (?PR<\2>.+)(\1) (raise an
exception?). The simplest thing to do would be to only allow back
references, or only references to the current match of the parenthesis, with
a notation like \m:

>>> r = re.compile(r'.*?(?PR<$\m.00>\d+).*')
>>> r.match('1540').group(0)
'1540'
>>> r.match('1540').group(1)
'$1540.00'

But anyway, references to other groups in the rewriting expression would
only be a bonus. The core suggestion is just the rewrite extension.
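
As a point of comparison, group references in a replacement template can
already be resolved after a match with the existing expand() method of match
objects, which uses the same \N syntax as re.sub(). The sketch below assumes
a hypothetical TEMPLATES table and rewritten_group helper; unlike the
proposal, the template lives outside the pattern and is applied after the
match.

```python
import re

# Hypothetical table: group number -> template in the standard \N syntax.
TEMPLATES = {1: r'$\1.00'}


def rewritten_group(match, index):
    """Return group `index`, expanded through its template if it has one."""
    template = TEMPLATES.get(index)
    if template is None:
        return match.group(index)
    # Match.expand() substitutes \1, \2, \g<name> ... as re.sub() would.
    return match.expand(template)


amount_match = re.match(r'.*?(\d+).*', '1540')
pair_match = re.match(r'(\d+)\s+(\d+)', '1 4')
```

Here rewritten_group(amount_match, 1) plays the role of the \m example, and
pair_match.expand(r'\1\2') combines two groups. Since expand() only runs once
the whole match is known, the forward-reference question does not arise.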

I also considered using sre.Scanner for this, but does anyone know what the
status of this class is? I ran a few tests and it seems to work, but it is
still marked as 'experimental'. Why? The last reference I saw to this class
is here:
http://aspn.activestate.com/ASPN/Mail/Message/python-dev/1614505... So, is
this class good enough for common usage? Anyway, this wouldn't suffice here
because I would need a Scanner for the full address using different
sub-Scanners for each address part...
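
For reference, a flat Scanner that rewrites a token in its action could look
like the sketch below. It is written against re.Scanner, the undocumented
name under which the class is exposed by the re module; the token labels and
the lexicon are assumptions made up for this example, and it does not address
the nested sub-Scanner problem.

```python
import re

# Lexicon: (pattern, action) pairs, tried in order at each position.
# An action receives the scanner and the matched text; None skips the token.
scanner = re.Scanner([
    (r'\d+', lambda s, tok: ('NUMBER', tok)),
    # The action returns the normalized form instead of the matched text.
    (r'B(?:D|LD|VD|OUL(?:EVARD)?)\b', lambda s, tok: ('STREET_TYPE', 'BD')),
    (r'[^\W\d]+', lambda s, tok: ('WORD', tok)),
    (r'\s+', None),
])

# scan() returns the list of action results and the unscanned remainder.
tokens, remainder = scanner.scan('15 BLD HAUSSMANN')
```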

Best regards,
Nicolas