[Tutor] regex questions

Fri Feb 18 10:36:07 CET 2011

Hi Steven,

Thanks a BUNCH for helping me! Yes, you were correct in assuming that my input 
data are already names. They're names in a column in a csv file.  They're the 
names of GPs, in various formats. Besides the forms I've mentioned already there 
are examples such as 'Doctor's office Duh, J. & Dah, J.', with or without 
initials and/or connecting words. There are also offices with names such as 
Doctor's Office 'Under the Oaks'. I want to normalise those cases too, to 
'Doctor's office J. Duh & J. Dah', etc. Currently I use .split(" & ") and apply 
my regexes (I use three, so I will certainly study your very fancy function!).
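
As a rough illustration of that split-then-normalise approach, here is a
minimal sketch. It assumes the "Doctor's office" prefix has already been
removed and that each part is either "Surname, I." or "I. Surname"; the names
reorder() and normalise() are made up for this example.

```python
import re

# "Surname, I." with a single initial; anything else is left alone.
SURNAME_FIRST = re.compile(r"^(?P<surname>[^,]+), (?P<initial>[A-Za-z]\.)$")

def reorder(name):
    """Turn 'Duh, J.' into 'J. Duh'; return other forms unchanged."""
    name = name.strip()
    mo = SURNAME_FIRST.match(name)
    if mo:
        return mo.group("initial") + " " + mo.group("surname")
    return name

def normalise(names):
    """Split on ' & ', normalise each part, and join the parts again."""
    return " & ".join(reorder(part) for part in names.split(" & "))

print(normalise("Duh, J. & Dah, J."))   # J. Duh & J. Dah
```

Connecting words such as 'van der' are not handled here; they would need their
own pattern.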

So the raw string \b means "ASCII backspace". Is that another way of 
saying that it means 'Word boundary'?

You're right: debugging regexes is a PITA. One teeny weeny mistake makes all the 
difference. Could one say that, in general, it's better to use a Divide and 
Conquer strategy and use a series of regexes and other string operations to 
reach one's goal?

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/ 
is interesting. I did something similar with  unicode.translate(). Many people 
here have their keyboard settings as US, so accented letters are not typed very 
easily, and are therefore likely to be omitted (e.g. enquête vs enquete). 

Thanks again!

Cheers!!
Albert-Jan

________________________________
From: Steven D'Aprano <steve at pearwood.info>
To: Python Mailing List <tutor at python.org>
Sent: Fri, February 18, 2011 4:45:42 AM
Subject: Re: [Tutor] regex questions

Albert-Jan Roskam wrote:
> Hello,
> 
> I have a couple of regex questions:
> 
> 1 -- In the code below, how can I match the connecting words 'van de', 'van 
> der', etc. (all quite common in Dutch family names)?

You need to step back a little bit and ask, what is this regex supposed to 
accomplish? What is your input data? Do you expect this to tell the difference 
between van used as a connecting word in a name, and van used otherwise?

In other words, do you want:

re.search(???, "J. van Meer")  # matches
re.search(???, "The van stopped")  # doesn't match

You might say, "Don't be silly, of course not!" *but* if you expect this regex 
to detect names in arbitrary pieces of text, that is exactly what you are hoping 
for. It is beyond the powers of a regex alone to distinguish between arbitrary 
text containing a name:

"... and to my nephew Johann van Meer I leave my collection of books..."

and arbitrary text without a name:

"... and the white van merely touched the side of the building..."

You need a proper parser for that.

I will assume that your input data will already be names, and you just want to 
determine the connecting words:

van der
van den
van de
van

wherever they appear. That's easy: the only compulsory part is "van":

pattern = r"\bvan\b( de[rn]?)?"

Note the use of a raw string. Otherwise, \b doesn't mean "backslash b", but 
instead means "ASCII backspace".
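
A quick way to see the difference, as a minimal sketch (the variable names
plain and raw are only for illustration):

```python
import re

plain = "\bvan\b"    # ordinary string: "\b" is one character, backspace (0x08)
raw = r"\bvan\b"     # raw string: "\b" stays as backslash followed by "b"

print(len(plain))    # 5
print(len(raw))      # 7

# Only the raw version reaches the regex engine as a word-boundary escape.
print(re.search(plain, "J. van Meer"))          # None
print(re.search(raw, "J. van Meer").group(0))   # van
```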

Here's a helper function for testing:

import re

def search(pattern, string):
    # Return the matched text, or a marker string when nothing matches.
    mo = re.search(pattern, string, re.IGNORECASE)
    if mo:
        return mo.group(0)
    return "--no match--"

And the result is:

>>> names = ["J. van der Meer", "J. van den Meer", "J. van Meer",
... "Meer, J. van der", "Meer, J. van den", "Meer, J. van de",
... "Meer, J. van"]
>>>
>>> for name in names:
...     print(search(pattern, name))
...
van der
van den
van
van der
van den
van de
van

Don't forget to test things which should fail:

>>> search(pattern, "Fred Smith")
'--no match--'
>>> search(pattern, "Fred Vanderbilt")
'--no match--'

> 2 -- It is quite hard to make a regex for all surnames, but easier to make 

r"\b[a-z]+[-']?[a-z]*\b" (used with re.IGNORECASE) should pretty much match all 
surnames using only English letters, apostrophes and hyphens. You can add in 
accented letters as needed.

(I'm lazy, so I haven't tested that.)
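
For anyone who wants to try it, a small test sketch follows. It assumes the
pattern is written as a raw string and compiled with re.IGNORECASE; the helper
name first_surname() is made up for this example.

```python
import re

surname = re.compile(r"\b[a-z]+[-']?[a-z]*\b", re.IGNORECASE)

def first_surname(text):
    """Return the first thing the surname pattern matches, or None."""
    mo = surname.search(text)
    return mo.group(0) if mo else None

print(first_surname("Meer"))          # Meer
print(first_surname("O'Brien"))       # O'Brien
print(first_surname("Smith-Jones"))   # Smith-Jones
print(first_surname("1234"))          # None

# The pattern matches any word, so it cannot pick the surname out of a
# full name by itself: here it stops at the initial.
print(first_surname("J. van Meer"))   # J
```

The last line suggests the pattern is only useful once initials and connecting
words have been dealt with separately.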

> regexes for the initials and the connecting words. How could I 'subtract' 
> those two regexes to end up with something that matches the surnames? (I used 
> two .replace() calls in my code, which roughly work, but I'm thinking there's 
> a re way to do it, perhaps with carets (^).)

Don't try to use regexes to do too much. Regexes are a programming language, but 
the syntax is crap and there's a lot they can't do. They make a good tool for 
*parts* of your program, not the whole thing!

The best approach, I think, is something like this:

def detect_dutch_name(phrase):
    """Return (Initial, Connecting-words, Surname) from a potential
    Dutch name in the form "Initial [Connecting-words] Surname" or
    "Surname, Initial Connecting-words".
    """
    pattern = (  r"(?P<surname>.*?), "
                 r"(?P<initial>[a-z]\.) ?(?P<connect>van (de[rn]?))?"  )
    mo = re.match(pattern, phrase, re.IGNORECASE)
    if mo:
        return (mo.group('initial'), mo.group('connect') or '',
                mo.group('surname'))
    # Try again.
    pattern = (  r"(?P<initial>[a-z]\.) "
                 r"(?P<connect>van (de[rn]? ))?(?P<surname>.*)"  )
    # Note: due to laziness, I accept any character at all in surnames.
    mo = re.match(pattern, phrase, re.IGNORECASE)
    if mo:
        return (mo.group('initial'), mo.group('connect') or '',
                mo.group('surname'))
    return ('', '', '')

Note that this is BUGGY -- it doesn't do exactly what you want, although it is 
close:

>>> detect_dutch_name("Meer, J. van der")  # Works fine
('J.', 'van der', 'Meer')

but:

>>> detect_dutch_name("J. van der Meer")  # almost, except for the space
('J.', 'van der ', 'Meer')
>>> detect_dutch_name("J. van Meer")  # not so good
('J.', '', 'van Meer')

Debugging regexes is a PITA and I'm going to handball the hard part to you :)
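
One possible way to tackle the hard part, offered as a sketch rather than a
finished solution: keep the separating space outside the "connect" group, make
the " de[rn]?" part optional, and anchor both patterns at the end of the
string. The names CONNECT, SURNAME_FIRST, INITIAL_FIRST and split_dutch_name()
are invented for this example, and surnames are still accepted very loosely.

```python
import re

# "van", optionally followed by " de", " der" or " den".
CONNECT = r"van(?: de[rn]?)?"

# "Surname, I." optionally followed by connecting words.
SURNAME_FIRST = re.compile(
    r"(?P<surname>[^,]+), (?P<initial>[a-z]\.)"
    r"(?: (?P<connect>" + CONNECT + r"))?$",
    re.IGNORECASE)

# "I. [connecting words] Surname"; the space after the connecting
# words sits outside the named group, so it is not captured.
INITIAL_FIRST = re.compile(
    r"(?P<initial>[a-z]\.) "
    r"(?:(?P<connect>" + CONNECT + r") )?(?P<surname>.+)$",
    re.IGNORECASE)

def split_dutch_name(phrase):
    """Return (initial, connecting words, surname), or three empty strings."""
    phrase = phrase.strip()
    for regex in (SURNAME_FIRST, INITIAL_FIRST):
        mo = regex.match(phrase)
        if mo:
            return (mo.group('initial'), mo.group('connect') or '',
                    mo.group('surname'))
    return ('', '', '')

print(split_dutch_name("Meer, J. van der"))  # ('J.', 'van der', 'Meer')
print(split_dutch_name("J. van der Meer"))   # ('J.', 'van der', 'Meer')
print(split_dutch_name("J. van Meer"))       # ('J.', 'van', 'Meer')
```

It still only knows about the "van ..." family of connecting words, so other
forms would need adding to CONNECT.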

> 3 -- Suppose I want to yank up my nerd rating by adding a re.NONDIACRITIC flag 
> to the re module (matches letters independent of their accents), how would I 
> go about it? Should I subclass from re and implement the method, using the 
> other existing methods as an example? I would find this a very useful addition.

As would lots of people, but it's not easy, and it's not even certain to me that 
it is always meaningful.

In English, and (I believe) Dutch, diacritics are modifiers, and so accented 
letters like é are considered just a variant of e. (This is not surprising, for 
English is closely related to Dutch.) But this is not a general rule -- many 
languages which use diacritics consider the accented letters to be distinct 
letters in their own right.

See, for example, http://en.wikipedia.org/wiki/Diacritic for a discussion of 
which diacritics are considered different letters and which are not.
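
For text where treating accented letters as plain variants is acceptable, a
common workaround is to strip the accents before matching instead of changing
the re module. The sketch below uses the standard unicodedata module; the
helper name strip_accents() is made up, and it leaves untouched any character
that has no decomposition into a base letter plus combining marks.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return u"".join(ch for ch in decomposed
                    if not unicodedata.combining(ch))

word = u"enqu\u00eate"        # "enquete" with a circumflex on the second e
print(strip_accents(word))    # enquete

# Matching is then done against the stripped text.
mo = re.search(r"\benquete\b", strip_accents(word), re.IGNORECASE)
print(mo.group(0))            # enquete
```

Whether this is meaningful depends on the language of the data, for the
reasons given above.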