[Tutor] regex questions

Fri Feb 18 04:45:42 CET 2011

Albert-Jan Roskam wrote:
> Hello,
> 
> I have a couple of regex questions:
> 
> 1 -- In the code below, how can I match the connecting words 'van de' , 'van 
> der', etc. (all quite common in Dutch family names)?

You need to step back a little bit and ask, what is this regex supposed 
to accomplish? What is your input data? Do you expect this to tell the 
difference between van used as a connecting word in a name, and van used 
otherwise?

In other words, do you want:

re.search(???, "J. van Meer")  # matches
re.search(???, "The van stopped")  # doesn't match

You might say, "Don't be silly, of course not!" *but* if you expect this 
regex to detect names in arbitrary pieces of text, that is exactly what 
you are hoping for. It is beyond the powers of a regex alone to 
distinguish between arbitrary text containing a name:

"... and to my nephew Johann van Meer I leave my collection of books..."

and arbitrary text without a name:

"... and the white van merely touched the side of the building..."

You need a proper parser for that.

I will assume that your input data will already be names, and you just 
want to determine the connecting words:

van der
van den
van de
van

wherever they appear. That's easy: the only compulsory part is "van":

pattern = r"\bvan\b( de[rn]?)?"

Note the use of a raw string. Otherwise, \b doesn't mean "backslash b", 
but instead means "ASCII backspace".
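
If you want to see the difference for yourself, here is a small sketch. The variable names plain and raw are only for illustration:

```python
import re

plain = "\bvan\b"     # non-raw: each "\b" becomes the backspace character chr(8)
raw = r"\bvan\b"      # raw: backslash + "b", which the re module reads as a word boundary

print(len(plain))     # 5 characters: backspace, v, a, n, backspace
print(len(raw))       # 7 characters: \, b, v, a, n, \, b
print(plain[0] == chr(8))                            # True

# Only the raw version finds "van" as a word.
print(re.search(raw, "J. van Meer") is not None)     # True
print(re.search(plain, "J. van Meer") is not None)   # False
```

The non-raw pattern is looking for a literal backspace on either side of "van", which ordinary text will not contain.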

Here's a helper function for testing:

import re

def search(pattern, string):
     # Return the matched text, or a marker string if nothing matched.
     mo = re.search(pattern, string, re.IGNORECASE)
     if mo:
         return mo.group(0)
     return "--no match--"

And the result is:

 >>> names = ["J. van der Meer", "J. van den Meer", "J. van Meer",
... "Meer, J. van der", "Meer, J. van den", "Meer, J. van de",
... "Meer, J. van"]
 >>>
 >>> for name in names:
...     print search(pattern, name)
...
van der
van den
van
van der
van den
van de
van

Don't forget to test things which should fail:

 >>> search(pattern, "Fred Smith")
'--no match--'
 >>> search(pattern, "Fred Vanderbilt")
'--no match--'

> 2 -- It is quite hard to make a regex for all surnames, but easier to make 

r"\b[a-z]+[-']?[a-z]*\b" (used with re.IGNORECASE) should pretty much 
match all surnames using only English letters, apostrophes and hyphens. 
Note the raw string again. You can add in accented letters as needed.

(I'm lazy, so I haven't tested that.)
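
A quick way to check that pattern is sketched below. The helper name looks_like_surname is made up for the purpose, and I have added \Z on the assumption that you want to test one candidate word at a time rather than search inside a longer string:

```python
import re

# The surname pattern from above, anchored at the end with \Z so that the
# whole candidate word has to match (an assumption for this test).
SURNAME = re.compile(r"\b[a-z]+[-']?[a-z]*\b\Z", re.IGNORECASE)

def looks_like_surname(word):
    """Return True if word consists of letters with at most one - or '."""
    return SURNAME.match(word) is not None

for word in ["Smith", "O'Brien", "Smith-Jones", "Fred Smith", "Smith2",
             "M\u00fcller"]:
    print(looks_like_surname(word))
```

On Python 3 that prints True for the first three and False for the rest: the space, the digit and the accented u (until you add accented letters to the character classes) all stop the match. Be aware that without the \Z, re.search will happily pick out *any* word, so searching "Fred Smith" gives you "Fred". The pattern describes what a surname looks like; it doesn't know which word is the surname.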

> regexes for the initials and the connecting words. How could I ' subtract'  
> those two regexes to end up with something that matches the surnames (I used two 
> .replaces() in my code, which roughly work, but I'm thinking there's a re way to 
> do it, perhaps with carets (^).

Don't try to use regexes to do too much. Regexes are a programming 
language, but the syntax is crap and there's a lot they can't do. They 
make a good tool for *parts* of your program, not the whole thing!

The best approach, I think, is something like this:

def detect_dutch_name(phrase):
     """Return (Initial, Connecting-words, Surname) from a potential
     Dutch name in the form "Initial [Connecting-words] Surname" or
     "Surname, Initial Connecting-words".
     """
     pattern = (  r"(?P<surname>.*?), "
                  r"(?P<initial>[a-z]\.) ?(?P<connect>van (de[rn]?))?"  )
     mo = re.match(pattern, phrase, re.IGNORECASE)
     if mo:
         return (mo.group('initial'), mo.group('connect') or '',
                 mo.group('surname'))
     # Try again.
     pattern = (  r"(?P<initial>[a-z]\.) "
                  r"(?P<connect>van (de[rn]? ))?(?P<surname>.*)"  )
     # Note: due to laziness, I accept any character at all in surnames.
     mo = re.match(pattern, phrase, re.IGNORECASE)
     if mo:
         return (mo.group('initial'), mo.group('connect') or '',
                 mo.group('surname'))
     return ('', '', '')

Note that this is BUGGY -- it doesn't do exactly what you want, although 
it is close:

 >>> detect_dutch_name("Meer, J. van der")  # Works fine
('J.', 'van der', 'Meer')

but:

 >>> detect_dutch_name("J. van der Meer")  # almost, except for the space
('J.', 'van der ', 'Meer')
 >>> detect_dutch_name("J. van Meer")  # not so good
('J.', '', 'van Meer')

Debugging regexes is a PITA and I'm going to handball the hard part to 
you :)
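
If you get stuck, here is one possible way to tidy it up. Treat it as a sketch rather than the definitive fix: the names CONNECT, SURNAME_FIRST, INITIAL_FIRST and split_dutch_name are all made up for illustration, and it still only handles a single initial and the four "van" variants listed earlier. The main change is that the separating space sits *outside* the connect group, and the "de/der/den" part is optional on its own.

```python
import re

# Hypothetical names, for illustration only.
# "van", optionally followed by " de", " der" or " den".
CONNECT = r"van(?: de[rn]?)?"

# "Surname, Initial [Connecting-words]"
SURNAME_FIRST = re.compile(
    r"(?P<surname>[^,]+), (?P<initial>[a-z]\.)"
    r"(?: (?P<connect>" + CONNECT + r"))?$",
    re.IGNORECASE)

# "Initial [Connecting-words] Surname"
INITIAL_FIRST = re.compile(
    r"(?P<initial>[a-z]\.) "
    r"(?:(?P<connect>" + CONNECT + r") )?(?P<surname>.+)$",
    re.IGNORECASE)

def split_dutch_name(phrase):
    """Return (initial, connecting-words, surname), or three empty
    strings if phrase fits neither layout."""
    for regex in (SURNAME_FIRST, INITIAL_FIRST):
        mo = regex.match(phrase)
        if mo:
            return (mo.group('initial'), mo.group('connect') or '',
                    mo.group('surname'))
    return ('', '', '')
```

With that, the three cases above come out the same way:

 >>> split_dutch_name("Meer, J. van der")
('J.', 'van der', 'Meer')
 >>> split_dutch_name("J. van der Meer")
('J.', 'van der', 'Meer')
 >>> split_dutch_name("J. van Meer")
('J.', 'van', 'Meer')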

> 3 -- Suppose I want to yank up my nerd rating by adding a re.NONDIACRITIC flag 
> to the re module (matches letters independent of their accents), how would I go 
> about? Should I subclass from re and implement the method, using the other 
> existing methods as an example? I would find this a very useful addition.

As would lots of people, but it's not easy, and I'm not even sure that 
it is always meaningful.

In English, and (I believe) Dutch, diacritics are modifiers, and so 
accented letters like é are considered just a variant of e. (This is not 
surprising, for English is closely related to Dutch.) But this is not a 
general rule -- many languages which use diacritics consider the 
accented letters to be distinct letters in their own right.

See, for example, http://en.wikipedia.org/wiki/Diacritic for a 
discussion of which diacritics are considered different letters and 
which are not.
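
For text where treating accents as mere modifiers *is* acceptable, you don't need a new flag in the re module: a rough approximation is to strip the accents from the text first and then match as usual. The sketch below assumes Python 3 (or unicode strings in Python 2) and uses the standard unicodedata module; the function names strip_accents and search_ignoring_accents are my own invention, not part of re.

```python
import re
import unicodedata

def strip_accents(text):
    """Return text with combining marks (accents) removed."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything that is not a combining mark.
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

def search_ignoring_accents(pattern, text, flags=0):
    """Like re.search, but run against an accent-stripped copy of text."""
    return re.search(pattern, strip_accents(text), flags)

mo = search_ignoring_accents(r"\brene\b", "Ren\u00e9 van der Meer",
                             re.IGNORECASE)
print(mo.group(0))
```

Two caveats. The match object refers to the stripped copy, not to your original string. And letters which Unicode treats as letters in their own right, such as ø, have no decomposition, so they pass through unchanged.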