[Tutor] regex questions
Steven D'Aprano
steve at pearwood.info
Fri Feb 18 04:45:42 CET 2011
Albert-Jan Roskam wrote:
> Hello,
>
> I have a couple of regex questions:
>
> 1 -- In the code below, how can I match the connecting words 'van de' , 'van
> der', etc. (all quite common in Dutch family names)?
You need to step back a little bit and ask, what is this regex supposed
to accomplish? What is your input data? Do you expect this to tell the
difference between van used as a connecting word in a name, and van used
otherwise?
In other words, do you want:
re.search(???, "J. van Meer") # matches
re.search(???, "The van stopped") # doesn't match
You might say, "Don't be silly, of course not!" *but* if you expect this
regex to detect names in arbitrary pieces of text, that is exactly what
you are hoping for. It is beyond the powers of a regex alone to
distinguish between arbitrary text containing a name:
"... and to my nephew Johann van Meer I leave my collection of books..."
and arbitrary text without a name:
"... and the white van merely touched the side of the building..."
You need a proper parser for that.
I will assume that your input data will already be names, and you just
want to determine the connecting words:
van der
van den
van de
van
wherever they appear. That's easy: the only compulsory part is "van":
pattern = r"\bvan\b( de[rn]?)?"
Note the use of a raw string. Otherwise, \b doesn't mean "backslash b",
but instead means "ASCII backspace".
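If you want to see the difference for yourself, here is a quick check. It is only a sketch, and the variable names are mine:

```python
import re

plain = "\bvan\b"   # each "\b" is ONE character here: ASCII backspace (0x08)
raw = r"\bvan\b"    # each "\b" is TWO characters here, which re reads as a word boundary

print(len(plain))                               # 5
print(len(raw))                                 # 7
print(re.search(plain, "J. van Meer"))          # None: there are no backspaces in the name
print(re.search(raw, "J. van Meer").group(0))   # van
```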
Here's a helper function for testing:
import re

def search(pattern, string):
    # Return the matched text, or a marker string if nothing matched.
    mo = re.search(pattern, string, re.IGNORECASE)
    if mo:
        return mo.group(0)
    return "--no match--"
And the result is:
>>> names = ["J. van der Meer", "J. van den Meer", "J. van Meer",
... "Meer, J. van der", "Meer, J. van den", "Meer, J. van de",
... "Meer, J. van"]
>>>
>>> for name in names:
...     print(search(pattern, name))
...
van der
van den
van
van der
van den
van de
van
Don't forget to test things which should fail:
>>> search(pattern, "Fred Smith")
'--no match--'
>>> search(pattern, "Fred Vanderbilt")
'--no match--'
> 2 -- It is quite hard to make a regex for all surnames, but easier to make
"\b[a-z]+[-']?[a-z]*\b" should pretty much match all surnames using only
English letters, apostrophes and hyphens, provided you pass
re.IGNORECASE. You can add in accented letters as needed.
(I'm lazy, so I haven't tested that.)
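For what it is worth, a quick test along these lines shows where that pattern holds up and where it does not. Treat it as a sketch only: the is_surname helper is made up for illustration, and it assumes Python 3 strings.

```python
import re

SURNAME = re.compile(r"\b[a-z]+[-']?[a-z]*\b", re.IGNORECASE)

def is_surname(text):
    # Hypothetical helper: True only if the pattern covers the whole string.
    mo = SURNAME.match(text)
    return mo is not None and mo.group(0) == text

for candidate in ["Meer", "O'Brien", "Smith-Jones", "van-der-Meer", "Müller"]:
    print(candidate, is_surname(candidate))
```

The first three pass. "van-der-Meer" fails because the pattern allows only one hyphen or apostrophe, and "Müller" fails because ü is not in [a-z].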
> regexes for the initials and the connecting words. How could I ' subtract'
> those two regexes to end up with something that matches the surnames (I used two
> .replaces() in my code, which roughly work, but I'm thinking there's a re way to
> do it, perhaps with carets (^).
Don't try to use regexes to do too much. Regexes are a programming
language, but the syntax is crap and there's a lot they can't do. They
make a good tool for *parts* of your program, not the whole thing!
The best approach, I think, is something like this:
def detect_dutch_name(phrase):
    """Return (Initial, Connecting-words, Surname) from a potential
    Dutch name in the form "Initial [Connecting-words] Surname" or
    "Surname, Initial Connecting-words".
    """
    pattern = ( r"(?P<surname>.*?), "
                r"(?P<initial>[a-z]\.) ?(?P<connect>van (de[rn]?))?" )
    mo = re.match(pattern, phrase, re.IGNORECASE)
    if mo:
        return (mo.group('initial'), mo.group('connect') or '',
                mo.group('surname'))
    # Try again.
    pattern = ( r"(?P<initial>[a-z]\.) "
                r"(?P<connect>van (de[rn]? ))?(?P<surname>.*)" )
    # Note: due to laziness, I accept any character at all in surnames.
    mo = re.match(pattern, phrase, re.IGNORECASE)
    if mo:
        return (mo.group('initial'), mo.group('connect') or '',
                mo.group('surname'))
    return ('', '', '')
Note that this is BUGGY -- it doesn't do exactly what you want, although
it is close:
>>> detect_dutch_name("Meer, J. van der") # Works fine
('J.', 'van der', 'Meer')
but:
>>> detect_dutch_name("J. van der Meer") # almost, except for the space
('J.', 'van der ', 'Meer')
>>> detect_dutch_name("J. van Meer") # not so good
('J.', '', 'van Meer')
Debugging regexes is a PITA and I'm going to handball the hard part to
you :)
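If you get stuck, here is one possible direction, offered as a sketch rather than the definitive fix. The idea is to make " de[rn]?" optional *inside* the connecting-words group and to keep the separating space *outside* it. The names CONNECT, SURNAME_FIRST, INITIAL_FIRST and split_dutch_name are invented for this example, and surnames are still assumed to be "anything at all".

```python
import re

# "van", optionally followed by " de", " den" or " der".
CONNECT = r"van(?: de[rn]?)?"

# "Surname, Initial [Connecting-words]"
SURNAME_FIRST = re.compile(
    r"(?P<surname>[^,]+), (?P<initial>[a-z]\.)(?: (?P<connect>%s))?$" % CONNECT,
    re.IGNORECASE)

# "Initial [Connecting-words] Surname"; the space after the connecting
# words sits outside the named group, so it is not captured.
INITIAL_FIRST = re.compile(
    r"(?P<initial>[a-z]\.) (?:(?P<connect>%s) )?(?P<surname>.+)$" % CONNECT,
    re.IGNORECASE)

def split_dutch_name(phrase):
    """Return (initial, connecting-words, surname), or three empty strings."""
    for regex in (SURNAME_FIRST, INITIAL_FIRST):
        mo = regex.match(phrase)
        if mo:
            return (mo.group('initial'), mo.group('connect') or '',
                    mo.group('surname'))
    return ('', '', '')

for name in ["Meer, J. van der", "J. van der Meer", "J. van Meer",
             "J. Vanderbilt", "Fred Smith"]:
    print(split_dutch_name(name))
```

It still has rough edges. For instance, "J. van" on its own is read as a surname of "van" with no connecting words, so test it against your real data before trusting it.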
> 3 -- Suppose I want to yank up my nerd rating by adding a re.NONDIACRITIC flag
> to the re module (matches letters independent of their accents), how would I go
> about? Should I subclass from re and implement the method, using the other
> existing methods as an example? I would find this a very useful addition.
As would lots of people, but it's not easy, and it's not even certain to
me that it is always meaningful.
In English, and (I believe) Dutch, diacritics are modifiers, and so
accented letters like é are considered just a variant of e. (This is not
surprising, for English is closely related to Dutch.) But this is not a
general rule -- many languages which use diacritics consider the
accented letters to be distinct letters in their own right.
See, for example, http://en.wikipedia.org/wiki/Diacritic for a
discussion of which diacritics are considered different letters and
which are not.
You might like to read this:
http://www.regular-expressions.info/unicode.html
You can also look at "The Unicode Hammer" (a.k.a. "The Stupid American")
recipe:
http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
Also this snippet:
http://snippets.dzone.com/posts/show/5499
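In the same spirit as those recipes, a common trick is to decompose the text with the standard unicodedata module and throw away the combining marks before matching. This is a sketch under two assumptions: Python 3 strings (in Python 2 you would need unicode objects), and that treating accents as mere modifiers is acceptable for your data. The function name strip_accents is my own.

```python
import unicodedata

def strip_accents(text):
    # NFD splits "é" into "e" plus a combining acute accent;
    # unicodedata.combining() is non-zero for such combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("José Müller"))   # Jose Muller
print(strip_accents("Søren"))         # Søren (ø has no decomposition, so it survives)
```

You would then run your ordinary regex over the stripped text instead of teaching the regex engine about accents.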
As for modifying the regex engine itself, it is written in C and is
quite a complex beast. It's not just a matter of subclassing it, but
feel free to try:
>>> x = re.compile("x")
>>> type(x)
<class '_sre.SRE_Pattern'>
First warning -- the engine itself is flagged as a private
implementation detail!
>>> import _sre
>>> class X(_sre.SRE_Pattern):
...     pass
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'SRE_Pattern'
Second warning -- the regex class is not available from the top level of
the module.
>>> class X(type(x)):
...     pass
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: type '_sre.SRE_Pattern' is not an acceptable base type
And the final problem: the C-based type is not suitable for subclassing
in pure Python. You could try using delegation, or working on it in C.
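To give you an idea of what delegation might look like, here is a minimal sketch. The class holds a compiled pattern instead of inheriting from it, and folds accents out of the text before searching. FoldingPattern and fold_accents are hypothetical names, not part of the re module, and the code assumes Python 3 strings.

```python
import re
import unicodedata

def fold_accents(text):
    # Decompose, then drop combining marks (so "ü" becomes "u").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

class FoldingPattern(object):
    """Wrap a compiled regex rather than subclassing it."""

    def __init__(self, pattern, flags=0):
        self._regex = re.compile(pattern, flags)

    def search(self, text):
        # The match object refers to the folded text, not the original.
        return self._regex.search(fold_accents(text))

    def match(self, text):
        return self._regex.match(fold_accents(text))

    def __getattr__(self, name):
        # Everything else (pattern, flags, findall, sub, ...) is passed
        # straight through to the real regex, WITHOUT any folding.
        return getattr(self._regex, name)

p = FoldingPattern(r"\bmuller\b", re.IGNORECASE)
print(p.search("J. Müller").group(0))   # Muller
print(p.pattern)                        # \bmuller\b
```

Only the text is folded, not the pattern, so write the pattern without accents.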
Good luck!
--
Steven