<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman,new york,times,serif;font-size:12pt">Hi Steven,<br><br>Thanks a BUNCH for helping me! Yes, you were correct in assuming that my input data are already names. They're names in a column in a csv file. They're the names of GPs, in various formats. Besides the forms I've mentioned already there are examples such as 'Doctor's office Duh, J. & Dah, J.', with or without initials and/or connecting words. There are also offices with names such as Doctor's Office 'Under the Oaks'. I want to normalise those cases too, to 'Doctor's office J. Duh & J. Dah', etc. Currently I split on " & " with str.split() and apply my regexes (I use three, so I will certainly study your very fancy function!).<br><br><div>So the raw string \b means "ASCII backspace". Is that another way of saying that it means 'Word boundary'?<br><br>You're right: debugging
regexes is a PITA. One teeny weeny mistake makes all the difference. Could one say that, in general, it's better to use a Divide and Conquer strategy and use a series of regexes and other string operations to reach one's goal?<br><br><span><a target="_blank" href="http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/">http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/</a> is interesting. I did something similar with unicode.translate(). Many people here have their keyboard settings as US, so accented letters are not typed very easily, and are therefore likely to be omitted (e.g. enqu</span><font face="Liberation Serif, serif">ête vs enquete).</font>
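<br><br>For illustration, here is a minimal sketch of that accent-stripping idea. It is not the recipe above and not my unicode.translate() code: it assumes Python 3 and uses the standard unicodedata module instead, and the function name strip_accents is made up for this example.<br>
<pre>
```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed.

    Hypothetical helper for illustration only; assumes Python 3 str input.
    """
    # NFKD decomposition splits an accented letter into its base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "\u00ea" is e with a circumflex, as in the example word above.
print(strip_accents("enqu\u00eate"))  # prints: enquete
```
</pre>
Letters that have no decomposition (such as the Danish o with a stroke) pass through unchanged, so this only covers the "accent is a modifier" case.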
<br><br>Thanks again!<br><br></div>Cheers!!<br>Albert-Jan<br><br><div>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<br>All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?<br>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><br><div style="font-family: arial,helvetica,sans-serif; font-size: 13px;"><font face="Tahoma" size="2"><hr size="1"><b><span style="font-weight: bold;">From:</span></b> Steven D'Aprano <steve@pearwood.info><br><b><span style="font-weight: bold;">To:</span></b> Python Mailing List <tutor@python.org><br><b><span style="font-weight: bold;">Sent:</span></b> Fri, February 18, 2011 4:45:42 AM<br><b><span style="font-weight: bold;">Subject:</span></b> Re:
[Tutor] regex questions<br></font><br>
Albert-Jan Roskam wrote:<br>> Hello,<br>> <br>> I have a couple of regex questions:<br>> <br>> 1 -- In the code below, how can I match the connecting words 'van de' , 'van der', etc. (all quite common in Dutch family names)?<br><br>You need to step back a little bit and ask, what is this regex supposed to accomplish? What is your input data? Do you expect this to tell the difference between van used as a connecting word in a name, and van used otherwise?<br><br>In other words, do you want:<br><br>re.search(???, "J. van Meer") # matches<br>re.search(???, "The van stopped") # doesn't match<br><br>You might say, "Don't be silly, of course not!" *but* if you expect this regex to detect names in arbitrary pieces of text, that is exactly what you are hoping for. It is beyond the powers of a regex alone to distinguish between arbitrary text containing a name:<br><br>"... and to my nephew Johann van Meer I leave my collection of
books..."<br><br>and arbitrary text without a name:<br><br>"... and the white van merely touched the side of the building..."<br><br>You need a proper parser for that.<br><br>I will assume that your input data will already be names, and you just want to determine the connecting words:<br><br>van der<br>van den<br>van de<br>van<br><br>wherever they appear. That's easy: the only compulsory part is "van":<br><br>pattern = r"\bvan\b( de[rn]?)?"<br><br>Note the use of a raw string. Otherwise, \b doesn't mean "backslash b", but instead means "ASCII backspace".<br><br>Here's a helper function for testing:<br><br>def search(pattern, string):<br> mo = re.search(pattern, string, re.IGNORECASE)<br> if mo:<br> return mo.group(0)<br> return "--no match--"<br><br><br>And the result is:<br><br>>>> names = ["J. van der Meer", "J. van den Meer", "J. van Meer",<br>... "Meer, J. van der", "Meer, J.
van den", "Meer, J. van de",<br>... "Meer, J. van"]<br>>>><br>>>> for name in names:<br>... print search(pattern, name)<br>...<br>van der<br>van den<br>van<br>van der<br>van den<br>van de<br>van<br><br>Don't forget to test things which should fail:<br><br>>>> search(pattern, "Fred Smith")<br>'--no match--'<br>>>> search(pattern, "Fred Vanderbilt")<br>'--no match--'<br><br><br><br>> 2 -- It is quite hard to make a regex for all surnames, but easier to make <br><br>r"\b[a-z]+[-']?[a-z]*\b" should pretty much match all surnames using only English letters, apostrophes and hyphens. You can add in accented letters as needed.<br><br>(I'm lazy, so I haven't tested that.)<br><br><br>> regexes for the initials and the connecting words. How could I 'subtract' those two regexes to end up with something that matches the surnames (I used two .replaces() in my code, which roughly work, but I'm thinking
there's a re way to do it, perhaps with carets (^).<br><br>Don't try to use regexes to do too much. Regexes are a programming language, but the syntax is crap and there's a lot they can't do. They make a good tool for *parts* of your program, not the whole thing!<br><br>The best approach, I think, is something like this:<br><br><br>def detect_dutch_name(phrase):<br> """Return (Initial, Connecting-words, Surname) from a potential<br> Dutch name in the form "Initial [Connecting-words] Surname" or<br> "Surname, Initial Connecting-words".<br> """<br> pattern = ( r"(?P<surname>.*?), "<br> r"(?P<initial>[a-z]\.) ?(?P<connect>van (de[rn]?))?" )<br> mo = re.match(pattern, phrase, re.IGNORECASE)<br> if mo:<br> return (mo.group('initial'), mo.group('connect') or
'',<br> mo.group('surname'))<br> # Try again.<br> pattern = ( r"(?P<initial>[a-z]\.) "<br> r"(?P<connect>van (de[rn]? ))?(?P<surname>.*)" )<br> # Note: due to laziness, I accept any character at all in surnames.<br> mo = re.match(pattern, phrase, re.IGNORECASE)<br> if mo:<br> return (mo.group('initial'), mo.group('connect') or '',<br> mo.group('surname'))<br> return ('', '', '')<br><br>Note that this is BUGGY -- it doesn't do exactly what you want, although it is close:<br><br>>>> detect_dutch_name("Meer, J. van der") # Works fine<br>('J.', 'van der', 'Meer')<br><br>but:<br><br>>>> detect_dutch_name("J. van der Meer") # almost,
except for the space<br>('J.', 'van der ', 'Meer')<br>>>> detect_dutch_name("J. van Meer") # not so good<br>('J.', '', 'van Meer')<br><br>Debugging regexes is a PITA and I'm going to handball the hard part to you :)<br><br><br>> 3 -- Suppose I want to yank up my nerd rating by adding a re.NONDIACRITIC flag to the re module (matches letters independent of their accents), how would I go about it? Should I subclass from re and implement the method, using the other existing methods as an example? I would find this a very useful addition.<br><br>As would lots of people, but it's not easy, and it's not even certain to me that it is always meaningful.<br><br>In English, and (I believe) Dutch, diacritics are modifiers, and so accented letters like é are considered just a variant of e. (This is not surprising, for English is closely related to Dutch.) But this is not a general rule -- many languages which use diacritics consider the accented
letters to be distinct letters.<br><br><span>See, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/Diacritic">http://en.wikipedia.org/wiki/Diacritic</a> for a discussion of which diacritics are considered different letters and which are not.</span><br><br>You might like to read this:<br><br><span><a target="_blank" href="http://www.regular-expressions.info/unicode.html">http://www.regular-expressions.info/unicode.html</a></span><br><br>You can also look at "The Unicode Hammer" (a.k.a. "The Stupid American") recipe:<br><span><a target="_blank" href="http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/">http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/</a></span><br><br>Also this snippet:<br><br><span><a target="_blank" href="http://snippets.dzone.com/posts/show/5499">http://snippets.dzone.com/posts/show/5499</a></span><br><br><br>As for modifying the regex engine itself, it is written in
C and is quite a complex beast. It's not just a matter of subclassing it, but feel free to try:<br><br><br>>>> x = re.compile("x")<br>>>> type(x)<br><class '_sre.SRE_Pattern'><br><br>First warning -- the engine itself is flagged as a private implementation detail!<br><br>>>> import _sre<br>>>> class X(_sre.SRE_Pattern):<br>... pass<br>...<br>Traceback (most recent call last):<br> File "<stdin>", line 1, in <module><br>AttributeError: 'module' object has no attribute 'SRE_Pattern'<br><br>Second warning -- the regex class is not available from the top level of the module.<br><br>>>> class X(type(x)):<br>... pass<br>...<br>Traceback (most recent call last):<br> File "<stdin>", line 1, in <module><br>TypeError: type '_sre.SRE_Pattern' is not an acceptable base type<br><br><br>And the final problem: the C-based type is not suitable for
subclassing in pure Python. You could try using delegation, or working on it in C.<br><br>Good luck!<br><br><br>-- Steven<br><br></div></div></div>
</div><br>
</body></html>