<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman,new york,times,serif;font-size:12pt">Hi Steven,<br><br>Thanks a BUNCH for helping me! Yes, you were correct in assuming that my input data are already names. They're names in a column in a csv file.&nbsp; They're the names of GPs, in various formats. Besides the forms I've mentioned already there are examples such as 'Doctor's office Duh, J. &amp; Dah, J.', with or without initials and/or connecting words. There are also offices with names such as Doctor's Office 'Under the Oaks'. I want to normalise those cases too, to 'Doctor's office J. Duh &amp; J. Dah', etc. Currently I split each name on " &amp; " and apply my regexes to the parts (I use three, so I will certainly study your very fancy function!).<br><br><div>So the raw string \b means "ASCII backspace". Is that another way of saying that it means 'Word boundary'?<br><br>You're right: debugging

 regexes is a PIA. One teeny weeny mistake makes all the difference. Could one say that, in general, it's better to use a Divide and Conquer strategy and use a series of regexes and other string operations to reach one's goal?<br><br><span><a target="_blank" href="http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/">http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/</a> is interesting. I did something similar with&nbsp; unicode.translate(). Many people here have their keyboard settings as US, so accented letters are not typed very easily, and are therefore likely to be omitted (e.g. enqu</span><font face="Liberation Serif, serif">ête vs enquete).</font>
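<br><br>For the record, a rough sketch of that accent-stripping idea, assuming Python 3 and the standard unicodedata module instead of unicode.translate(). The name strip_accents is made up for illustration:<br><br>

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("enquête"))  # enquete
```

<br>Letters that have no base-plus-accent decomposition (ø or ß, for example) pass through unchanged, so this only covers the "accent is a modifier" case.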

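<br><br>On the \b question above, a small check (assuming Python 3) puts the two meanings side by side. Without the r prefix, Python itself turns \b into a backspace character before the regex engine ever sees it. With the prefix, the regex engine receives backslash-b and reads it as a word boundary:<br><br>

```python
import re

plain = "\bvan\b"   # Python converts each \b to the backspace character
raw = r"\bvan\b"    # backslash and 'b' reach the regex engine untouched

print(len(plain), len(raw))                         # 5 7
print(plain[0] == "\x08")                           # True
print(re.search(raw, "J. van Meer") is not None)    # True
print(re.search(raw, "Vanderbilt", re.IGNORECASE))  # None
print(re.search(plain, "J. van Meer"))              # None
```

<br>So the two are different things: the backspace is what a normal string literal produces, and the word boundary is what the regex engine makes of the two characters a raw string preserves.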
<br><br>Thanks again!<br><br></div>Cheers!!<br>Albert-Jan<br><br><div>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<br>All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?<br>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><br><div style="font-family: arial,helvetica,sans-serif; font-size: 13px;"><font face="Tahoma" size="2"><hr size="1"><b><span style="font-weight: bold;">From:</span></b> Steven D'Aprano &lt;steve@pearwood.info&gt;<br><b><span style="font-weight: bold;">To:</span></b> Python Mailing List &lt;tutor@python.org&gt;<br><b><span style="font-weight: bold;">Sent:</span></b> Fri, February 18, 2011 4:45:42 AM<br><b><span style="font-weight: bold;">Subject:</span></b> Re:

 [Tutor] regex questions<br></font><br>

Albert-Jan Roskam wrote:<br>&gt; Hello,<br>&gt; <br>&gt; I have a couple of regex questions:<br>&gt; <br>&gt; 1 -- In the code below, how can I match the connecting words 'van de' , 'van der', etc. (all quite common in Dutch family names)?<br><br>You need to step back a little bit and ask, what is this regex supposed to accomplish? What is your input data? Do you expect this to tell the difference between van used as a connecting word in a name, and van used otherwise?<br><br>In other words, do you want:<br><br>re.search(???, "J. van Meer")&nbsp; # matches<br>re.search(???, "The van stopped")&nbsp; # doesn't match<br><br>You might say, "Don't be silly, of course not!" *but* if you expect this regex to detect names in arbitrary pieces of text, that is exactly what you are hoping for. It is beyond the powers of a regex alone to distinguish between arbitrary text containing a name:<br><br>"... and to my nephew Johann van Meer I leave my collection of

books..."<br><br>and arbitrary text without a name:<br><br>"... and the white van merely touched the side of the building..."<br><br>You need a proper parser for that.<br><br>I will assume that your input data will already be names, and you just want to determine the connecting words:<br><br>van der<br>van den<br>van de<br>van<br><br>wherever they appear. That's easy: the only compulsory part is "van":<br><br>pattern = r"\bvan\b( de[rn]?)?"<br><br>Note the use of a raw string. Otherwise, \b doesn't mean "backslash b", but instead means "ASCII backspace".<br><br>Here's a helper function for testing:<br><br>def search(pattern, string):<br>&nbsp; &nbsp; mo = re.search(pattern, string, re.IGNORECASE)<br>&nbsp; &nbsp; if mo:<br>&nbsp; &nbsp; &nbsp; &nbsp; return mo.group(0)<br>&nbsp; &nbsp; return "--no match--"<br><br>And the result is:<br><br>&gt;&gt;&gt; names = ["J. van der Meer", "J. van den Meer", "J. van Meer",<br>... "Meer, J. van der", "Meer, J.

 van den", "Meer, J. van de",<br>... "Meer, J. van"]<br>&gt;&gt;&gt;<br>&gt;&gt;&gt; for name in names:<br>...&nbsp; &nbsp;  print(search(pattern, name))<br>...<br>van der<br>van den<br>van<br>van der<br>van den<br>van de<br>van<br><br>Don't forget to test things which should fail:<br><br>&gt;&gt;&gt; search(pattern, "Fred Smith")<br>'--no match--'<br>&gt;&gt;&gt; search(pattern, "Fred Vanderbilt")<br>'--no match--'<br><br><br><br>&gt; 2 -- It is quite hard to make a regex for all surnames, but easier to make <br><br>r"\b[a-z]+[-']?[a-z]*\b" (with re.IGNORECASE) should pretty much match all surnames using only English letters, apostrophes and hyphens. You can add in accented letters as needed.<br><br>(I'm lazy, so I haven't tested that.)<br><br><br>&gt; regexes for the initials and the connecting words. How could I ' subtract'&nbsp; those two regexes to end up with something that matches the surnames (I used two .replaces() in my code, which roughly work, but I'm thinking

 there's a re way to do it, perhaps with carets (^).<br><br>Don't try to use regexes to do too much. Regexes are a programming language, but the syntax is crap and there's a lot they can't do. They make a good tool for *parts* of your program, not the whole thing!<br><br>The best approach, I think, is something like this:<br><br><br>def detect_dutch_name(phrase):<br>&nbsp; &nbsp; """Return (Initial, Connecting-words, Surname) from a potential<br>&nbsp; &nbsp; Dutch name in the form "Initial [Connecting-words] Surname" or<br>&nbsp; &nbsp; "Surname, Initial Connecting-words".<br>&nbsp; &nbsp; """<br>&nbsp; &nbsp; pattern = (&nbsp; r"(?P&lt;surname&gt;.*?), "<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  r"(?P&lt;initial&gt;[a-z]\.) ?(?P&lt;connect&gt;van (de[rn]?))?"&nbsp; )<br>&nbsp; &nbsp; mo = re.match(pattern, phrase, re.IGNORECASE)<br>&nbsp; &nbsp; if mo:<br>&nbsp; &nbsp; &nbsp; &nbsp; return (mo.group('initial'), mo.group('connect') or

 '',<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mo.group('surname'))<br>&nbsp; &nbsp; # Try again.<br>&nbsp; &nbsp; pattern = (&nbsp; r"(?P&lt;initial&gt;[a-z]\.) "<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  r"(?P&lt;connect&gt;van (de[rn]? ))?(?P&lt;surname&gt;.*)"&nbsp; )<br>&nbsp; &nbsp; # Note: due to laziness, I accept any character at all in surnames.<br>&nbsp; &nbsp; mo = re.match(pattern, phrase, re.IGNORECASE)<br>&nbsp; &nbsp; if mo:<br>&nbsp; &nbsp; &nbsp; &nbsp; return (mo.group('initial'), mo.group('connect') or '',<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mo.group('surname'))<br>&nbsp; &nbsp; return ('', '', '')<br><br>Note that this is BUGGY -- it doesn't do exactly what you want, although it is close:<br><br>&gt;&gt;&gt; detect_dutch_name("Meer, J. van der")&nbsp; # Works fine<br>('J.', 'van der', 'Meer')<br><br>but:<br><br>&gt;&gt;&gt; detect_dutch_name("J. van der Meer")&nbsp; # almost,

 except for the space<br>('J.', 'van der ', 'Meer')<br>&gt;&gt;&gt; detect_dutch_name("J. van Meer")&nbsp; # not so good<br>('J.', '', 'van Meer')<br><br>Debugging regexes is a PITA and I'm going to handball the hard part to you :)<br><br><br>&gt; 3 -- Suppose I want to yank up my nerd rating by adding a re.NONDIACRITIC flag to the re module (matches letters independent of their accents), how would I go about? Should I subclass from re and implement the method, using the other existing methods as an example? I would find this a very useful addition.<br><br>As would lots of people, but it's not easy, and it's not even certain to me that it is always meaningful.<br><br>In English, and (I believe) Dutch, diacritics are modifiers, and so accented letters like é are considered just a variant of e. (This is not surprising, for English is closely related to Dutch.) But this is not a general rule -- many languages which use diacritics consider the accented

 letters to be distinct.<br><br><span>See, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/Diacritic">http://en.wikipedia.org/wiki/Diacritic</a> for a discussion of which diacritics are considered different letters and which are not.</span><br><br>You might like to read this:<br><br><span><a target="_blank" href="http://www.regular-expressions.info/unicode.html">http://www.regular-expressions.info/unicode.html</a></span><br><br>You can also look at "The Unicode Hammer" (a.k.a. "The Stupid American") recipe:<br><span><a target="_blank" href="http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/">http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/</a></span><br><br>Also this snippet:<br><br><span><a target="_blank" href="http://snippets.dzone.com/posts/show/5499">http://snippets.dzone.com/posts/show/5499</a></span><br><br><br>As for modifying the regex engine itself, it is written in

 C and is quite a complex beast. It's not just a matter of subclassing it, but feel free to try:<br><br><br>&gt;&gt;&gt; x = re.compile("x")<br>&gt;&gt;&gt; type(x)<br>&lt;class '_sre.SRE_Pattern'&gt;<br><br>First warning -- the engine itself is flagged as a private implementation detail!<br><br>&gt;&gt;&gt; import _sre<br>&gt;&gt;&gt; class X(_sre.SRE_Pattern):<br>...&nbsp; &nbsp;  pass<br>...<br>Traceback (most recent call last):<br>&nbsp; File "&lt;stdin&gt;", line 1, in &lt;module&gt;<br>AttributeError: 'module' object has no attribute 'SRE_Pattern'<br><br>Second warning -- the regex class is not available from the top level of the module.<br><br>&gt;&gt;&gt; class X(type(x)):<br>...&nbsp; &nbsp;  pass<br>...<br>Traceback (most recent call last):<br>&nbsp; File "&lt;stdin&gt;", line 1, in &lt;module&gt;<br>TypeError: type '_sre.SRE_Pattern' is not an acceptable base type<br><br><br>And the final problem: the C-based type is not suitable for

 subclassing in pure Python. You could try using delegation, or working on it in C.<br><br>Good luck!<br><br><br>-- Steven<br><br></div></div></div>

</div><br>

      </body></html>