greedy match wanted

Thu Mar 3 12:50:23 EST 2005

alexk wrote:
> My problem is as follows. I want to match urls, and therefore I have a
> group of long valid domain names in my regex:
> 
> .... (?:com|org|net|biz|info|ac|cc|gs|ms|
> 			 sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
> 			 com\.ag|ai|off\.ai|al|an|ao|aq|
> 			 com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...
> 
> However, for a url like kuku.com.to it matches the kuku.com part,
> while I want it to match the whole kuku.com.to. Notice that both "com"
> and "com.to" are present in the group above.
> 
> 1. How do I give precedence for "com.to" over "com" in the above group?
> Maybe I can somehow sort it by lexicographic order and then by length,
> or divide it into a set of sub-groups by length?

According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one 
pattern completely matches, that branch is accepted. This means that once A matches, B will not be 
tested further, even if it would produce a longer overall match. In other words, the "|" operator is 
never greedy."

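So putting "com\.to" before "com" in the group does what you want. More generally, any alternative 
that is a prefix of another one has to come after the longer one.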

  >>> import re
  >>> re.search(r'com|com\.to', 'kuku.com.to').group()
  'com'
  >>> re.search(r'com\.to|com', 'kuku.com.to').group()
  'com.to'
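
For a long list you probably don't want to order the alternatives by hand. Sorting by length, 
longest first, is one way to get a safe order, since a prefix is always shorter than the string it 
is a prefix of. Here is a minimal sketch, assuming the suffixes live in a plain Python list. The 
list below is a small made-up subset of yours, and the `[\w-]+\.` host part is only a placeholder 
for the rest of your pattern:

```python
import re

# Hypothetical subset of the suffix list from the question, in no special order.
suffixes = ["com", "org", "net", "to", "com.to", "com.ag", "ag"]

# Longest first, so "com.to" is tried before "com".
ordered = sorted(suffixes, key=len, reverse=True)

# re.escape() turns "com.to" into "com\.to", so the dot is matched literally.
suffix_re = "|".join(re.escape(s) for s in ordered)

# Placeholder host part followed by one of the suffixes.
pattern = re.compile(r"[\w-]+\.(?:%s)\b" % suffix_re)

match = pattern.search("see kuku.com.to for details")
result = match.group()
print(result)
```

With the suffixes in this order the search returns the whole 'kuku.com.to'. The lexicographic 
order of the alternatives doesn't matter for this; only the relative position of each prefix and 
its longer form does.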

Kent