[Python-Dev] Discordance in documentation...

Guido van Rossum guido at python.net
Thu Sep 4 22:31:21 EDT 2003


> ...or is this just me?

No, there are many others who are just as misguided. :-)

> Let's take a look, Reference Lib, 4.2.1 Regular Expression Syntax says:
> 
>    "|"
>            A|B, where A and B can be arbitrary REs, creates a regular
>            expression that will match either A or B.
>            [...]
>            REs separated by "|" are tried from left to right, and the 
>            first one that allows the complete pattern to match is considered 
>            the accepted branch. This means that if A matches, B will never 
>            be tested, even if it would produce a longer overall match. [...]
> 
> And now a little test:
> 
> import re
> a = "Fuentes Rushdie Marquez"
> print re.search("Rushdie|Fuentes", a).group() # returns "Fuentes"
> 
> According to the documentation, I expected it to return "Rushdie" 
> rather than "Fuentes", but it looks like it returns the first part of the
> string that matches rather than the first part of the regular expression.

There's probably an official answer for this, but maybe this helps
explain it: re.search("A|B", s) doesn't first search s for A and
then for B.  Rather, for each successive position of s, it tries to
match "A|B" *starting at that point*.  So when using search(), the
leftmost match in s wins.  The rule about which branch of the |
wins only applies when both branches match at the same position.
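
A small sketch of the distinction, using the string from the question.
The second pair of patterns ("Fu|Fuentes" and "Fuentes|Fu") is made up
here purely to show the same-position case:

```python
import re

s = "Fuentes Rushdie Marquez"

# Both branches match somewhere in s, but "Fuentes" matches at position 0,
# so the search stops there regardless of the order of the branches.
first = re.search("Rushdie|Fuentes", s).group()

# Branch order only matters when both branches match at the same position.
# Here both "Fu" and "Fuentes" match at position 0, so the left branch wins.
short_first = re.search("Fu|Fuentes", s).group()
long_first = re.search("Fuentes|Fu", s).group()

print(first)        # Fuentes
print(short_first)  # Fu
print(long_first)   # Fuentes
```

Note that in the "Fu|Fuentes" case the shorter branch is accepted even
though the other branch would give a longer match, which is what the
documentation quoted above describes.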

(In general, anchored matching is the more fundamental re operation,
and searching is done by trying to match at successive positions until
a match is found.)
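
To illustrate that relationship, here is a rough sketch that emulates
searching with anchored matches.  The helper name search_by_matching is
hypothetical, and this is only a conceptual model, not a claim about how
the re module is implemented internally:

```python
import re

def search_by_matching(pattern, s):
    """Hypothetical helper: emulate searching via anchored matching."""
    compiled = re.compile(pattern)
    # Try an anchored match at each successive position of s
    # (including the end, so patterns matching the empty string work).
    for pos in range(len(s) + 1):
        m = compiled.match(s, pos)
        if m is not None:
            return m  # the leftmost position that matches wins
    return None

s = "Fuentes Rushdie Marquez"
found = search_by_matching("Rushdie|Fuentes", s)
print(found.group(), found.span())  # Fuentes (0, 7)
```

For this input the result agrees with re.search("Rushdie|Fuentes", s):
the loop succeeds at position 0 and never gets as far as "Rushdie".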

--Guido van Rossum (home page: http://www.python.org/~guido/)


