[Tutor] Question regular expressions - the non-greedy pattern

Steven D'Aprano steve at pearwood.info
Tue Jan 22 01:31:01 CET 2013


On 22/01/13 10:11, Marcin Mleczko wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello Hugo, hello Walter,
>
> first thank you very much for the quick reply.
>
> The functions used here i.e. re.match() are taken directly form the
> example in the mentioned HowTo. I'd rather use re.findall() but I
> think the general interpretetion of the given regexp sould be nearly
> the same in both functions.

Regular expressions are not just "nearly" the same, they are EXACTLY
the same, in whatever re function you call, with one exception:

re.match only matches at the start of the string.


> So I'd like to neglect the choise of a particular function for a
> moment a concentrate on the pure theory.
> What I got so far:
> in theory form s = '<<html><head><title>Title</title>'
> '<.*?>' would match'<html>''<head>''<title>''</title>'

Incorrect. It will match

'<<html>'
'<head>'
'<title>'
'</title>'


Why don't you try it and see?

py> s = '<<html><head><title>Title</title>'
py> import re
py> re.findall('<.*?>', s)
['<<html>', '<head>', '<title>', '</title>']


The re module is very stable. The above is what happens in every Python
version between *at least* 1.5 and 3.3.


> to achieve this the engine should:
> 1. walk forward along the text until it finds<

Correct. That matches the first "<".


> 2. walk forward from that point until in finds>

Correct. That matches the first ">".

Since the regex has now found a match, it moves on to the next part
of the regex. Since this regex is now complete, it is done, and
returns what it has found.


> 3. walk backward form that point (the one of>) until it finds<

Regexes only backtrack on *misses*, not once they successfully find
a match. Once a regex has found a match, it is done.


> 4. return the string between<  from 3. and>  from 2. as this gives the
> least possible string between<  and>

Incorrect.


> Did I get this right so far? Is this (=least possible string between<
> and>), what non-greedy really translates to?

No. The ".*" regex searches forward as far as possible; the ".*?" searches
forward as little as possible. They do not backtrack.

The only time a non-greedy regex will backtrack is if the greedy version
will backtrack. Since ".*" has no reason to backtrack, neither does ".*?".


> For some reason, I did not get so far the regexp engine in Python
> omits step 3. and returns the string between<  from 1. and>  from 2.
> resulting in '<<html>'
>
> Am I right? If so, is there an easily graspable reason for the engine
> designers to implement it this way?

Because that's the way regexes work. You would need to learn about
regular expression theory, which is not easy material. But you can start
here:

http://en.wikipedia.org/wiki/Regular_expression

and for more theoretical approach:

http://en.wikipedia.org/wiki/Chomsky_hierarchy
http://en.wikipedia.org/wiki/Regular_language
http://en.wikipedia.org/wiki/Regular_grammar

If you don't understand all the theory, don't worry, neither do I.



-- 
Steven


More information about the Tutor mailing list