[Tutor] Question regular expressions - the non-greedy pattern

Tue Jan 22 12:30:16 CET 2013

Now I think I got it. Thanks a lot again.

Marcin

Am 22.01.2013 12:00, schrieb tutor-request at python.org:

> 
> Message: 1
> Date: Tue, 22 Jan 2013 11:31:01 +1100
> From: Steven D'Aprano <steve at pearwood.info>
> To: tutor at python.org
> Subject: Re: [Tutor] Question regular expressions - the non-greedy
> 	pattern
> Message-ID: <50FDDDC5.6030500 at pearwood.info>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
> On 22/01/13 10:11, Marcin Mleczko wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Hello Hugo, hello Walter,
>>
>> first thank you very much for the quick reply.
>>
>> The functions used here i.e. re.match() are taken directly form the
>> example in the mentioned HowTo. I'd rather use re.findall() but I
>> think the general interpretetion of the given regexp sould be nearly
>> the same in both functions.
> 
> Regular expressions are not just "nearly" the same, they are EXACTLY
> the same, in whatever re function you call, with one exception:
> 
> re.match only matches at the start of the string.
> 
> 
>> So I'd like to neglect the choise of a particular function for a
>> moment a concentrate on the pure theory.
>> What I got so far:
>> in theory form s = '<<html><head><title>Title</title>'
>> '<.*?>' would match'<html>''<head>''<title>''</title>'
> 
> Incorrect. It will match
> 
> '<<html>'
> '<head>'
> '<title>'
> '</title>'
> 
> 
> Why don't you try it and see?
> 
> py> s = '<<html><head><title>Title</title>'
> py> import re
> py> re.findall('<.*?>', s)
> ['<<html>', '<head>', '<title>', '</title>']
> 
> 
> The re module is very stable. The above is what happens in every Python
> version between *at least* 1.5 and 3.3.
> 
> 
>> to achieve this the engine should:
>> 1. walk forward along the text until it finds<
> 
> Correct. That matches the first "<".
> 
> 
>> 2. walk forward from that point until in finds>
> 
> Correct. That matches the first ">".
> 
> Since the regex has now found a match, it moves on to the next part
> of the regex. Since this regex is now complete, it is done, and
> returns what it has found.
> 
> 
>> 3. walk backward form that point (the one of>) until it finds<
> 
> Regexes only backtrack on *misses*, not once they successfully find
> a match. Once a regex has found a match, it is done.
> 
> 
>> 4. return the string between<  from 3. and>  from 2. as this gives the
>> least possible string between<  and>
> 
> Incorrect.
> 
> 
>> Did I get this right so far? Is this (=least possible string between<
>> and>), what non-greedy really translates to?
> 
> No. The ".*" regex searches forward as far as possible; the ".*?" searches
> forward as little as possible. They do not backtrack.
> 
> The only time a non-greedy regex will backtrack is if the greedy version
> will backtrack. Since ".*" has no reason to backtrack, neither does ".*?".
> 
> 
>> For some reason, I did not get so far the regexp engine in Python
>> omits step 3. and returns the string between<  from 1. and>  from 2.
>> resulting in '<<html>'
>>
>> Am I right? If so, is there an easily graspable reason for the engine
>> designers to implement it this way?
> 
> Because that's the way regexes work. You would need to learn about
> regular expression theory, which is not easy material. But you can start
> here:
> 
> http://en.wikipedia.org/wiki/Regular_expression
> 
> and for more theoretical approach:
> 
> http://en.wikipedia.org/wiki/Chomsky_hierarchy
> http://en.wikipedia.org/wiki/Regular_language
> http://en.wikipedia.org/wiki/Regular_grammar
> 
> If you don't understand all the theory, don't worry, neither do I.
> 
> 
>