[Tutor] Question regular expressions - the non-greedy pattern

Tue Jan 22 00:11:52 CET 2013

Hello Hugo, hello Walter,

first, thank you very much for the quick reply.

The functions used here, i.e. re.match(), are taken directly from the
example in the mentioned HowTo. I'd rather use re.findall(), but I
think the general interpretation of the given regexp should be nearly
the same in both functions.

So I'd like to set aside the choice of a particular function for a
moment and concentrate on the pure theory.
What I have so far:
in theory, for s = '<<html><head><title>Title</title>',
'<.*?>' would match '<html>', '<head>', '<title>' and '</title>'.
To achieve this, the engine should:
1. walk forward along the text until it finds <
2. walk forward from that point until it finds >
3. walk backward from that point (the position of >) until it finds <
4. return the string between < from 3. and > from 2., as this gives the
shortest possible string between < and >
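
The steps above can be sketched by hand, without the re module. This is
only an illustration of the procedure I have in mind, not of what the
regexp engine does; the function name shortest_tag is made up for this
example.

```python
def shortest_tag(text):
    """Hypothetical helper: follow steps 1-4 by hand, without re."""
    start = text.find('<')            # step 1: first '<' in the text
    if start == -1:
        return None
    end = text.find('>', start)       # step 2: first '>' after that '<'
    if end == -1:
        return None
    start = text.rfind('<', 0, end)   # step 3: walk back to the nearest '<'
    return text[start:end + 1]        # step 4: text from '<' to '>', inclusive


s = '<<html><head><title>Title</title>'
print(shortest_tag(s))  # prints <html>
```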

Did I get this right so far? Is this (= the shortest possible string
between < and >) what non-greedy really translates to?

For some reason that I have not understood so far, the regexp engine in
Python omits step 3. and returns the string between < from 1. and > from
2., resulting in '<<html>'.
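
To make the observation concrete, here is a minimal sketch that runs the
non-greedy and the greedy pattern over the same string with
re.findall(). The variable names are only for illustration. As far as I
can tell, the output is consistent with the start of the match being
fixed at the leftmost possible <, and non-greedy only deciding how far
the match extends to the right from there.

```python
import re

s = '<<html><head><title>Title</title>'

# Non-greedy: from the leftmost '<', stop at the first '>' that allows a match.
lazy = re.findall(r'<.*?>', s)

# Greedy: from the leftmost '<', extend to the last '>' that allows a match.
greedy = re.findall(r'<.*>', s)

print(lazy)    # ['<<html>', '<head>', '<title>', '</title>']
print(greedy)  # ['<<html><head><title>Title</title>']
```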

Am I right? If so, is there an easily graspable reason for the engine
designers to implement it this way?

If I'm wrong, where is my mistake?

Marcin

On 21.01.2013 17:23, Walter Prins wrote:
> Hi,
> 
> 
> 
> On 21 January 2013 14:45, Marcin Mleczko <Marcin.Mleczko at onet.eu>
> wrote:
> 
> Did I get the concept of non-greedy wrong or is this really a bug?
> 
> 
> Hugo's already explained the essence of your problem, but just to 
> add/reiterate:
> 
> a) match() will match at the beginning of the string (first
> character) or not at all.  As specified your regex does in fact
> match from the first character as shown so the result is correct.
> (Aside: "<html>" in "<<html>" does not in fact match *from the
> beginning of the string*, so it is beside the point for the match()
> call.)
> 
> b) Changing your regexp so that the body of the tag *cannot*
> contain "<", and then using search() instead, will fix your
> specific case for you:
> 
> import re
> 
> s = '<<html><head><title>Title</title>'
> tag_regex = '<[^<]*?>'
> 
> matchobj = re.match(tag_regex, s)
> # prints None since no match at start of s
> print "re.match() result:", matchobj
> 
> matchobj = re.search(tag_regex, s)
> # prints something since regex matches at index 1 of string
> print "re.search() result:\n",
> print "span:", matchobj.span()
> print "group:", matchobj.group()
> 
> 
> Walter
> 
> 
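
Since I'd rather use re.findall(), here is a short sketch that applies
the pattern from the quoted reply to the whole string. It assumes
nothing beyond the re module; the variable name tags is only for
illustration.

```python
import re

s = '<<html><head><title>Title</title>'

# '[^<]' forbids '<' inside the tag body, so no match can start at index 0:
# the character after the first '<' is another '<', not a body character or '>'.
tags = re.findall(r'<[^<]*?>', s)

print(tags)  # ['<html>', '<head>', '<title>', '</title>']
```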