regular expressions and dollar sign ($) in source string

Peter Otten __peter__ at web.de
Thu Apr 23 03:13:17 EDT 2009


Jax wrote:

> I encountered a problem with a dollar sign in the source string. It seems
> that $ requires special treatment. Below is a copy of a session in
> Python's interactive shell:
> 
> Python 2.5.2 (r252:60911, Jan  8 2009, 12:17:37)
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> a = unicode(r"(instead of $399.99)", "utf8")
> >>> print re.search(unicode(r"^\(instead of.*(\d+[.]\d+)\)$", "utf8"), a).group(1)
> 9.99
> >>> print re.search(unicode(r"^\(.*(\d+[.]\d+)\)$", "utf8"), a).group(1)
> 9.99
> >>> print re.search(unicode(r"^\(.*\$(\d+[.]\d+)\)$", "utf8"), a).group(1)
> 399.99
> 
> My question is: Why is only the third regular expression correct?

They are all correct; they just don't give what you expect. This has nothing
to do with the $. The ".*" expression is "greedy": it tries to match as
many characters as possible and leaves the following group only the minimum
it needs, here a single digit before the dot. You can see that by adding
another group:

>>> a = u"(instead of $399.99)"
>>> re.search(ur"^\(instead of(.*)(\d+[.]\d+)\)$", a).groups()
(u' $39', u'9.99')
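
The same check can be run over all three patterns from the question. This is
only a sketch, assuming Python 3 (plain str literals replace the u"..." and
ur"..." forms used in this thread); the extra group around ".*" and the names
text, patterns and results are added purely for illustration:

```python
import re

text = "(instead of $399.99)"

# The three patterns from the question, with an extra group around ".*"
# so that its share of the match becomes visible.
patterns = [
    r"^\(instead of(.*)(\d+[.]\d+)\)$",
    r"^\((.*)(\d+[.]\d+)\)$",
    r"^\((.*)\$(\d+[.]\d+)\)$",
]

results = [re.search(p, text).groups() for p in patterns]
for pattern, groups in zip(patterns, results):
    print(pattern, "->", groups)
```

In the first two patterns ".*" swallows "$39" and leaves '9.99' for the last
group. In the third, the literal \$ pins down where ".*" has to stop, so the
last group receives '399.99'.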

Fortunately there is also a non-greedy variant ".*?" which matches as few
characters as possible:

>>> a = u"(instead of $399.99)"
>>> re.search(ur"^\(instead of.*?(\d+[.]\d+)\)$", a).group(1)
u'399.99'
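
For comparison, here is a sketch that puts the greedy and non-greedy patterns
side by side, again assuming Python 3. The third variant is not from the
thread: it swaps ".*" for \D*, which can never consume a digit, and is shown
only as one possible alternative. The variable names are illustrative:

```python
import re

text = "(instead of $399.99)"

# Greedy: ".*" takes as much as it can and leaves one digit before the dot.
greedy = re.search(r"^\(instead of.*(\d+[.]\d+)\)$", text).group(1)

# Non-greedy: ".*?" takes as little as it can, so the group gets all digits.
lazy = re.search(r"^\(instead of.*?(\d+[.]\d+)\)$", text).group(1)

# Alternative (an assumption, not from the thread): \D* matches only
# non-digit characters, so it has to stop in front of the first digit.
non_digit = re.search(r"^\(instead of\D*(\d+[.]\d+)\)$", text).group(1)

print(greedy, lazy, non_digit)
```

Here greedy is '9.99', while lazy and non_digit are both '399.99'.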

Peter


