[Tutor] superscripts in a regex
Peter Otten
__peter__ at web.de
Wed Jul 31 14:24:59 CEST 2013
Albert-Jan Roskam wrote:
> In the script below I want to filter out the digits and I do not want to
> retain the decimal grouping symbol, if there are any. The weird thing is
> that re.findall returns the expected result (group 1 with digits and
> optionally group2 too), but re.sub does not (it just returns the entire
> string). I tried using flags re.LOCALE, re.UNICODE, and re.DEBUG for
> solutions/clues, but no luck
> regex = "(^\d+)[.,]?(\d*)[ \w]+"
> surfaces = ["79 m\xb2", "1.000 m\xb2", "2,000 m\xb2"]
> print re.sub(regex, r"\1\2", surface) # huh?!
> print re.findall(regex, surface) # works as expected
Instead of "huh?!" I would have appreciated a simple
Did... Expected... But got... Why?
> It's a no-no to ask this (esp. because it concerns a builtin) but: is this
> a b-u-g?
No bug. Let's remove all the noise from your exposition. Then we get
>>> re.sub("(a+)b?(c+)d*", r"\1\2", "aaaabccdddeee")
'aaaacceee'
>>> re.findall("(a+)b?(c+)d*", "aaaabccdddeee")
[('aaaa', 'cc')]
The 'e's are left alone because the regexp does not match them, and
re.sub() only replaces the part of the string that matched. The fix is to
include them in the characters allowed after group #2:
>>> re.sub("(a+)b?(c+)[de]*", r"\1\2", "aaaabccdddeee")
'aaaacc'
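To see exactly which part of the string re.sub() touches, it can help to look at the match object directly. The following is only a small sketch, written for Python 3; the variable names (pattern, matched, leftover, result) are made up for illustration and do not come from the original script.

```python
import re

# Same simplified pattern and input as in the session above.
pattern = re.compile(r"(a+)b?(c+)d*")
text = "aaaabccdddeee"

m = pattern.search(text)
matched = m.group(0)        # the span that re.sub() will replace
leftover = text[m.end():]   # everything after the match is copied unchanged

# Only the matched span is replaced by groups 1 and 2.
result = pattern.sub(r"\1\2", text)

print(repr(matched), repr(leftover), repr(result))
```

Anything that falls outside m.group(0) ends up in the result unchanged, which is why the trailing 'e's survive.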
Translating back to your regex, the byte "\xb2" is not matched by r"[ \w]":
>>> re.findall(r"[ \w]", "\xb2")
[]
Include it explicitly (why no $, by the way?)
>>> re.sub(r"(^\d+)[.,]?(\d*)[ \w\xb2]+", r"\1\2", "1.000 m\xb2")
'1000'
or implicitly
>>> re.sub(r"(^\d+)[.,]?(\d*)\D+", r"\1\2", "1.000 m\xb2")
'1000'
and you are golden.
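For completeness, here is a sketch of the \D+ variant applied to all three sample strings, with the $ anchor added. It assumes Python 3 and uses bytes literals (b"...") to mimic the byte strings of the original script; the names regex, surfaces and cleaned are only illustrative.

```python
import re

# Bytes pattern: \w and \d only know ASCII here, just like the
# byte strings in the original Python 2 script.
print(re.findall(rb"[ \w]", b"\xb2"))   # the superscript byte is not matched

# \D+ accepts every non-digit byte, including b"\xb2"; the trailing $
# makes sure the whole string is consumed.
regex = re.compile(rb"^(\d+)[.,]?(\d*)\D+$")
surfaces = [b"79 m\xb2", b"1.000 m\xb2", b"2,000 m\xb2"]

cleaned = [regex.sub(rb"\1\2", surface) for surface in surfaces]
print(cleaned)
```

With the $ in place, a string with unexpected trailing content simply fails to match and is returned unchanged, rather than being partially rewritten.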
PS: I'll leave nudging you to use unicode instead of byte strings to someone
else. Just this much (on a console using UTF-8):
>>> re.findall("[¹]", "¹²³")
['\xc2', '\xb9', '\xc2', '\xc2']
>>> print "".join(_)
¹��
>>> re.findall(u"[¹]", u"¹²³")
[u'\xb9']
>>> print _[0]
¹
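The same effect can be reproduced in Python 3, where the bytes/text split is explicit. This is a sketch under that assumption; the names text, encoded, byte_hits and text_hits are made up for the example, and the superscripts are written as \u escapes so that the snippet does not depend on the source file encoding.

```python
import re

text = "\u00b9\u00b2\u00b3"        # superscript one, two, three
encoded = text.encode("utf-8")     # each superscript becomes two bytes

# As bytes, the character class holds the two bytes b"\xc2" and b"\xb9",
# so every b"\xc2" lead byte matches as well.
byte_hits = re.findall("[\u00b9]".encode("utf-8"), encoded)

# As text, the class holds the single character superscript one.
text_hits = re.findall("[\u00b9]", text)

print(byte_hits)
print(text_hits)
```

The bytes version returns four fragments, the text version exactly one character, which mirrors the two sessions above.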