[Tutor] superscripts in a regex

Wed Jul 31 14:24:59 CEST 2013

Albert-Jan Roskam wrote:

> In the script below I want to filter out the digits and I do not want to
> retain the decimal grouping symbol, if there are any. The weird thing is
> that re.findall returns the expected result (group 1 with digits and
> optionally group2 too), but re.sub does not (it just returns the entire
> string). I tried using flags re.LOCALE, re.UNICODE, and re.DEBUG for
> solutions/clues, but no luck

> regex = "(^\d+)[.,]?(\d*)[ \w]+"
> surfaces = ["79 m\xb2", "1.000 m\xb2", "2,000 m\xb2"]

> print re.sub(regex, r"\1\2", surface)  # huh?!
> print re.findall(regex, surface)  # works as expected

Instead of "huh?!" I would have appreciated a simple

Did... Expected... But got... Why?

> It's a no-no to ask this (esp. because it concerns a builtin) but: is this
> a b-u-g?

No bug. Let's remove all the noise from your exposition. Then we get

>>> re.sub("(a+)b?(c+)d*", r"\1\2", "aaaabccdddeee")
'aaaacceee'
>>> re.findall("(a+)b?(c+)d*", "aaaabccdddeee")
[('aaaa', 'cc')]

The 'e's are left alone because the regexp does not match them, and re.sub()
only replaces the text that was matched. The fix should be obvious: include
them in the characters allowed after group #2:

>>> re.sub("(a+)b?(c+)[de]*", r"\1\2", "aaaabccdddeee")
'aaaacc'
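
If it helps to see exactly which part of the string re.sub() touched, the
span of the match can be inspected directly. A minimal sketch with the same
toy pattern, written for Python 3 (an assumption; the session above is
Python 2). The variable names are only for illustration:

```python
import re

pattern = "(a+)b?(c+)d*"
text = "aaaabccdddeee"

# re.search() reports which slice of the text the pattern consumed.
match = re.search(pattern, text)
matched = text[match.start():match.end()]   # the slice re.sub() replaces
untouched = text[match.end():]              # the slice re.sub() copies as-is

# re.sub() swaps only the matched slice for the replacement text.
result = re.sub(pattern, r"\1\2", text)
print(repr(matched), repr(untouched), repr(result))
```

The unmatched tail is whatever follows match.end(), and it comes through
unchanged in the result.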

Translating back to your regex: the byte "\xb2" is not matched by r"[ \w]":

>>> re.findall(r"[ \w]", "\xb2")
[]
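
The reason is that, with no flags given, \w on a byte string covers only
[a-zA-Z0-9_]. A small sketch of the same check in Python 3 (an assumption;
the session above is Python 2), where bytes patterns are ASCII-only like
the byte strings here and str patterns are Unicode-aware by default:

```python
import re

# Bytes pattern: \w is ASCII-only, so the lone byte 0xb2 is not a word character.
bytes_hits = re.findall(rb"[ \w]", b"\xb2")

# Str pattern: \w is Unicode-aware, and U+00B2 (superscript two) counts
# as an alphanumeric character there.
str_hits = re.findall(r"[ \w]", "\xb2")

print(bytes_hits, str_hits)
```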

Include it explicitly (why no $, by the way?)

>>> re.sub(r"(^\d+)[.,]?(\d*)[ \w\xb2]+", r"\1\2", "1.000 m\xb2")
'1000'

or implicitly

>>> re.sub(r"(^\d+)[.,]?(\d*)\D+", r"\1\2", "1.000 m\xb2")
'1000'
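
As for the $: anchoring both ends means a string with an unexpected shape
is rejected instead of being passed through half-edited. A minimal sketch
in Python 3 (an assumption; the thread uses Python 2). The names SURFACE_RE
and digits_only are hypothetical, not from your script:

```python
import re

# Hypothetical helper: the pattern must account for the whole string.
SURFACE_RE = re.compile(r"^(\d+)[.,]?(\d*)\D*$")

def digits_only(surface):
    """Return the digits of a surface string, or None if the shape is unexpected."""
    match = SURFACE_RE.match(surface)
    if match is None:
        return None
    # Join the digits before and after the optional grouping symbol.
    return match.group(1) + match.group(2)

surfaces = ["79 m\xb2", "1.000 m\xb2", "2,000 m\xb2"]
print([digits_only(s) for s in surfaces])
```

Note that this sketch allows at most one grouping symbol, like the original
regex; something like "1.000.000 m\xb2" gives None rather than a partly
cleaned string.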

and you are golden.

PS: I'll leave nudging you to use unicode instead of byte strings to someone 
else. Only so much (on a console using utf-8):

>>> re.findall("[¹]", "¹²³")
['\xc2', '\xb9', '\xc2', '\xc2']
>>> print "".join(_)
¹��

>>> re.findall(u"[¹]", u"¹²³")
[u'\xb9']
>>> print _[0]
¹
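
For comparison, a rough Python 3 version of the same experiment (an
assumption; the session above is Python 2), where the distinction is
between bytes and str rather than str and unicode:

```python
import re

text = "\u00b9\u00b2\u00b3"     # superscript one, two, three
raw = text.encode("utf-8")      # each becomes two bytes: 0xc2 plus 0xb9/0xb2/0xb3

# On bytes, the class contains two separate bytes, 0xc2 and 0xb9,
# so the shared lead byte 0xc2 matches three times.
byte_hits = re.findall(b"[\xc2\xb9]", raw)

# On str, the class contains the single character U+00B9.
str_hits = re.findall("[\u00b9]", text)

print(byte_hits)
print(str_hits)
```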