re.sub unexpected behaviour

Tue Jul 6 13:32:29 EDT 2010

On 07/06/2010 07:10 PM, Javier Collado wrote:
> Hello,
> 
> Let's imagine that we have a simple function that generates a
> replacement for a regular expression:
> 
> def process(match):
>     return match.string
> 
> If we use that simple function with re.sub using a simple pattern and
> a string we get the expected output:
> re.sub('123', process, '123')
> '123'
> 
> However, if the string passed to re.sub contains a trailing new line
> character, then we get an extra new line character unexpectedly:
> re.sub(r'123', process, '123\n')
> '123\n\n'

process returns match.string, which is, according to the docs:

"""The string passed to match() or search()"""

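You passed "123\n" to sub(), which isn't explicitly listed there, but it
works the same way. process correctly returns "123\n", which is inserted
in place of the matched "123"; the trailing "\n" of the original string
is left untouched, so you end up with "123\n\n". Let me demonstrate again
with a longer string: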

>>> import re
>>> def process(match):
...     return match.string
...
>>> re.sub(r'\d+', process, "start,123,end")
'start,start,123,end,end'
>>>
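
To make the difference between the whole input and the matched text
explicit, here is a minimal sketch run against the original "123\n"
input. The helper names show_whole_input and show_matched_text are
made up for illustration; only match.string and match.group(0) come
from the re module.

```python
import re

def show_whole_input(match):
    # match.string is the entire string handed to re.sub(),
    # not just the part the pattern matched.
    return match.string

def show_matched_text(match):
    # match.group(0) is only the text the pattern actually matched.
    return match.group(0)

text = "123\n"
with_string = re.sub(r"123", show_whole_input, text)
with_group = re.sub(r"123", show_matched_text, text)

print(repr(with_string))  # '123\n\n': "123" replaced by "123\n", old "\n" kept
print(repr(with_group))   # '123\n': "123" replaced by itself
```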

> 
> If we try to get the same result using a replacement string, instead
> of a function, the strange behaviour cannot be reproduced:
> re.sub(r'123', '123', '123')
> '123'
> 
> re.sub('123', '123', '123\n')
> '123\n'

Again, the behaviour is correct: you're not asking for "whatever was
passed to sub()", but for '123', and that's what you're getting.

> 
> Is there any explanation for this? If I'm skipping something when
> using a replacement function with re.sub, please let me know.

What you want is the matched text, which you can get with grouping:

>>> def process(match):
...     return "<<" + match.group(1) + ">>"
...
>>> re.sub(r'(\d+)', process, "start,123,end")
'start,<<123>>,end'
>>>

or better, without a function:

>>> re.sub(r'(\d+)', r'<<\1>>', "start,123,end")
'start,<<123>>,end'
>>>
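
If you only need the whole match, the explicit parentheses are optional,
since group 0 always refers to the entire match. A small sketch, assuming
the same "start,123,end" input as above (the name wrap is hypothetical):

```python
import re

text = "start,123,end"

def wrap(match):
    # group(0) is the whole match, so the pattern needs no capturing group.
    return "<<" + match.group(0) + ">>"

wrapped_by_function = re.sub(r"\d+", wrap, text)

# In a replacement template, \g<0> stands for the whole match.
wrapped_by_template = re.sub(r"\d+", r"<<\g<0>>>", text)

print(wrapped_by_function)  # start,<<123>>,end
print(wrapped_by_template)  # start,<<123>>,end
```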

Cheers,
Thomas