[Python-Dev] Omission in re.sub?

Mon Dec 12 04:14:48 CET 2011

On Sun, Dec 11, 2011 at 2:36 PM, MRAB <python at mrabarnett.plus.com> wrote:
> On 11/12/2011 21:04, Guido van Rossum wrote:
>>
>> On Sun, Dec 11, 2011 at 12:47 PM, MRAB <python at mrabarnett.plus.com> wrote:
>>>
>>> On 11/12/2011 20:27, Guido van Rossum wrote:
>>>>
>>>>
>>>> On Sun, Dec 11, 2011 at 12:12 PM, MRAB <python at mrabarnett.plus.com> wrote:
>>>>>
>>>>>
>>>>> I've just come across an omission in re.sub which I hadn't noticed
>>>>> before.
>>>>>
>>>>> In re.sub the replacement string can contain escape sequences, for
>>>>> example:
>>>>>
>>>>> >>> repr(re.sub(r"x", r"\n", "axb"))
>>>>> "'a\\nb'"
>>>>>
>>>>> However:
>>>>>
>>>>> >>> repr(re.sub(r"x", r"\x0A", "axb"))
>>>>> "'a\\\\x0Ab'"
>>>>>
>>>>> Yes, it doesn't recognise "\xNN".
>>>>>
>>>>> Is there a reason for this?
>>>>>
>>>>> The regex module does the same, but is there any objection to me
>>>>> fixing it in the regex module? (I'm thinking about compatibility
>>>>> with re here.)
>>>>
>>>>
>>>>
>>>> As long as there's a way to place a single backslash in the output
>>>> this seems fine to me, though I'm not sure it's important. Of course
>>>> it will likely break some test... the test will then have to be
>>>> fixed.
>>>>
>>>> I can't remember why we did this -- is there a full list of all the
>>>> escapes that re.sub() interprets somewhere? I thought it was pretty
>>>> limited. Maybe it's the related list of escapes that are supported
>>>> in regular expressions?
>>>>
>>> The documentation says: """That is, \n is converted to a single newline
>>> character, \r is converted to a linefeed, and so forth."""
>>>
>>> All of the other escape sequences work as expected, except for \uNNNN
>>> and \UNNNNNNNN which aren't supported at all in re.
>>>
>>> I should probably also add \N{...} to the list for completeness.
>>>
>> I guess the current rule is that any escapes referring to characters
>> by a numeric value are not supported; this probably made some kind of
>> sense because \1 etc. are backreferences. But since we're discouraging
>> octal escapes anyway I think it's fine to improve over this.
>>
> A pattern can contain them, even octal escapes (must be 3 digits).
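
For illustration, a minimal sketch of what "a pattern can contain them"
looks like in practice. The variable names are made up for this example,
and it only exercises the pattern side, not the replacement string:

```python
import re

# In a *pattern*, numeric escapes already denote characters.
hex_match = re.search(r"\x0A", "a\nb")        # \xNN: two hex digits
octal_match = re.search(r"\101", "xAy")       # three octal digits: chr(0o101) == "A"

# A single digit after the backslash is a backreference instead.
backref_match = re.search(r"(a)\1", "xaay")   # \1 repeats whatever group 1 matched
```

The three-digit rule is what keeps an octal escape apart from a
backreference such as \1 in the last line.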

Fine, then I think we should model this. Though I think that we could
start deprecating octal escapes in patterns so that eventually we can
support over 99 backreferences. So maybe we should just not start
supporting octal in the substitution string now.
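
As a rough sketch of what can be done in the meantime: how a raw \xNN in
the replacement string is handled may depend on the Python version, so
the example below avoids it and uses two approaches that do not rely on
re.sub interpreting the escape at all. The variable names are
illustrative only.

```python
import re

# Let the Python string literal decode \x0A before re.sub ever sees it.
via_literal = re.sub(r"x", "\x0A", "axb")

# A callable's return value is inserted as-is, with no escape processing.
via_callable = re.sub(r"x", lambda m: "\x0A", "axb")

# \g<1> names group 1 explicitly, so the digit that follows is not
# absorbed into the group number (or read as part of an octal escape).
explicit_group = re.sub(r"(x)", r"\g<1>0", "axb")
```

The \g<N> form also sidesteps the digit ambiguity that makes octal
escapes and numbered backreferences compete for the same syntax.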

-- 
--Guido van Rossum (python.org/~guido)