[issue8465] Backreferences vs. escapes: a silent failure solved
report at bugs.python.org
Mon Apr 19 23:11:17 CEST 2010
New submission from Aaron Sherman <ajs at ajs.com>:
I tested this under 2.6 and 3.1. Under both, a common mistake, which I'm sure many others have made and which cost me quite some time today, is:
re.sub(r'(foo)bar', '\1baz', 'foobar')
It's obvious, I'm sure, to many reading this that the second "r" was left out before the replacement spec, so '\1' is read as an octal escape instead of a backreference. It's probably also obvious that this is going to happen quite a lot, and there are many edge cases which are equally baffling to the uninitiated (e.g. \8, \418 and \1111).
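To make the failure concrete, here is a small sketch comparing the call with and without the raw-string prefix, plus two of the edge cases mentioned. The variable names are only illustrative. \8 is left out because its handling differs between Python versions (newer ones warn about it as an invalid escape).

```python
import re

# Without the r prefix, '\1' is the octal escape for chr(1), not a backreference.
wrong = re.sub(r'(foo)bar', '\1baz', 'foobar')
# With the r prefix, the backslash reaches re.sub and \1 refers to group 1.
right = re.sub(r'(foo)bar', r'\1baz', 'foobar')

print(repr(wrong))   # '\x01baz'
print(repr(right))   # 'foobaz'

# An octal escape consumes at most three octal digits; the rest stays literal.
print(repr('\418'))   # '!8'  (octal 41 is '!', then a literal '8')
print(repr('\1111'))  # 'I1'  (octal 111 is 'I', then a literal '1')
```

No exception or warning is raised in the first case, which is what makes the mistake hard to spot.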
In order to avoid this, I'd like to request that such usage be deprecated, leaving only numeric escapes of the form matched by r'\\[0-7][0-7][0-7]?(?!\d)' as valid, non-deprecated uses (e.g. \01 or \111 are fine). Let's look at what that would do:
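As a quick check of what that pattern accepts, the sketch below applies it to a few escapes written as raw text. The names VALID_OCTAL, samples and results are made up for this illustration.

```python
import re

# The pattern from the proposal: a backslash, two or three octal digits,
# and no further digit immediately afterwards.
VALID_OCTAL = re.compile(r'\\[0-7][0-7][0-7]?(?!\d)')

samples = [r'\01', r'\111', r'\1', r'\8', r'\418', r'\1111']
results = {s: bool(VALID_OCTAL.match(s)) for s in samples}

for s, ok in results.items():
    print(s, 'valid' if ok else 'would be deprecated')
```

Only \01 and \111 match; the single-digit, non-octal and four-digit forms are all rejected.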
Right now, the standard library uses escape sequences of the form \n, where n is a single digit, in a handful of places such as sndhdr.py and difflib.py. That is not widespread enough to count as common usage, but those few would have to change to add a leading zero before the digit.
OK, so the specific requested feature is that \xxx produces a warning where xxx is:
* any single digit or
* any invalid sequence of two or three digits (e.g. containing 8 or 9) or
* any sequence of 4 or more digits
... guiding the user to the more explicit \01, \x01 or, if they intended a literal backslash, the r notation.
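The three warning cases could be sketched as a small classifier over the run of digits that follows the backslash. This is only an illustration of the rule as stated, not how the interpreter would implement it, and the function name classify_digit_escape is hypothetical.

```python
import re

def classify_digit_escape(digits):
    """Return why a backslash-digit escape would warn, or None if it is fine.

    `digits` is the run of decimal digits that follows the backslash.
    """
    if len(digits) == 1:
        return 'single digit'
    if len(digits) >= 4:
        return 'four or more digits'
    if not re.fullmatch(r'[0-7]+', digits):
        return 'contains 8 or 9'
    return None  # two or three octal digits, e.g. \01 or \111

for run in ['1', '8', '418', '1111', '01', '111']:
    print('\\' + run, '->', classify_digit_escape(run) or 'ok')
```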
If you wish to go a step further, I'd suggest adding a no-op escape \e such that:
would print "!1". Otherwise, there's no clean way to halt the interpretation of a digit-based escape sequence.
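For comparison, a sketch of how "!1" can be spelled today without the proposed \e, which does not exist in Python. Each form works, but each relies on the reader knowing exactly how many digits the escape consumes. The variable names are illustrative only.

```python
# Three ways to get '!' (octal 41, hex 21) followed by a literal '1'.
padded = '\0411'      # pad the octal escape to its three-digit maximum
hexed = '\x211'       # \x consumes exactly two hex digits
joined = '\41' '1'    # adjacent string literals are concatenated

print(padded, hexed, joined)  # !1 !1 !1
```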
components: Regular Expressions, Unicode
title: Backreferences vs. escapes: a silent failure solved
type: feature request
versions: Python 2.6, Python 3.1
Python tracker <report at bugs.python.org>