[issue12162] Documentation about re \number

New submission from Seth Troisi <braintwo@gmail.com>: It would be nice to clarify the re documentation on how to use \number. The current documentation lists three half examples: "(.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group)." This is rather confusing (at least to me), as it might be assumed that re.search("(.+) \1", "the the") would return a match, which it does not. A better example would be re.search("(\w+) \\1", "the the"), which does match. The other confusing part is the requirement of the second "\" to make it match. I think a quick example below the text would help.
re.search("(\w+) \\1", "can you do the can can?") # \\1 matches the second can at the end of the sentence
<_sre.SRE_Match object at ...>
This is my first Python issue, and if I have misfiled or left out some information, please tell me how to proceed.

----------
assignee: docs@python
components: Documentation
messages: 136708
nosy: Seth.Troisi, docs@python
priority: normal
severity: normal
status: open
title: Documentation about re \number
type: behavior

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue12162>
_______________________________________

Changes by Ezio Melotti <ezio.melotti@gmail.com>:

----------
nosy: +ezio.melotti

R. David Murray <rdmurray@bitdance.com> added the comment: Read the description of strings and raw strings at the top of the re documentation for the answer to your question about \\. It would probably be better if the example regular expression was written r'(.+) \1' instead of as a bare expression as it is now.

----------
nosy: +r.david.murray
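To illustrate the raw-string point, here is a minimal sketch comparing the pattern strings that Python actually passes to re for each spelling. The variable names are illustrative only. The sketch relies on nothing beyond the string-literal rules: in a normal literal, "\1" is an octal escape for the single character with code 1.

```python
import re

# In a normal string literal, "\1" is an octal escape: one character, chr(1).
non_raw = "(.+) \1"
# In a raw literal the backslash is kept, so re sees a group reference.
raw = r"(.+) \1"
# Doubling the backslash in a normal literal yields the same string as the raw one.
doubled = "(.+) \\1"

print(non_raw == "(.+) " + chr(1))         # True
print(raw == doubled)                       # True
print(re.search(non_raw, "the the"))        # None: the pattern looks for a literal chr(1)
print(re.search(raw, "the the").group(0))   # the the
```

This is why the original report saw no match for the unprefixed spelling and a match once the backslash was doubled.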

Seth Troisi <braintwo@gmail.com> added the comment: Given David Murray's input I think the example would be best done as
re.search(r'(\w+) \1', "can you do the can can?") # Matches the duplicate can
<_sre.SRE_Match object at ...>
I want to stress that the documentation is not wrong but confusing, especially for someone unfamiliar with regular expressions.
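For readers who want to see what the proposed example actually matches, the sketch below inspects the match object instead of relying on its repr. The names `match`, `whole`, and `word` are illustrative and not part of the proposal.

```python
import re

match = re.search(r'(\w+) \1', "can you do the can can?")

whole = match.group(0)   # text matched by the entire pattern
word = match.group(1)    # text captured by group 1, which \1 must then repeat

print(whole)          # can can
print(word)           # can
print(match.span())   # (15, 22)
```

The first "can" in the sentence is not part of the match, because it is followed by "you" rather than a repeat of the captured word.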

Terry J. Reedy <tjreedy@udel.edu> added the comment: The doc consistently does NOT quote re's in the text. Rather, they are shaded gray, in both the Windows help version and the HTML version. So this one should not be treated differently. Most of the confusion reported is due to not reading the intro paragraphs. I almost suggested closing this without action. However, after saying to use the r prefix, the doc omits it from examples when not absolutely needed. In particular,
m = re.search('(?<=-)\w+', 'spam-egg')
Why does \w work without being doubled or protected (and it does, I checked), while \1 does not? Hell if I know. So even though that example works, it should be changed. The doc should teach the rule "if a string contains '\', prefix it with 'r'" rather than "test and add 'r' if it fails", or "learn the exact list of when needed", which is not given and is unknown to me and to almost any beginner. I advocate the same practice in the RE How To, which also has at least one example with '\' but without 'r':
p = re.compile('\d+')
I do not think we need another example other than those in the text.

----------
keywords: +patch
nosy: +terry.reedy
stage: -> needs patch
versions: +Python 2.7, Python 3.1, Python 3.2, Python 3.3

R. David Murray <rdmurray@bitdance.com> added the comment: Why it works is due to a quirk in the handling of Python strings: if an apparent escape sequence doesn't "mean anything", it is retained verbatim, including the '\' character. This is documented in http://docs.python.org/reference/lexical_analysis.html#string-literals: "Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.)" It is *very* unwise to depend on this behavior for anything except debugging; therefore those examples which do are, in my opinion, wrong.
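The behaviour described above can be observed directly. The sketch below evaluates the source text of a few string literals and shows which characters end up in the resulting string. `parse_literal` is a hypothetical helper written for this illustration. It silences warnings because recent Python versions warn about unrecognized escapes such as '\d' (a DeprecationWarning since 3.6 and a SyntaxWarning since 3.12), and later versions may reject them outright.

```python
import warnings


def parse_literal(source):
    """Evaluate the source text of a string literal and return the resulting str."""
    with warnings.catch_warnings():
        # Unrecognized escapes trigger a warning on recent Python versions.
        warnings.simplefilter("ignore")
        return eval(source)


retained = parse_literal(r"'\d+'")   # unrecognized escape: backslash is kept
octal = parse_literal(r"'\1'")       # recognized escape: octal 1, a single character

print(list(retained))   # ['\\', 'd', '+']
print(len(octal))       # 1
print(octal == chr(1))  # True
```

So '\d+' reaches re with its backslash intact only because Python does not recognize \d, while '\1' is consumed by the string literal before re ever sees it.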

Ezio Melotti <ezio.melotti@gmail.com> added the comment: The regex sets (\d\w\s\D\W\S) don't match any Python escape sequence, so even if some suggest always using r'' regardless, I don't find it necessary, especially for simple regexes. The two conflicting escape sequences to keep in mind are \b (backspace for Python, word boundary for re) and \number (octal escape for Python, reference to a group for re). There are also other regex escape sequences that are rarely used (\B\A\Z), but these don't need to be escaped either.
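The \b conflict mentioned above can be sketched in the same way as the \number one. In this minimal example the sample sentence is borrowed from earlier in the thread and the variable names are illustrative.

```python
import re

text = "can you do the can can?"

# "\b" in a normal literal is a backspace character (chr(8)), not a word boundary.
as_backspace = re.findall("\bcan\b", text)
# In a raw literal, re receives backslash + b and treats it as a word boundary.
as_boundary = re.findall(r"\bcan\b", text)

print(len("\b"), len(r"\b"))   # 1 2
print(as_backspace)            # []
print(as_boundary)             # ['can', 'can', 'can']
```

Unlike \d or \w, this spelling fails silently without the r prefix, because "\b" is a valid Python escape and produces no warning.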

R. David Murray <rdmurray@bitdance.com> added the comment: The fact that you have to think carefully about which are escapes and which aren't tells you that you should not be depending on the non-escapes not being escapes. What if we added one? The doc says preserving the \s is a debugging aid, and that is all it should be used for, IMO.

Georg Brandl added the comment: I can't see the issue here. The RE docs are much better off with the regexes unquoted. The '(.+) \1' example was fixed today (the string supposed to not match actually did match).

----------
nosy: +georg.brandl
resolution: -> works for me
status: open -> closed
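The remark that the "not matching" string did in fact match can be checked with a short sketch. It runs the pattern, written as a raw string, against the strings quoted in the original report. The string 'thethe' is added here only as an assumed example of input that lacks the space after the group.

```python
import re

pattern = r'(.+) \1'
candidates = ('the the', '55 55', 'the end', 'thethe')

# Map each candidate to the matched text, or None when search finds nothing.
results = {}
for candidate in candidates:
    found = re.search(pattern, candidate)
    results[candidate] = found.group(0) if found else None

for candidate, matched in results.items():
    print(repr(candidate), '->', repr(matched))
```

With re.search, 'the end' yields the match 'e e' (the final "e" of "the", a space, and the "e" of "end"), so it is not a valid counterexample. 'thethe' yields None because it contains no space.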

participants (5)
- Ezio Melotti
- Georg Brandl
- R. David Murray
- Seth Troisi
- Terry J. Reedy