[issue10713] re module doesn't describe string boundaries for \b

New submission from Ralph Corderoy ralph-pythonbugs@inputplus.co.uk:
The re module documents \b in a regexp as requiring \w on one side and \W on the other. What happens when the end of the string or line is involved? perlre(1) says that is treated as \W. Python should document that case precisely too.
---------- assignee: docs@python components: Documentation messages: 124097 nosy: docs@python, ralph.corderoy priority: normal severity: normal status: open title: re module doesn't describe string boundaries for \b
_______________________________________ Python tracker report@bugs.python.org http://bugs.python.org/issue10713 _______________________________________

Éric Araujo merwok@netwok.org added the comment:
Thanks for the report. Would you be interested in experimenting and/or reading the code to find the answer and propose a doc patch?
---------- keywords: +easy nosy: +eric.araujo stage: -> needs patch versions: +Python 2.7, Python 3.1, Python 3.2, Python 3.3

Changes by Ezio Melotti ezio.melotti@gmail.com:
---------- nosy: +ezio.melotti

Ralph Corderoy ralph-pythonbugs@inputplus.co.uk added the comment:
Examining the source of Ubuntu's python2.6 2.6.6-5ubuntu1 package suggests that anything beyond the limits of the string is treated as \W, as in Perl.
Modules/_sre.c:
    336 LOCAL(int)
    337 SRE_AT(SRE_STATE* state, SRE_CHAR* ptr, SRE_CODE at)
    338 {
    339     /* check if pointer is at given position */
    340
    341     Py_ssize_t thisp, thatp;
    ...
    365     case SRE_AT_BOUNDARY:
    366         if (state->beginning == state->end)
    367             return 0;
    368         thatp = ((void*) ptr > state->beginning) ?
    369             SRE_IS_WORD((int) ptr[-1]) : 0;
    370         thisp = ((void*) ptr < state->end) ?
    371             SRE_IS_WORD((int) ptr[0]) : 0;
    372         return thisp != thatp;
SRE_IS_WORD() returns 16 for the 63 \w characters, 0 otherwise.
This is borne out by tests.
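The boundary check quoted above can be modelled in a few lines of Python. This is only a sketch for ASCII text: the names WORD_CHARS, at_boundary and boundaries are made up for illustration and are not part of the re module or of _sre.c.

```python
import re
import string

# The 63 ASCII \w characters: letters, digits and underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def at_boundary(s, pos):
    """Hypothetical model of the SRE_AT_BOUNDARY case, ASCII only."""
    if not s:
        # Mirrors line 366: an empty string never has a boundary.
        return False
    # Positions outside the string count as non-word, like \W.
    before = pos > 0 and s[pos - 1] in WORD_CHARS
    after = pos < len(s) and s[pos] in WORD_CHARS
    return before != after


def boundaries(s):
    """Return every position in s where the model reports a boundary."""
    return [i for i in range(len(s) + 1) if at_boundary(s, i)]


if __name__ == "__main__":
    text = "foo bar"
    # Compare the model with what the re module reports for \b.
    print(boundaries(text))
    print([m.start() for m in re.finditer(r"\b", text, re.ASCII)])
```

Both lists should agree for ASCII input, including at position 0 and at len(text), which is the string-boundary case this report is about.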
Note that line 366 above confirms \b is never true for an empty string. The documentation states that \B "is just the opposite of \b", yet re.match(r'\b', '') returns None and so does re.match(r'\B', ''), so \B isn't the opposite of \b in all cases.
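A short check of the string-end and empty-string cases, as a sketch only. The variable names are illustrative, and the \B result is printed rather than asserted because it may depend on the Python version in use.

```python
import re

# \b matches at both ends of a string that starts and ends with a word character.
ends = [m.start() for m in re.finditer(r"\b", "foo")]
print(ends)

# The empty string has no word character on either side, so \b finds nothing.
empty_b = re.search(r"\b", "")
print(empty_b)

# \B on the empty string: reported as None for the versions discussed in this
# issue; other Python versions may behave differently, so it is only printed.
empty_B = re.search(r"\B", "")
print(empty_B)
```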
----------

Changes by Ron Ridley ronster76@gmail.com:
---------- nosy: +Ron.Ridley

Martin Pool mbp@sourcefrog.net added the comment:
> Note, 366 above confirms it's never true for an empty string. The documentation states that \B "is just the opposite of \b" yet re.match(r'\b', '') returns None and so does \B so \B isn't the opposite of \b in all cases.
This is also a bit strange if you follow the Perl line of reasoning of imagining there are non-word characters outside the string. And, indeed, in Perl,
"" =~ /\B/
is true.
So this patch adds some tests for \b behaviour and some docs. I think \B should possibly change as well, but that would be a bigger (perhaps impossible?) change.
---------- keywords: +patch nosy: +poolie Added file: http://bugs.python.org/file22991/20110822-1604-re-docs.diff

Ezio Melotti ezio.melotti@gmail.com added the comment:
This is a new patch based on Martin's work. I don't think it's necessary to explain what happens while using r'\b' or r'\B' on an empty string in the doc -- that's not a common case and it might end up confusing users. I think however that a couple of examples might help them figure out what they are useful for. Mentioning that they work with the beginning/end of the string too is a reasonable request, so I tweaked the doc to point that out.
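For illustration, examples of the kind meant here might look like the sketch below. The sample strings and variable names are chosen for this note and are not quoted from the patch.

```python
import re

word = re.compile(r"\bfoo\b")  # 'foo' as a whole word
inner = re.compile(r"py\B")    # 'py' followed by another word character

candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
# The start and end of the string act as boundaries, so a bare 'foo' qualifies.
whole_word_hits = [s for s in candidates if word.search(s)]

py_candidates = ["python", "py3", "py", "py.", "py!"]
# \B needs a word character on both sides, so 'py' at the end of a string fails.
inner_hits = [s for s in py_candidates if inner.search(s)]

print(whole_word_hits)
print(inner_hits)
```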
---------- stage: needs patch -> patch review type: -> enhancement versions: -Python 3.1 Added file: http://bugs.python.org/file24661/issue10713.diff

Éric Araujo merwok@netwok.org added the comment:
Like it.
----------

Roundup Robot devnull@psf.upfronthosting.co.za added the comment:
New changeset fc89e09ca2fc by Ezio Melotti in branch '2.7': #10713: Improve documentation for \b and \B and add a few tests. Initial patch and tests by Martin Pool. http://hg.python.org/cpython/rev/fc89e09ca2fc
New changeset cde7fa40b289 by Ezio Melotti in branch '3.2': #10713: Improve documentation for \b and \B and add a few tests. Initial patch and tests by Martin Pool. http://hg.python.org/cpython/rev/cde7fa40b289
New changeset b78ca038e468 by Ezio Melotti in branch 'default': #10713: merge with 3.2. http://hg.python.org/cpython/rev/b78ca038e468
---------- nosy: +python-dev

Ezio Melotti ezio.melotti@gmail.com added the comment:
Fixed, thanks for the patch!
---------- assignee: docs@python -> ezio.melotti resolution: -> fixed stage: patch review -> committed/rejected status: open -> closed
participants (6)
- Ezio Melotti
- Martin Pool
- Ralph Corderoy
- Ron Ridley
- Roundup Robot
- Éric Araujo