[Python-Dev] Documentation inconsistency in re

06 Sep 2002 02:20:51 -0400

From the Library Reference (2.2.1):

\b    Matches the empty string, but only at the beginning or end of a
      word. A word is defined as a sequence of alphanumeric characters, so
      the end of a word is indicated by whitespace or a non-alphanumeric
      character. Inside a character range, \b represents the backspace
      character, for compatibility with Python's string literals.

Now reality:

Python 2.2.1 (#2, Apr 22 2002, 17:53:10) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> t = re.compile(r'\bbag\b')
>>> t.search('test bag')
<_sre.SRE_Match object at 0x812aad0>
>>> t.search('test+bag')
<_sre.SRE_Match object at 0x815d528>
>>> t.search('test_bag')
>>> [ chr(i) for i in xrange(256) if not t.search('test' + chr(i) + 'bag') ]
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D',
'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e',
'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's',
't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> 

So the implementation appears to define a word as a sequence of
alphanumeric characters or underscores, which means that either the
documentation or the library is wrong.  As it happens, a friend of mine
and I found this while looking for exactly the behavior that is
implemented, so I'd prefer that the documentation be updated to match
the implementation <.8 wink>.
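
For anyone who wants to repeat the check as a script, here is a minimal
sketch.  It assumes a Python 3 interpreter (the session above is 2.2.1),
restricts itself to the ASCII range, and uses a helper name,
non_boundary_chars, that is purely hypothetical and not part of re.  It
compares the characters that fail to produce a word boundary against the
characters matched by \w under re.ASCII.

```python
import re

# Hypothetical helper (not part of the re module): collect the ASCII
# characters that do NOT produce a word boundary in front of `word`
# when placed between `prefix` and `word`.
def non_boundary_chars(word="bag", prefix="test"):
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [chr(i) for i in range(128)
            if not pattern.search(prefix + chr(i) + word)]

word_chars = non_boundary_chars()

# Characters matched by \w with the ASCII flag, for comparison.
w_chars = [chr(i) for i in range(128) if re.match(r"\w", chr(i), re.ASCII)]

print(word_chars == w_chars)   # do the two sets agree?
print("_" in word_chars)       # is the underscore treated as a word character?
```

If both lines print True, then \b is using the same character set as \w,
that is [a-zA-Z0-9_], which is what the session above suggests.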

-- 
Christopher A. Craig <list-python@ccraig.org>
I develop for Linux for a living, I used to develop for DOS.  Going from
DOS to Linux is like trading a glider for an F117. - Lawrence Foard