Finding # prefixing numbers
duncan.booth at invalid.invalid
Tue Jul 19 14:05:28 CEST 2005
peterbe at gmail.com wrote:
> In a text that contains references to numbers like this: #583 I want
> to find them with a regular expression but I'm having problems with
> the hash. Hopefully this code explains where I'm stuck:
>>>> import re
>>>> re.compile(r'\b(\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
> ['123', '234', '456']
>>>> re.compile(r'\b(X\d\d\d)\b').findall('X123 x (X234) or:X456 X6789')
> ['X123', 'X234', 'X456']
>>>> re.compile(r'\b(#\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
> []
>>>> re.compile(r'\b(\#\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
> []
> As you can guess, I'm trying to find a hash followed by 3 digits word
> bounded. As in the example above, it wouldn't have been a problem if
> the prefix was an 'X' but that's not the case here.
From the re documentation for \b:
> Matches the empty string, but only at the beginning or end of a word.
> A word is defined as a sequence of alphanumeric or underscore
> characters, so the end of a word is indicated by whitespace or a
> non-alphanumeric, non-underscore character. Note that \b is defined as
> the boundary between \w and \W, so the precise set of characters
> deemed to be alphanumeric depends on the values of the UNICODE and
> LOCALE flags. Inside a character range, \b represents the backspace
> character, for compatibility with Python's string literals.
# is not a word character (letter, digit or underscore), so \b# will match
only if the # is directly preceded by a word character, which isn't the case
in any of your examples.
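
To see this for one of the cases, here is a small sketch that lists the offsets where \b and \B match in '(#234)'. The variable names are only for illustration; re.finditer is the standard library call.

```python
import re

sample = '(#234)'

# Offsets where \b matches: a \w character on one side, a \W character
# (or the edge of the string) on the other.
boundaries = [m.start() for m in re.finditer(r'\b', sample)]

# Offsets where \B matches: every position that is not a word boundary.
non_boundaries = [m.start() for m in re.finditer(r'\B', sample)]

print(boundaries)      # [2, 5]
print(non_boundaries)  # [0, 1, 3, 4, 6]
```

Offset 1, the position just before the #, appears only in the second list, because '(' and '#' are both non-word characters.
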
Use \B (the opposite of \b: it matches only where there is no word boundary)
instead:
>>> re.compile(r'\B(#\d\d\d)\b').findall('#123 x (#234) or:#456 #6789')
['#123', '#234', '#456']
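
If you only want the digits, a variation along the same lines might look like the sketch below. The helper name find_hash_refs is made up for this example, and \d{3} is just shorthand for \d\d\d.

```python
import re

# Hypothetical helper; the pattern is the \B...\b idea from above, with the
# capture group moved so that only the digits are returned.
HASH_REF = re.compile(r'\B#(\d{3})\b')

def find_hash_refs(text):
    """Return the three-digit numbers that directly follow a '#' in text."""
    return HASH_REF.findall(text)

print(find_hash_refs('#123 x (#234) or:#456 #6789'))  # ['123', '234', '456']
print(find_hash_refs('abc#123'))                       # []
```

Note that \B in front of the # also rejects a hash glued to a preceding word character, as in 'abc#123', since there is a word boundary between 'c' and '#'. Whether that is what you want depends on your text.
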