[regex] How to check for non-space character?

John Machin sjmachin at lexicon.net
Sat Mar 21 15:33:30 CET 2009

Gilles Ganault <nospam <at> nospam.com> writes:

> Hello
> Some of the addresses are missing a space between the streetname and
> the ZIP code, eg. "123 Main Street01159 Someville"

This problem appears very similar to the one you had in a previous episode,
where you were deleting <br /> in address contexts where it obviously should
have been treated as importantly as a comma or even (would you believe) a line
break.

The example botched output was "... St Johns WoodLondon ..." IIRC.

Prevention is better than cure; try to find out if your earlier code is causing
this problem.

> The following regex doesn't seem to work:

Regexes do work. If the outcome is not what you expected, it is your
expectation-to-regex translator that is not working.

What does it do? Does it match zero addresses, all addresses, many addresses
that contain a 5-digit number /followed/ by a space, something else? Could you
use the answer to that question to narrow in on the problem with your regex?

> #Check for any non-space before a five-digit number
> re_bad_address = re.compile('([^\s].)(\d{5}) ',re.I | re.S | re.M)

The comment is quite incorrect. After removing the fog of useless parentheses,
the regex says:
[^\s] -- one non-whitespace character (better written as \S)
. -- any character (more or less, see later) (why?)
\d{5} -- 5 digits
' ' -- a space (why?)
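
To see what that breakdown means in practice, here is a small sketch that runs the quoted pattern against the example address and against a corrected version of it. The second address is made up for illustration; it is not from the original post.

```python
import re

# The pattern exactly as quoted above, written as a raw string.
re_bad_address = re.compile(r'([^\s].)(\d{5}) ', re.I | re.S | re.M)

# The first address is the example from the question; the second is a
# hypothetical "good" address with the space already in place.
glued = "123 Main Street01159 Someville"
spaced = "123 Main Street 01159 Someville"

# Group 1 is "one non-whitespace character plus any character",
# group 2 is the five digits.
glued_groups = re_bad_address.search(glued).groups()
spaced_groups = re_bad_address.search(spaced).groups()

print(glued_groups)   # ('et', '01159')
print(spaced_groups)  # ('t ', '01159')
```

In the second case the . happily consumes the space in front of the ZIP code, so the pattern matches a correctly spaced address as well as a glued one.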

Then there's a hail of flags:
re.I (ignore case) -- irrelevant
re.S (DOTALL) -- makes your pointless . match any character (instead of any
character except newline). Do you have any newlines in your addresses?
re.M (MULTILINE) -- I'm 99% sure you don't need this either.
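
A quick way to check the effect of each flag is to run the same pattern with and without it. The sketch below assumes a made-up address with a newline in front of the ZIP code; the pattern is the original one with the parentheses removed.

```python
import re

pattern = r'\S.\d{5} '

# Hypothetical address with a newline between street name and ZIP code.
with_newline = "Main Street\n01159 Someville"

# Without re.S the . cannot match the newline, so there is no match.
no_flag = re.search(pattern, with_newline)
# With re.S the . matches the newline.
dotall = re.search(pattern, with_newline, re.S)

# re.I and re.M change nothing here: the pattern contains no letters
# and no ^ or $ anchors.
one_line = "123 Main Street01159 Someville"
plain = re.search(pattern, one_line).group()
flagged = re.search(pattern, one_line, re.I | re.M).group()

print(no_flag)               # None
print(repr(dotall.group()))  # 't\n01159 '
print(plain == flagged)      # True
```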

> I also tried ([^ ].), to no avail.

If not-whitespace doesn't match, changing it to not-space doesn't help.
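
For reference, the two character classes differ only on whitespace characters other than the space itself. A minimal sketch, using a handful of sample characters chosen for illustration:

```python
import re

# Compare "not whitespace" (\S) with "not a space" ([^ ]) on single characters.
results = {}
for ch in ["x", "\t", "\n", " "]:
    not_whitespace = bool(re.fullmatch(r'\S', ch))
    not_space = bool(re.fullmatch(r'[^ ]', ch))
    results[ch] = (not_whitespace, not_space)
    print(repr(ch), not_whitespace, not_space)
```

An ordinary character such as the last letter of a street name satisfies both classes, so swapping one for the other changes nothing for the example address.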

> What is the right way to tell the Python re module to check for any
> non-space character?

r'[^ ]' -- but that's NOT the question you should be asking.
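
If the underlying task is to repair addresses where the ZIP code is glued to the street name (an assumption based on the example in the question), one possible sketch is to look for a non-whitespace, non-digit character directly followed by exactly five digits and insert a space between them. The names glued_zip and split_glued_zip are hypothetical, and fixing whatever earlier step removed the separator would still be preferable.

```python
import re

# Hypothetical pattern: a character that is neither whitespace nor a digit,
# immediately followed by five digits that are not followed by a sixth.
glued_zip = re.compile(r'([^\s\d])(\d{5})(?!\d)')

def split_glued_zip(address):
    """Insert a space between a street name and a ZIP code glued to it."""
    return glued_zip.sub(r'\1 \2', address)

print(split_glued_zip("123 Main Street01159 Someville"))
print(split_glued_zip("123 Main Street 01159 Someville"))
```

The first call inserts the missing space; the second leaves the already correct address unchanged.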

