Regular expressions, help?

Cameron Simpson cs at zip.com.au
Thu Apr 19 02:47:33 EDT 2012


On 18Apr2012 23:11, Sania <fantasyblue82 at gmail.com> wrote:
| So I am trying to get the number of casualties in a text. After 'death
| toll' in the text the number I need is presented as you can see from
| the variable called text. Here is my code
| I'm pretty sure my regex is correct, I think it's the group part
| that's the problem.
| I am using nltk by python. Group grabs the string in parenthesis and
| stores it in deadnum and I make deadnum into a list.
| 
|  text="accounts put the death toll at 637 and those missing at
| 653 , but the total number is likely to be much bigger"

I presume you want the 637 and not the 653.

|       dead=re.match(r".*death toll.*(\d[,\d\.]*)", text)

I always feel a little uncomfortable about double quotes and backslashes
(for all that the above is a "raw" string). Too much shell and C programming
perhaps. Anyway...

I would break this up like this:

    import sys
    re_DEATH_TOLL = r".*death toll.*(\d[,\d\.]*)"
    print("re_DEATH_TOLL =", re_DEATH_TOLL, file=sys.stderr)
    dead=re.match(re_DEATH_TOLL, text)

so I can print the raw text of the regexp _after_ python has parsed the
string.

Secondly, your regexp will match the wrong number, based on my
presumption above. Regexp quantifiers are greedy, so your second ".*"
will match as much as possible while still letting the rest of the
regexp match. It will therefore swallow everything up to the last digit
in the text, and the group will grab only the trailing "3" of 653
instead of the 637.
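
A minimal sketch of that greedy behaviour, assuming the text sits on a
single line (the variable names are mine, and the non-greedy ".*?" variant
is shown only for contrast, it is not what I'd suggest using):

```python
import re

# The sample text from the original post, joined onto one line.
text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Original pattern: the second ".*" is greedy and runs as far right as it
# can while still leaving one digit for the group.
greedy = re.match(r".*death toll.*(\d[,\d\.]*)", text)
greedy_capture = greedy.group(1)

# Same pattern with a non-greedy ".*?", which stops at the first digit.
lazy = re.match(r".*death toll.*?(\d[,\d\.]*)", text)
lazy_capture = lazy.group(1)

print(repr(greedy_capture))  # '3'
print(repr(lazy_capture))    # '637'
```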

Try (raw regexp):

    death toll\D*(\d+)
or
    death toll\D*(\d[\d,.]*)

and also use re.search instead of re.match; re.search will find the first
match anywhere in the string, avoiding complicating the regexp with a
leading ".*". \D is a non-digit. "+" means one or more, just as "*" means
zero or more.
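
Put together, a rough sketch of both suggested patterns with re.search
might look like this. The second sample sentence with "1,234" is made up
by me to show where the two patterns differ; it is not from the original
text:

```python
import re

# The sample text from the original post, joined onto one line.
text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# re.search scans for the first match anywhere, so no leading ".*" is needed.
simple = re.search(r"death toll\D*(\d+)", text)
with_separators = re.search(r"death toll\D*(\d[\d,.]*)", text)

# Always check for None: re.search returns None when nothing matches.
toll_simple = simple.group(1) if simple else None
toll_separators = with_separators.group(1) if with_separators else None
print(toll_simple, toll_separators)  # 637 637

# Hypothetical sentence with a thousands separator.
other = "the death toll rose to 1,234 overnight"
other_simple = re.search(r"death toll\D*(\d+)", other).group(1)
other_separators = re.search(r"death toll\D*(\d[\d,.]*)", other).group(1)
print(other_simple, other_separators)  # 1 1,234
```

The \d+ form stops at the first comma, so the second pattern is the one
to use if the numbers may contain separators.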

Cheers
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

I'm not weird; I'm gifted.


