Help debugging code - Negative lookahead problem
Peter Otten
__peter__ at web.de
Sun Feb 26 13:57:22 EST 2017
michael.gauthier.uni at gmail.com wrote:
> Hi MRAB,
>
> Thanks for taking time to look at my problem!
>
> I tried your solution:
>
> r"\d{2}\s?(?=(?:years old\s?|yo\s?|yr old\s?|y o\s?|yrs old\s?|year
> old\s?)(?!son|daughter|kid|child))"
>
> but unfortunately it does not seem to work. Also, I tried adding the negative
> lookaheads after every one of the alternatives, but it does not work
> either, so the problem does not seem to be that the negative lookahead
> applies only to the last proposition... : (
>
> Also, \d{2} will only match two single digits, and won't match the last
> two digits of 101, so at least this is fine! : )
>
> Any other idea to improve that code? I'm starting to get desperate...
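A side remark on the \d{2} point quoted above: without a word boundary the regex engine is free to start a match in the middle of a longer number, so the last two digits of 101 can be picked up after all. The sketch below uses a simplified stand-in for the quoted pattern (a single unit spelling, no negative lookahead); the names unanchored and anchored are only for illustration.

```python
import re

# Simplified stand-in for the quoted pattern: one unit spelling only.
unanchored = re.compile(r"\d{2}\s?(?=years old)")
# Same pattern, but the digits must start at a word boundary.
anchored = re.compile(r"\b\d{2}\s?(?=years old)")

text = "101 years old"

loose = unanchored.search(text)   # matches the trailing "01 " of 101
strict = anchored.search(text)    # no word boundary inside "101", so no match

print(repr(loose.group()) if loose else None)
print(repr(strict.group()) if strict else None)
```

A leading \b together with \d+ avoids this kind of partial match; an upper bound on the age can then be checked in ordinary Python code.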
If your code becomes too complex to manage, break it into simpler parts.
In this case you can use two simple regular expressions:
>>> import re
>>> age = re.compile(r"\d+")
>>> child = re.compile(r"\s+kid")
>>> text = "42 bar baz foo 12 kid"
>>> for candidate in age.finditer(text):
...     if child.match(text, candidate.end()):
...         print("Kid's age:", candidate.group())
...     else:
...         print("Author's age:", candidate.group())
...
Author's age: 42
Kid's age: 12
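The part that does the work here is the second argument of the compiled pattern's match() method: it makes the match start exactly at that position instead of scanning ahead the way search() does. A minimal sketch of the same idea wrapped in a function (classify_ages is a hypothetical helper name, not something from the original code):

```python
import re

age = re.compile(r"\d+")
child = re.compile(r"\s+kid")

def classify_ages(text):
    """Return (age, is_child) pairs for every number in text.

    Hypothetical helper: is_child is True only when the child pattern
    matches directly at the end of the number.
    """
    result = []
    for candidate in age.finditer(text):
        # match() with a start position must match right there.
        is_child = child.match(text, candidate.end()) is not None
        result.append((candidate.group(), is_child))
    return result

print(classify_ages("42 bar baz foo 12 kid"))
print(classify_ages("42 bar kid"))

# For contrast: search() scans forward from the position and would find
# the " kid" that belongs to nothing in particular.
print(child.search("42 bar kid", 2) is not None)
```

With search() instead of match() the 42 in "42 bar kid" would be treated as a kid's age, which is probably not what you want.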
Applying that idea (and the principle of breaking everything into dead easy
parts) to your problem:
$ cat demo.py
import re


def longest_first(text):
    return sorted(text.splitlines(), key=len, reverse=True)


YEARS = longest_first("""\
year
years
year old
years old
yo
ys o
""")

CHILDREN = longest_first("""\
son
daughter
kid
child
""")

YEARS_RE = r"\b(?P<age>\d+) ({})".format("|".join(YEARS))
re_years = re.compile(YEARS_RE)

CHILD_RE = r" ({})\b".format("|".join(CHILDREN))
re_child = re.compile(CHILD_RE)


def followed_by_child(candidate):
    return re_child.match(candidate.string, candidate.end())


CORPUS = """\
jester, 42 years old, 20 years kidding
12 years kid
engineer, 30 years
engineer, 30 years old daughter
""".splitlines()

for text in CORPUS:
    print(text)
    for m in re_years.finditer(text):
        age = m.group("age")
        if followed_by_child(m):
            print(" rejected:", age)
        else:
            print(" accepted:", age)
$ python3 demo.py
jester, 42 years old, 20 years kidding
 accepted: 42
 accepted: 20
12 years kid
 rejected: 12
engineer, 30 years
 accepted: 30
engineer, 30 years old daughter
 rejected: 30
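In case the purpose of longest_first() is not obvious: alternatives in a regex are tried from left to right and the first one that matches wins, so "year" would shadow "years old" if it came first. The match would then end too early and the check for a following child word would start at the wrong place. A small sketch with a reduced list of units (build() and is_accepted() are hypothetical helpers, used only for this illustration):

```python
import re

UNITS = ["year", "years", "year old", "years old"]
re_child = re.compile(r" (daughter|child|son|kid)\b")

def build(alternatives):
    # Same construction as YEARS_RE, with the alternatives in the given order.
    return re.compile(r"\b(?P<age>\d+) ({})".format("|".join(alternatives)))

def is_accepted(pattern, text):
    # True if the first age found is not directly followed by a child word.
    m = pattern.search(text)
    return re_child.match(text, m.end()) is None

naive = build(UNITS)
ordered = build(sorted(UNITS, key=len, reverse=True))

text = "42 years old daughter"
print(naive.search(text).group(0), is_accepted(naive, text))
print(ordered.search(text).group(0), is_accepted(ordered, text))
```

In the unsorted variant the match stops after "42 year", the child check sees "s old daughter", and the age is accepted by mistake. With the longest alternatives first the whole "42 years old" is consumed and the following " daughter" is detected.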