replace only full words
cerr
ron.eggler at gmail.com
Sat Sep 28 13:43:33 EDT 2013
On Saturday, September 28, 2013 4:54:35 PM UTC, Tim Chase wrote:
> On 2013-09-28 09:11, cerr wrote:
> > I have a list of sentences and a list of words. Every full word
> > that appears within a sentence shall be extended by <WORD>, i.e. "I
> > drink in the house." would become "I <drink> in the <house>." (and
> > not "I <d<rink> in the <house>.")
>
> This is a good place to reach for regular expressions. They come with
> an "ensure there is a word-boundary here" token, so you can do
> something like the code at the (way) bottom of this email. I've
> pushed it off the bottom in the event you want to try and use regexps
> on your own first. Or if this is homework, at least make you work a
> *little* :-)
>
> > Also, is there a way to make it faster?
>
> The code below should do the processing in roughly O(n) time as it
> only makes one pass through the data and does O(1) lookups into your
> set of nouns. I included code in the regexp to roughly find
> contractions and hyphenated words. Your original code grows slower
> as your list of nouns grows bigger and also suffers from
> multiple-replacement issues (if you have the noun-list of ["drink",
> "rink"], you'll get results that you likely don't want).
>
> My code hasn't considered case differences, but you should be able to
> normalize both the list of nouns and the word you're testing in the
> "modify()" function so that it would find "Drink" as well as "drink".
>
> Also, note that some words serve both as nouns and other parts of
> speech, e.g. "It's kind of you to house me for the weekend and drink
> tea with me."
>
> -tkc
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> import re
>
> r = re.compile(r"""
>     \b        # assert a word boundary
>     \w+       # 1+ word characters
>     (?:       # a group
>       [-']    # a dash or apostrophe
>       \w+     # followed by 1+ word characters
>     )?        # make the group optional (0 or 1 instances)
>     \b        # assert a word boundary here
>     """, re.VERBOSE)
>
> nouns = set([
>     "drink",
>     "house",
>     ])
>
> def modify(matchobj):
>     # called once per matched word; wrap it only if it is a known noun
>     word = matchobj.group(0)
>     if word in nouns:
>         return "<%s>" % word
>     else:
>         return word
>
> print(r.sub(modify, "I drink in the house"))
Great, only I don't have the re module on my system.... :(
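
If re really is unavailable (it normally ships with Python's standard library, so a missing module usually points to a broken or stripped-down install), a rough fallback using only string methods might look like the sketch below. The name mark_words is hypothetical and not from the quoted code, and the lower-casing is just one way to do the case normalization suggested above. It is a sketch under those assumptions rather than a drop-in replacement: it splits on whitespace, so runs of spaces collapse to one, and it ignores the contraction and hyphen handling of the regexp.

```python
import string

nouns = set(["drink", "house"])

def mark_words(sentence, words=nouns):
    """Wrap each whole word found in `words` in <...>, without using re."""
    out = []
    for token in sentence.split():
        # Peel punctuation off both ends so "house." still matches "house".
        core = token.strip(string.punctuation)
        # Compare in lower case so "Drink" matches "drink" too.
        if core and core.lower() in words:
            start = token.index(core)
            end = start + len(core)
            token = token[:start] + "<" + core + ">" + token[end:]
        out.append(token)
    # Note: joining on a single space collapses any original whitespace runs.
    return " ".join(out)

print(mark_words("I drink in the house."))
```

Because each token is compared as a whole against the set, a word like "rink" never matches inside "drink", and the set lookup keeps the cost per word independent of how many nouns there are.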