replace only full words

Sat Sep 28 13:43:33 EDT 2013

On Saturday, September 28, 2013 4:54:35 PM UTC, Tim Chase wrote:
> On 2013-09-28 09:11, cerr wrote:
> > I have a list of sentences and a list of words. Every full word
> > that appears within a sentence should be wrapped as <WORD>, e.g. "I
> > drink in the house." would become "I <drink> in the <house>." (and
> > not "I <d<rink>> in the <house>.")
>
> This is a good place to reach for regular expressions.  They come with
> an "ensure there is a word boundary here" token (\b), so you can do
> something like the code at the (way) bottom of this email.  I've
> pushed it off the bottom in case you want to try regexps on your own
> first.  Or, if this is homework, to at least make you work a
> *little* :-)
>
> > Also, is there a way to make it faster?
>
> The code below should do the processing in roughly O(n) time in the
> length of the text, as it makes only one pass through the data and
> does O(1) lookups into your set of nouns.  I included code in the
> regexp to roughly find contractions and hyphenated words.  Your
> original code grows slower as your list of nouns grows and also
> suffers from multiple-replacement issues (if you have the noun list
> ["drink", "rink"], you'll get results that you likely don't want).
>
> My code doesn't handle case differences, but you should be able to
> normalize both the list of nouns and the word you're testing in the
> "modify()" function (e.g. by lowercasing both) so that it finds
> "Drink" as well as "drink".
>
> Also, note that some words serve both as nouns and as other parts of
> speech, e.g. "It's kind of you to house me for the weekend and drink
> tea with me."
>
> -tkc
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> import re
>
> r = re.compile(r"""
>   \b    # assert a word boundary
>   \w+   # 1+ word characters
>   (?:   # a non-capturing group
>    [-']  # a dash or apostrophe
>    \w+   # followed by 1+ word characters
>    )?    # make the group optional (0 or 1 instances)
>   \b    # assert a word boundary here
>   """, re.VERBOSE)
>
> nouns = set([
>   "drink",
>   "house",
>   ])
>
> def modify(matchobj):
>   # Wrap the matched word in <...> only if it is a known noun.
>   word = matchobj.group(0)
>   if word in nouns:
>     return "<%s>" % word
>   else:
>     return word
>
> # Parenthesized so it runs under both Python 2 and Python 3.
> print(r.sub(modify, "I drink in the house"))

Great, only I don't have the re module on my system.... :(
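
Without re, a plain character scan might work as a fallback. Below is a rough sketch of that idea, not a drop-in replacement for the regexp version: the names is_word_char and mark_words are made up for illustration, a "word" is assumed to be a run of letters, digits, or underscores, and contractions and hyphenated words are not treated as single words the way the regexp above tries to. It lowercases each word for the set lookup, so "Drink" is matched as well as "drink".

```python
# Hypothetical fallback that avoids the re module entirely.
nouns = set(["drink", "house"])

def is_word_char(ch):
    # Rough stand-in for \w: letters, digits, underscore.
    return ch.isalnum() or ch == "_"

def mark_words(sentence, words):
    # Single pass over the sentence: collect each run of word
    # characters, then wrap it in <...> if it is in the word set.
    out = []
    i = 0
    n = len(sentence)
    while i < n:
        if is_word_char(sentence[i]):
            j = i
            while j < n and is_word_char(sentence[j]):
                j += 1
            token = sentence[i:j]
            if token.lower() in words:
                out.append("<%s>" % token)
            else:
                out.append(token)
            i = j
        else:
            # Punctuation and whitespace are copied through unchanged.
            out.append(sentence[i])
            i += 1
    return "".join(out)

print(mark_words("I drink in the house.", nouns))
```

Because only complete runs of word characters are looked up, a word list of ["drink", "rink"] should leave "drinking" alone and never nest replacements inside "drink".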