how to remove the same words in the paragraph

Tim Chase python.list at tim.thechases.com
Mon Nov 9 07:13:30 EST 2009


> I think simple regex may come handy,
> 
>   p=re.compile(r'(.+) .*\1')    #note the space
>   s=p.search("python and i love python")
>   s.groups()
>   (' python',)
> 
> But that matches only one double word. Someone else could shed light here
> on extracting all the double words. Then they can be removed from the
> original paragraph.

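This has multiple problems. The greedy (.+) is not limited to a
single word, and a search reports only one match even when
several words repeat: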

 >>> p = re.compile(r'(.+) .*\1')
 >>> s = p.search("python one two one two python")
 >>> s.groups()
('python',)
 >>> s = p.search("python one two one two python one")
 >>> s.groups() # guess what happened to the 2nd "one"...
('python one',)

And even once you have the list of candidate duplicates (perhaps
by changing the regexp to r'\b(\w+)\b.*?\1'), you still have to
emit the first instance of each word while dropping the
subsequent instances.
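
One way to handle the "first instance but not subsequent
instances" part is to stop hunting for duplicates with a
backreference and instead remember which words have already been
emitted. The following is only a minimal sketch under a few
assumptions of mine, not something from the original question: the
function name remove_repeated_words is made up, a "word" is
whatever \w+ matches, and comparison is case-insensitive.

```python
import re

def remove_repeated_words(text):
    """Keep the first occurrence of each word and drop later ones."""
    seen = set()

    def keep_first(match):
        word = match.group(0)
        key = word.lower()   # assumption: "Python" and "python" are the same word
        if key in seen:
            return ""        # later instance: emit nothing
        seen.add(key)
        return word          # first instance: emit unchanged

    # Visit every word once, left to right, via a substitution callback.
    stripped = re.sub(r"\b\w+\b", keep_first, text)
    # Collapse the runs of spaces left behind by the removed words.
    return re.sub(r" {2,}", " ", stripped).strip()

print(remove_repeated_words("python one two one two python one"))
```

On the test string from above this keeps "python one two" and
drops the rest. Punctuation is left where it was, so a removed
word can leave a stray comma behind; whether that matters depends
on the paragraph.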

-tkc






