[Tutor] wierd replace problem
steve at pearwood.info
Tue Sep 14 01:39:29 CEST 2010
On Tue, 14 Sep 2010 09:08:24 am Joel Goldstick wrote:
> On Mon, Sep 13, 2010 at 6:41 PM, Steven D'Aprano
> <steve at pearwood.info> wrote:
> > On Tue, 14 Sep 2010 04:18:36 am Joel Goldstick wrote:
> > > How about using str.split() to put words in a list, then run
> > > strip() over each word with the required characters to be removed
> > > ('`")
> > Doesn't work. strip() only removes characters at the beginning and
> > end of the word, not in the middle:
> Exactly, you first split the words into a list of words, then strip
> each word
Of course, if you don't want to remove ALL punctuation marks, but only
those at the beginning and end of words, then strip() is a reasonable
approach. But if the aim is to strip out all punctuation, no matter
where, then it can't work.
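To make the difference concrete, here is a minimal sketch in Python 3 syntax. The helper names strip_ends and remove_everywhere and the PUNCT constant are made up for illustration; they are not from the original poster's code.

```python
# Characters under discussion: apostrophe, backtick, double quote.
PUNCT = "'`\""

def strip_ends(text, chars=PUNCT):
    """Split into words, then strip the characters from each word's ends only."""
    return " ".join(word.strip(chars) for word in text.split())

def remove_everywhere(text, chars=PUNCT):
    """Delete the characters wherever they appear, including mid-word."""
    return text.translate(str.maketrans("", "", chars))

sample = "'don`t\" `stop'"
print(strip_ends(sample))         # the backtick inside the first word survives
print(remove_everywhere(sample))  # every listed character is gone
```

The split-then-strip version leaves the backtick in the middle of the first word alone, which is exactly the behaviour being argued about.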
Since the aim is to count words, a better approach might be a hybrid --
remove all punctuation marks like commas, full stops, etc. no matter
where they appear, keep internal apostrophes so that words like "can't"
are different from "cant", but remove external ones. Although that
loses information in the case of (e.g.) dialect speech:
"'e said 'e were going to kill the lady, Mister Holmes!"
cried the lad excitedly.
You probably want to count the word as 'e rather than just e.
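A rough sketch of that hybrid, assuming Python 3. The function name hybrid_words is hypothetical, and lower-casing the text and leaving hyphens untouched are my own simplifying assumptions rather than part of the description above.

```python
import string

# Assumption: apostrophes and hyphens are left alone at this stage;
# every other ASCII punctuation mark is deleted wherever it appears.
KEEP = "'-"
DELETE = "".join(c for c in string.punctuation if c not in KEEP)
_TABLE = str.maketrans("", "", DELETE)

def hybrid_words(text):
    """Return words with internal apostrophes kept and external ones stripped."""
    words = []
    for raw in text.translate(_TABLE).lower().split():
        word = raw.strip("'")  # removes apostrophes at the ends only
        if word:
            words.append(word)
    return words

print(hybrid_words("I can't, he said. 'Cant' is different!"))
print(hybrid_words("'e said 'e were going to kill the lady"))
```

The second call shows the information loss described above: the dialect word 'e comes out as plain e.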
And hyphenation is tricky too. Lone hyphens - like these - should be
deleted. But double-dashes--like these--are word separators, so need to
be replaced by a space. Otherwise, single hyphens should be kept. If a
word begins or ends with a hyphen, it should be joined up with the
previous or next word. But then it gets more complicated, because you
don't know whether to keep the hyphen after joining or not.
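The first three rules (double-dashes, lone hyphens, internal hyphens) could be sketched with regular expressions like this. The function name normalise_hyphens is made up, and the sketch deliberately leaves words that begin or end with a hyphen untouched, since joining them is the hard part.

```python
import re

def normalise_hyphens(line):
    """Apply the simple hyphen rules to one line of text."""
    # Two or more hyphens in a row act as a word separator.
    line = re.sub(r"-{2,}", " ", line)
    # A hyphen with whitespace (or the line boundary) on both sides is dropped.
    line = re.sub(r"(?<!\S)-(?!\S)", " ", line)
    # Single hyphens attached to a word are kept; collapse leftover whitespace.
    return " ".join(line.split())

print(normalise_hyphens("A lone hyphen - like these - should be deleted"))
print(normalise_hyphens("double-dashes--like these--are word separators"))
```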
E.g. if the line ends with:
blah blah blah blah some-
thing blah blah blah.
should the joined up word become the compound word "some-thing" or the
regular word "something"? In general, there's no way to be sure,
although you can make a good guess by looking it up in a dictionary and
assuming that regular words should be preferred to compound words. But
that will fail if the word has changed over time, such as "cooperate",
which until very recently used to be written "co-operate", and before
that as "coöperate".
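The dictionary guess might look something like the following sketch. The function join_line_break and the tiny known_words set are hypothetical stand-ins; a real word list would have to come from somewhere else, and the guess is only as good as that list.

```python
def join_line_break(first_half, second_half, known_words):
    """Join a word split across lines, e.g. 'some-' and 'thing'.

    Prefer the regular (unhyphenated) word if it is in known_words,
    otherwise fall back to the compound form with the hyphen kept.
    """
    stem = first_half.rstrip("-")
    plain = stem + second_half
    if plain.lower() in known_words:
        return plain
    return stem + "-" + second_half

# Assumption: a toy dictionary, purely for illustration.
known_words = {"something", "cooperate"}

print(join_line_break("some-", "thing", known_words))
print(join_line_break("well-", "known", known_words))
print(join_line_break("co-", "operate", known_words))
```

The last call illustrates the failure mode: with a modern word list, an older text's "co-operate" is silently turned into "cooperate".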