[Tutor] Cheating? No, not me :-)
Nicole Seitz
nicole.seitz@urz.uni-hd.de
Wed, 17 Apr 2002 19:54:15 +0200
On Friday, 5 April 2002 23:38, dman wrote:
> On Fri, Apr 05, 2002 at 10:42:58PM +0200, Nicole Seitz wrote:
> | On Sunday, 24 March 2002 23:01, dman wrote:
> | > | For people who are interested: here's a similar problem: "Given a
> | > | text file and an integer K, you are to print the K most common words
> | > | in the file (and the number of their occurrences) in decreasing
> | > | frequency."
> | >
> | > <cheater's hint>
> | > Search the tutor archives. With only a couple minor modifications the
> | > answer is already there.
> | > </cheater's hint>
> |
> | Any hint WHERE in the archives I could find the answer? Month? Year?
>
> I don't remember, but try googling for :
> python tutor word count slow perl
>
> which yields this as the second result :
> http://mail.python.org/pipermail/tutor/2001-February/003403.html
>
> apparently it was February 1, 2001.
>
> Enjoy :-).
Thanks!
My program now runs almost perfectly, and I solved the problem of how to
print the K most common words. Maybe there's an easier way to determine
the most common words; I don't know.
Here's the little function that deals with the most common words. What do
you think of it?
Note: occ is the dictionary where I store the words and their occurrences,
e.g.
occ = {"hello": 3, "you": 123, "fool": 23}
------------------------------------------------------------------------
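def MostCommonWords(occ, K):
    # Group the words by count: by_count maps a count to a list of words
    by_count = {}
    for word in occ.keys():
        count = occ[word]
        if count in by_count:
            by_count[count].append(word)
        else:
            by_count[count] = [word]
    # Sort the distinct counts in decreasing order
    counts = list(by_count.keys())
    counts.sort()
    counts.reverse()
    print("Most common word(s): ")
    # Prints the K highest counts; words with the same count share a line.
    # min() avoids an IndexError when K exceeds the number of distinct counts.
    for i in range(min(int(K), len(counts))):
        words = ["%-8s" % word for word in by_count[counts[i]]]
        print("\t".join(words) + "\t (occurrences: %2i) " % counts[i])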
------------------------------------------------------------------------
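On the "easier way": one possibly simpler variant would be to sort the (word, count) pairs directly instead of grouping them by count first. This is only a rough sketch, written for a current Python 3 interpreter; the name most_common_words is made up for illustration, and unlike the function above it returns exactly K words rather than the K highest counts.

```python
def most_common_words(occ, k):
    """Return up to k (word, count) pairs, highest count first."""
    # Sort by descending count; ties are broken alphabetically so the
    # result does not depend on dictionary ordering.
    pairs = sorted(occ.items(), key=lambda item: (-item[1], item[0]))
    return pairs[:k]


# Same sample dictionary as in the note above.
occ = {"hello": 3, "you": 123, "fool": 23}
for word, count in most_common_words(occ, 2):
    print("%-8s\t (occurrences: %2i)" % (word, count))
```

Slicing with pairs[:k] is safe even when k is larger than the number of words, so no extra bounds check is needed.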
Last but not least, I've got some questions on regexes and other stuff
which you use in your script.
# remove leading and trailing whitespace
line = string.strip(line)
Why is that needed?
# split the string into a list of words
# a word is delimited by whitespace or punctuation
for word in re.split(
        "[" + string.whitespace + string.punctuation + "]+",
        line):
I don't understand this regex. Could you explain it, please? I guess
string.punctuation is [.,;:?!]. But what's the meaning of
"[" + bla + bla + "]"?
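To make the question concrete, here is a small sketch that builds the same kind of pattern and splits one line with it. The sample line is made up, and the re.escape call is my own addition (not in the script) so that characters such as "]" or "\" inside string.punctuation are taken literally; it is written for Python 3.

```python
import re
import string

# The + operators simply concatenate strings, so the pattern ends up as
# one character class followed by "+": "[" <all delimiters> "]+".
# re.escape is an assumption added here, not part of the original script.
pattern = "[" + re.escape(string.whitespace + string.punctuation) + "]+"

line = "Hello, world! How are you?"  # made-up sample line
words = re.split(pattern, line)
print(words)  # ['Hello', 'world', 'How', 'are', 'you', '']
```

Note the empty string at the end of the result: the trailing "?" is a delimiter with nothing after it.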
# check to make sure the string is considered a word
if re.match("^[" + string.lowercase + "]+$", word):
Is it necessary to do this? Can't I be sure that the string is considered
a word?
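As a way of probing this, a minimal sketch that runs the same kind of check over a few made-up pieces, including an empty string like the one a split can leave behind. It assumes Python 3, where string.ascii_lowercase takes the place of string.lowercase; the sample list is purely illustrative.

```python
import re
import string

# string.ascii_lowercase is the Python 3 name; the script uses
# string.lowercase. The pattern accepts one or more lowercase letters only.
is_word = re.compile("^[" + string.ascii_lowercase + "]+$")

pieces = ["hello", "", "42", "Hello", "it's"]  # made-up sample pieces
kept = [p for p in pieces if is_word.match(p)]
print(kept)  # ['hello']
```

So at least empty strings, numbers, and pieces containing uppercase letters or apostrophes would be filtered out by such a check.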
Many thanks for your help.
Nicole
>
> -D