Smart text parsing
Hans Nowak
hans at zephyrfalcon.org
Thu Feb 5 23:17:39 EST 2004
Mathias Mamsch wrote:
> I have a text of about 1 million words in which I want to count the words
> and put them, sorted, into a list
> like " list = [(most-common-word,1001),(2nd-word,986), ...] "
>
> I think about 10% of the words in the text (about 100.000) are different.
>
> I am wondering if you can give me something faster than my approach:
> My first straightforward approach was:
> ----
> s = "Hello this is my 1 million word text"
>
> s2 = s.split()
> dict = {}
> for i in s2: # the loop needs 10s
>     if dict.has_key(i):
>         dict[i] += 1
>     else:
>         dict[i] = 1
> list = dict.items()
> # this is slow:
> list.sort(lambda x,y: 2*(x[1] < y[1])-1)
> ----
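Passing a comparison function to sort slows things down a lot, because sort
then has to call that Python function for every comparison it makes. Try
something like this instead, which swaps each pair to (frequency, word) so the
plain built-in sort can be used: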
parts = "Hello this is my 1 million word text".split()
d = {}  # word -> number of occurrences
for part in parts:
    if part in d:
        d[part] += 1
    else:
        d[part] = 1
lst = d.items()
lst = [(t[1], t[0]) for t in lst]  # (frequency, string)
lst.sort()     # sort as usual
lst.reverse()  # reverse, so highest numbers are first
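If you happen to be on a Python that ships collections.Counter (2.7 or 3.x, which is an assumption about your setup), the same count-then-sort idea can be sketched more compactly as below. The function name count_words and the sample sentence are made up for illustration; this is only a rough sketch, not a tuned solution.

```python
from collections import Counter


def count_words(text):
    """Return a list of (word, count) pairs, most frequent word first."""
    # Counter does the dictionary bookkeeping of the loop above.
    counts = Counter(text.split())
    # most_common() returns the pairs sorted by count, highest first.
    return counts.most_common()


def count_words_with_key(text):
    """Same result shape, but sorting explicitly with a key function."""
    counts = Counter(text.split())
    # A key function is called once per item, not once per comparison,
    # so it avoids the cost of a comparison function passed to sort.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    print(count_words(sample)[:2])
```

Note that the pairs come out as (word, count) here, matching the list layout asked for in the question, rather than the (frequency, string) order used for the plain sort above.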
HTH,
--
Hans (hans at zephyrfalcon.org)
http://zephyrfalcon.org/