referencing a subhash for generalized ngram counting
braver
deliverable at gmail.com
Tue Nov 13 12:23:09 EST 2007
Here's a working version of the ngram counter with nested dict, wonder
how it can be improved!
lines = ["abra ca dabra",
"abra ca shvabra",
"abra movich roman",
"abra ca dabra",
"a bra cadadra"]
ngrams = [x.split() for x in lines]
N = 3
N1 = N-1
orig = {}
for ngram in ngrams:
h = orig
# iterating over i, not word, to notice the last i
for i in range(N):
word = ngram[i]
if word not in h:
if i < N1: # (*)
h[word] = {}
else:
h[word] = 0
if i < N1:
h = h[word]
print i, h
h[word] += 1
print orig
-- e.g., perhaps we could do short-circuit vivification to the end in
(*)?
Cheers,
Alexy
More information about the Python-list
mailing list