referencing a subhash for generalized ngram counting

braver deliverable at
Tue Nov 13 17:02:08 CET 2007

Greetings: I wonder how one uses single-name variables to refer
to nested subhashes (subdictionaries).  Here's an example:

In [41]: orig = { 'abra':{'foo':7, 'bar':9}, 'ca':{}, 'dabra':{'baz':4} }

In [42]: orig
Out[42]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

In [43]: h = orig['ca']

In [44]: h = { 'adanac':69 }

In [45]: h
Out[45]: {'adanac': 69}

In [46]: orig
Out[46]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

I want to change orig['ca'], which is determined somewhere else in a
program's logic, where subhashes are referred to as h -- e.g., for x
in orig: ... .  But assigning to h doesn't change orig.
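For comparison, here is a minimal sketch of the difference between
rebinding the name h and mutating the dictionary it refers to.  It
reuses orig from the session above and relies only on built-in dict
operations; it is an illustration, not a full answer.

```python
orig = {'abra': {'foo': 7, 'bar': 9}, 'ca': {}, 'dabra': {'baz': 4}}

h = orig['ca']          # h and orig['ca'] now name the same dict object
h['adanac'] = 69        # mutating that object is visible through orig
print(orig['ca'])       # {'adanac': 69}

h = {'other': 1}        # rebinding h points it at a new dict; orig is untouched
print(orig['ca'])       # {'adanac': 69}

orig['ca'] = h          # replacing the subhash wholesale needs the parent and the key
print(orig['ca'])       # {'other': 1}
```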

The real-life motivation for this is n-gram counting.  Say you want to
maintain a hash for bigrams.  For each two subsequent words a, b in a
text, you do

bigram_count[a][b] += 1

-- notice you do want to have nested subhashes as it decreases memory
usage dramatically.
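A minimal sketch of the bigram case, assuming the text is already
split into a list of words (the sample list below is made up for
illustration).  It uses collections.defaultdict so that missing
subhashes and counts come into existence on first access; a plain
dict would raise KeyError on the first unseen word.

```python
from collections import defaultdict

# hypothetical sample text, already tokenized
words = ['a', 'rose', 'is', 'a', 'rose']

# outer dict maps first word -> inner dict; inner dict maps second word -> count
bigram_count = defaultdict(lambda: defaultdict(int))

# pair each word with its successor
for a, b in zip(words, words[1:]):
    bigram_count[a][b] += 1

print(bigram_count['a']['rose'])   # 2
```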

In order to generalize this to N-grammity, you want to do something
like this:

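h = bigram_count
# iterating over i, not word, to notice the last i
for i in range(len(ngram)):
  word = ngram[i]
  if word not in h:
    if i < N - 1:      # N == len(ngram); not yet at the lowest level
      h[word] = {}
    else:              # lowest level holds the count
      h[word] = 0
  h = h[word]
h += 1                 # h is a plain int here, so this only rebinds the name h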

-- doesn't work and is just a sketch; also, if at any level we get an
empty subhash, we can short-circuit vivify all remaining levels and
add 1 in the lowest, count, level.

Yet since names are not exactly references, something else is needed
for generalized ngram multi-level counting hash -- what?
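
One possible approach, offered as a sketch rather than a definitive
answer: walk down only to the parent of the lowest level, then update
the count through that parent with an item assignment, which mutates
the dict instead of rebinding a name.  The function name count_ngram
and the sample data are hypothetical.

```python
def count_ngram(counts, ngram):
    """Add 1 to the count stored under the nested keys ngram[0], ..., ngram[-1]."""
    h = counts
    # descend through every word except the last, creating subhashes as needed
    for word in ngram[:-1]:
        h = h.setdefault(word, {})
    # h is now the dict that holds the counts; update it in place
    last = ngram[-1]
    h[last] = h.get(last, 0) + 1

trigram_count = {}
words = ['to', 'be', 'or', 'not', 'to', 'be', 'or']
N = 3
# slide a window of N words over the text
for i in range(len(words) - N + 1):
    count_ngram(trigram_count, words[i:i + N])

print(trigram_count['to']['be']['or'])   # 2
```

Because setdefault creates a missing level and returns it in one step,
the same loop works for any N without special-casing the first sighting
of a prefix.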

