[Tutor] Bigrams and nested dictionaries

Michael Broe mbroe at columbus.rr.com
Wed Mar 29 02:41:29 CEST 2006

I'm playing with the whole idea of creating bigram (digram?)  
frequencies for text analysis and cryptographic and entropy analysis  
etc (this is as much an exercise in learning Python and programming  
as anything else, I realise everything has already been done  
somewhere somehow :) Though I *am* aiming to run this over unicoded  
phonetic representations of natural languages, which doesn't seem to  
be that common.

I implemented a single character count using a dictionary, and it  
worked really well:

Character_Count = {}
for character in text:
	Character_Count[character] = Character_Count.get(character, 0) + 1
And so for bigrams I was thinking of creating a data-structure that  
was a nested dictionary, where the key was the character in position  
1, and the value was a sub-dictionary that gave a count of each  
character in position 2.

So the dictionary might begin something like:

{'a': {'a':1, 'b':8, 'c':10,...}, 'b' : {'a':23, 'b':0, 'c': 
1,...},..., 'z' : {'a':11, 'b':0, 'c':0,...}}

The count of the character in position one could be retrieved by  
summing over the characters in position 2, so the bigram and single- 
character counts can be done in one pass.

I don't want anyone to tell me how to do this :) I'd just like to get  
a feel for:

(i) is this idea of a nested dictionary a good/efficient/tractable  

(ii) is there a straightforward path to the goal of constructing this  
thing in a single pass over the input string (yes or no will suffice  
for the moment!), or is there a can of worms lurking in my future.

I'd rather have less information than more, but there is something  
about grabbing characters two-at-a-time, but moving forward one-at-a- 
time, and sticking the count down a level inside the dictionary  
that's just a little baffling at the moment. On the other hand it's  
like moving a two-character window across the text and recording  
information as you go, which seems like it should be a good thing to  
do computationally. I like being baffled, I'd just kinda like to know  
if this is a good problem to work on to gain enlightenment, or a bad  
one and I should think about a totally different path with a less  
jazzy data structure.

More information about the Tutor mailing list