[Tutor] Bigrams and nested dictionaries

Kent Johnson kent37 at tds.net
Wed Mar 29 05:30:53 CEST 2006


Michael Broe wrote:
> I'm playing with the whole idea of creating bigram (digram?)  
> frequencies for text analysis and cryptographic and entropy analysis  
> etc (this is as much an exercise in learning Python and programming  
> as anything else, I realise everything has already been done  
> somewhere somehow :) Though I *am* aiming to run this over unicoded  
> phonetic representations of natural languages, which doesn't seem to  
> be that common.
> 
> I implemented a single character count using a dictionary, and it  
> worked really well:
> 
> Character_Count = {}
> for character in text:
> 	Character_Count[character] = Character_Count.get(character, 0) + 1
> 	
> And so for bigrams I was thinking of creating a data-structure that  
> was a nested dictionary, where the key was the character in position  
> 1, and the value was a sub-dictionary that gave a count of each  
> character in position 2.
> 
> So the dictionary might begin something like:
> 
> {'a': {'a':1, 'b':8, 'c':10,...}, 'b' : {'a':23, 'b':0, 'c': 
> 1,...},..., 'z' : {'a':11, 'b':0, 'c':0,...}}
> 
> The count of the character in position one could be retrieved by  
> summing over the characters in position 2, so the bigram and single- 
> character counts can be done in one pass.
> 
> I don't want anyone to tell me how to do this :) I'd just like to get  
> a feel for:
> 
> (i) is this idea of a nested dictionary a good/efficient/tractable  
> data-structure?

It can work. An alternative is a dictionary keyed by the two-letter 
pairs. I don't know which will work better.
> 
> (ii) is there a straightforward path to the goal of constructing this  
> thing in a single pass over the input string (yes or no will suffice  
> for the moment!), or is there a can of worms lurking in my future.

I think so.
> 
> I'd rather have less information than more, but there is something  
> about grabbing characters two-at-a-time, but moving forward one-at-a- 
> time, and sticking the count down a level inside the dictionary  
> that's just a little baffling at the moment. 

Encapsulating this in a generator is probably a good plan. If you want 
help look at the online cookbook. The generator can yield two-character 
strings or tuples of characters, whichever is more convenient. When you 
have a generator for the pairs the rest of the program will be pretty 
straightforward.

Kent



More information about the Tutor mailing list