[Tutor] Bigrams and nested dictionaries
Kent Johnson
kent37 at tds.net
Wed Mar 29 05:30:53 CEST 2006
Michael Broe wrote:
> I'm playing with the whole idea of creating bigram (digram?)
> frequencies for text analysis and cryptographic and entropy analysis
> etc (this is as much an exercise in learning Python and programming
> as anything else, I realise everything has already been done
> somewhere somehow :) Though I *am* aiming to run this over unicoded
> phonetic representations of natural languages, which doesn't seem to
> be that common.
>
> I implemented a single character count using a dictionary, and it
> worked really well:
>
> Character_Count = {}
> for character in text:
> Character_Count[character] = Character_Count.get(character, 0) + 1
>
> And so for bigrams I was thinking of creating a data-structure that
> was a nested dictionary, where the key was the character in position
> 1, and the value was a sub-dictionary that gave a count of each
> character in position 2.
>
> So the dictionary might begin something like:
>
> {'a': {'a':1, 'b':8, 'c':10,...}, 'b' : {'a':23, 'b':0, 'c':
> 1,...},..., 'z' : {'a':11, 'b':0, 'c':0,...}}
>
> The count of the character in position one could be retrieved by
> summing over the characters in position 2, so the bigram and single-
> character counts can be done in one pass.
>
> I don't want anyone to tell me how to do this :) I'd just like to get
> a feel for:
>
> (i) is this idea of a nested dictionary a good/efficient/tractable
> data-structure?
It can work. An alternative is a dictionary keyed by the two-letter
pairs. I don't know which will work better.
>
> (ii) is there a straightforward path to the goal of constructing this
> thing in a single pass over the input string (yes or no will suffice
> for the moment!), or is there a can of worms lurking in my future.
I think so.
>
> I'd rather have less information than more, but there is something
> about grabbing characters two-at-a-time, but moving forward one-at-a-
> time, and sticking the count down a level inside the dictionary
> that's just a little baffling at the moment.
Encapsulating this in a generator is probably a good plan. If you want
help look at the online cookbook. The generator can yield two-character
strings or tuples of characters, whichever is more convenient. When you
have a generator for the pairs the rest of the program will be pretty
straightforward.
Kent
More information about the Tutor
mailing list