[Tutor] Bigrams and nested dictionaries
Michael Broe
mbroe at columbus.rr.com
Wed Mar 29 02:41:29 CEST 2006
I'm playing with the whole idea of creating bigram (digram?)
frequencies for text analysis and cryptographic and entropy analysis
etc (this is as much an exercise in learning Python and programming
as anything else, I realise everything has already been done
somewhere somehow :) Though I *am* aiming to run this over unicoded
phonetic representations of natural languages, which doesn't seem to
be that common.
I implemented a single character count using a dictionary, and it
worked really well:
Character_Count = {}
for character in text:
Character_Count[character] = Character_Count.get(character, 0) + 1
And so for bigrams I was thinking of creating a data-structure that
was a nested dictionary, where the key was the character in position
1, and the value was a sub-dictionary that gave a count of each
character in position 2.
So the dictionary might begin something like:
{'a': {'a':1, 'b':8, 'c':10,...}, 'b' : {'a':23, 'b':0, 'c':
1,...},..., 'z' : {'a':11, 'b':0, 'c':0,...}}
The count of the character in position one could be retrieved by
summing over the characters in position 2, so the bigram and single-
character counts can be done in one pass.
I don't want anyone to tell me how to do this :) I'd just like to get
a feel for:
(i) is this idea of a nested dictionary a good/efficient/tractable
data-structure?
(ii) is there a straightforward path to the goal of constructing this
thing in a single pass over the input string (yes or no will suffice
for the moment!), or is there a can of worms lurking in my future.
I'd rather have less information than more, but there is something
about grabbing characters two-at-a-time, but moving forward one-at-a-
time, and sticking the count down a level inside the dictionary
that's just a little baffling at the moment. On the other hand it's
like moving a two-character window across the text and recording
information as you go, which seems like it should be a good thing to
do computationally. I like being baffled, I'd just kinda like to know
if this is a good problem to work on to gain enlightenment, or a bad
one and I should think about a totally different path with a less
jazzy data structure.
More information about the Tutor
mailing list