[melbourne-pug] A list of strings vs a list of list of strings
James Briggs
jimmy.briggs at gmail.com
Thu Jul 19 04:26:21 CEST 2007
Why not try using a dictionary rather than two lists? It can use the
ngram as the key, and it will take up little more space in memory than
your two lists.
i.e.
{ 'abc': 2, 'def': 3, 'ghi': 1 }
freqs = {}
for ngram in ngrams:
    # lists can't be dictionary keys, so convert each ngram
    # (a list of words) to a tuple first
    key = tuple(ngram)
    # if it finds the ngram in the dictionary it increments it
    # by 1, otherwise sets it to 1
    freqs[key] = freqs.get(key, 0) + 1
pairs = freqs.items()   # the (ngram, freq) pairs
unique = freqs.keys()   # all unique ngrams
Whenever I find myself matching up indices across multiple lists, I
usually discover that what I really want is a dictionary, especially
as the keys are hashed and can be looked up very efficiently.
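
For what it's worth, here is a minimal end-to-end sketch of the idea,
assuming each n-gram starts life as a list of words, as in the message
quoted below. The function names (make_ngrams, count_ngrams) and the
sample clauses are made up for illustration. The two-list version is
slow because uniqngrams.index() scans the whole list for every n-gram,
whereas a dictionary lookup takes roughly constant time on average.

```python
def make_ngrams(words, n):
    """Return every run of n consecutive words as a tuple.

    Tuples are hashable, so they can be used as dictionary keys;
    lists cannot.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(clauses, n):
    """Count n-grams clause by clause, so none crosses a clause boundary."""
    freqs = {}
    for clause in clauses:
        for ngram in make_ngrams(clause.split(), n):
            freqs[ngram] = freqs.get(ngram, 0) + 1
    return freqs


# hypothetical sample data: 1 string = 1 clause
clauses = ["the cat sat", "the cat ran", "a dog sat"]
freqs = count_ngrams(clauses, 2)

# (ngram, freq) pairs, most frequent first
pairs = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
print(pairs[0])
```

On Python 2.7 or later, collections.Counter does the same counting in
one step, but it still needs hashable keys such as tuples.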
Good Luck.
James
On 7/19/07, Brianna Laugher <brianna.laugher at gmail.com> wrote:
> Hello Melbourne pythonistas,
>
> A quick question...
>
> I'm doing some text processing, creating and sorting n-grams (n-word
> phrases - a single word is a 1-gram) from text files. At the moment
> I'm doing something like this:
>
> - get value for n
> - read in files
> - determine clauses (n-grams shouldn't cross clause boundaries): list
> of strings, 1 string = 1 clause
> - determine words (split on whitespace): list of list of strings, 1
> list = 1 clause, 1 string = 1 word
> - create all n-grams (list splices): resulting in an n-gram being a
> list of n strings
> - determine frequencies
>
> this last bit is where I'm possibly running into trouble. If I just
> had a list of strings, I would do this:
>
> pairs = [ ( ngram, ngrams.count(ngram) ) for ngram in list( set( ngrams ) ) ]
>
> I can't use set() when my list items are more lists, apparently. So
> I'm trying this:
>
> uniqngrams = []
> freqs = []
> for ngram in ngrams:
>     try:
>         indice = uniqngrams.index(ngram)
>         freqs[indice] += 1
>     except ValueError:
>         uniqngrams.append(ngram)
>         freqs.append(1)
>
> pairs = zip(uniqngrams, freqs)
>
>
> It's not throwing errors, it's just too slow (the file I'm running it
> on has nearly 224,000 words).
>
> I have a suspicion that representing the ngrams as lists of strings
> rather than plain strings is causing some of the trouble. Just wondering
> if anyone had any thoughts about this.
>
> also, in general, which of these should be faster?
>
> 'foo' in 'very long string foo'
>
> 'foo' in ['very','long','string','foo']
>
> I thought the latter would be because it should make fewer checks
> (after checking 'f'!='v', it can jump to 'l'.) But maybe it's already
> optimised for string searching. shrug.
>
> thanks,
> Brianna