Dictionary size/speed limit?

David Bolen db3l at fitlinxx.com
Thu Jun 1 00:23:38 EDT 2000


"Michael Morrison" <borlak at home.com> writes:

> So, is using the regular Python dictionary just an insane idea? :)  Should I
> create something in C to do this for me, or are there modules to do this
> sort of thing?

It's certainly not insane, and I'd bet it'll do fine for your application.
Whether Python would serve for the final application depends somewhat on
the application and the machine, but as a really quick example, here's
something I just typed in interactively:

   Python 1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)] on win32
   Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
   >>> import time
   >>>
   >>> start = time.time()
   >>> dict = {}
   >>> file = open('words')
   >>> while 1:
   ...   line = file.readline()
   ...   if not line: break
   ...   dict[line] = line
   ...
   >>> load = time.time()
   >>> file.seek(0)
   >>> print len(dict.keys())
   109582
   >>> while 1:
   ...   line = file.readline()
   ...   if not line: break
   ...   value = dict[line]
   ...
   >>> value
   'zyzzyvas\012'
   >>> stop = time.time()
   >>> print start, load, stop
   959832750.636 959832779.768 959832820.086
   >>> print load-start, stop-load, stop-start
   29.1319999695 40.317999959 69.4499999285

This loaded a Python dictionary from a words file containing about
100,000 English words.  For simplicity I made each dictionary value the
same as its key (and didn't bother stripping the trailing newline).  I
then re-read the same file, doing a lookup for each key.  Note that the
timing includes the time to type the commands in interactively (which
was probably half to two thirds of each period, even though I type
quickly).
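A scripted version of the same experiment avoids the interactive typing
time and gives cleaner numbers.  Here's a sketch; the word list is
synthetic stand-in data (with a real words file you'd read the lines
from disk instead):

```python
import time

# Synthetic stand-in for the 'words' file (hypothetical data; with a
# real file, substitute open('words').readlines()).
words = ['word%d\n' % i for i in range(100000)]

start = time.time()
d = {}
for line in words:
    d[line] = line          # value is the same as the key, as above
load = time.time()

for line in words:
    value = d[line]         # one lookup per word
stop = time.time()

print(len(d))                         # number of entries
print(load - start, stop - load)      # load time vs. lookup time
```

On any reasonably recent machine, both phases should finish in a small
fraction of the interactive timings quoted above, since no typing is
involved.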

The dictionary values themselves were about 1MB of pure character data,
so I've got 2MB of raw data in the in-memory dictionary.  To be fair, my
Python process (under Windows NT) grew to about 19MB, so whether that's
a problem depends on your environment.  I may also have inflated that
with my len() check, since dict.keys() had to create a list holding the
same 1MB of key data, and I'm not positive when that reference gets
cleaned up.  And of course, this is with 100,000 entries.

At a minimum, it should be easy to use Python for the initial
development, letting you focus on overall design and algorithms rather
than data management.  If the dictionary proves to be a real bottleneck,
it can always be pushed off to a C implementation later.

--
-- David
-- 
/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/


