constructing and using large lexicon in a program

Peter Otten __peter__ at web.de
Tue Aug 3 04:00:26 EDT 2010


Majdi Sawalha wrote:

> I am developing a morphological analyzer that depends on a large lexicon.
> I wrote a Lexicon class that reads a text file and builds a dictionary
> of the lexicon entries.
> Another class uses the Lexicon class to check whether a word is found
> in the lexicon. The problem is that this takes a long time: each time an
> object of that class is created, it needs to call the lexicon many
> times, and every time the lexicon is called it is reconstructed.
> Is there any way to construct the lexicon once during the execution of
> the program, so that the other modules search the already
> constructed lexicon?

Normally you just structure your application accordingly. Load the dictionary 
once and then pass it around explicitly:

import loader
import user_one
import user_two

filename = ...
large_dict = loader.load(filename)

user_one.use_dict(large_dict)
user_two.use_dict(large_dict)
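
Applied to the classes described in the question, the same idea might look 
roughly like the sketch below. The names Lexicon, Analyzer and is_known are 
made up for illustration, and the "word value" line format is an assumption; 
the point is only that the lexicon is built once and every analyzer object 
receives a reference to it instead of constructing its own:

```python
class Lexicon(object):
    """Hypothetical lexicon; assumes every line looks like 'word value'."""

    def __init__(self, lines):
        # Built exactly once, from any iterable of text lines (e.g. a file).
        self.entries = dict(
            line.strip().split(None, 1) for line in lines if line.strip()
        )

    def __contains__(self, word):
        return word in self.entries


class Analyzer(object):
    """Hypothetical analyzer that receives a ready-made lexicon."""

    def __init__(self, lexicon):
        # Only a reference is stored; nothing is re-read or re-parsed here.
        self.lexicon = lexicon

    def is_known(self, word):
        return word in self.lexicon


# Build the lexicon once ...
lexicon = Lexicon(["alpha value for alpha", "beta BETA", "gamma GAMMA"])

# ... and hand the same object to as many analyzers as needed.
first = Analyzer(lexicon)
second = Analyzer(lexicon)
```

Creating another Analyzer is then cheap, because all instances share the one 
Lexicon object.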

You may also try a caching scheme to avoid parsing the text file unless it has 
changed. Here's a simple example:

$ cat cachedemo.py
import cPickle as pickle
import os


def load_from_text(filename):
    # replace with your code
    with open(filename) as instream:
        return dict(line.strip().split(None, 1) for line in instream)

def load(filename, cached=None):
    if cached is None:
        cached = filename + ".pickle"
    if os.path.exists(cached) and os.path.getmtime(filename) <= os.path.getmtime(cached):
        print "using pickle"
        with open(cached, "rb") as instream:
            return pickle.load(instream)
    else:
        print "loading from text"
        d = load_from_text(filename)
        with open(cached, "wb") as out:
            pickle.dump(d, out, pickle.HIGHEST_PROTOCOL)
        return d


if __name__ == "__main__":
    if not os.path.exists("tmp.txt"):
        print "creating example data"
        with open("tmp.txt", "w") as out:
            out.write("""\
alpha value for alpha
beta BETA
gamma GAMMA
""")
    print load("tmp.txt")

$ python cachedemo.py
creating example data
loading from text
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA'}
$ python cachedemo.py
using pickle
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA'}
$ echo 'delta modified text' >> tmp.txt
$ python cachedemo.py
loading from text
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA', 'delta': 'modified text'}
$ python cachedemo.py
using pickle
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA', 'delta': 'modified text'}

Peter




