[Tutor] How to parse large files

Wed Oct 28 05:02:17 EDT 2015

Danny Yoo wrote:

> There are several out there; one that comes standard in Python 3 is
> the "dbm" module:
> 
> https://docs.python.org/3.5/library/dbm.html
> 
> Instead of doing:
> 
> diz5 = {}
> ...
> 
> we'd do something like this:
> 
> with dbm.open('diz5', 'c') as diz5:
> ...
> 
> And otherwise, your code will look very similar!  This dictionary-like
> object will store its data on disk, rather than in-memory, so that it
> can grow fairly large.  The other nice thing is that you can do the
> dbm creation up front.  If you run your program again, you might add a
> bit of logic to *reuse* the dbm that's already on disk, so that you
> don't have to process your input files all over again.

dbm stores both keys and values as byte strings, so switching to it 
requires a few changes to the code. Fortunately there's a wrapper around 
dbm called shelve that uses string keys and allows any object that can be 
pickled as a value:

https://docs.python.org/dev/library/shelve.html
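
To make the byte-string point concrete, here is a minimal sketch. The 
helper name roundtrip() and the file name used with it are made up for 
illustration; only dbm.open() and the mapping interface come from the 
standard library.

```python
import dbm


def roundtrip(path, key, value):
    """Store value under key in a dbm file at path, then read it back."""
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.open(path, "c") as db:
        db[key] = value  # a str is encoded to bytes on the way in
        return db[key]   # what comes back is always bytes
```

Calling roundtrip("demo", "spam", "eggs") hands back b"eggs" rather than 
"eggs", so every lookup would need a decode step, and a value such as a 
set is not accepted at all. That is the gap shelve fills.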

With shelve your code may become

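import shelve

# The shelf lives on disk; each value is stored as a pickled set.
with shelve.open("diz5") as db:
    with open("tmp1.txt") as instream:
        for line in instream:
            # Expect exactly one tab separating key and value.
            assert line.count("\t") == 1
            key, _tab, value = line.rstrip("\n").partition("\t")
            values = db.get(key) or set()
            values.add(value)
            # Assign back so the updated set is written to disk.
            db[key] = values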
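
Once the shelf has been built, a later run can read from it instead of 
parsing the input again, which is the reuse Danny mentions. A minimal 
sketch, assuming the shelf already exists on disk; lookup() is a 
hypothetical helper name, not part of shelve:

```python
import shelve


def lookup(shelf_path, key):
    """Return the set stored under key, or an empty set if key is absent."""
    # flag="r" opens an existing shelf read-only and fails if there is none.
    with shelve.open(shelf_path, flag="r") as db:
        return db.get(key, set())
```

Opening read-only is a design choice here: it guards against modifying 
the shelf by accident while you are only querying it.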

Note that while shelve has a setdefault() method, it will only work as 
expected when you set writeback=True. That option keeps every entry you 
access cached in memory until the shelf is synced or closed, so it may 
in turn require arbitrary amounts of memory.
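
The setdefault() caveat can be seen in a small experiment. This is only 
a sketch; the function name add_with_setdefault() and the key and value 
strings are invented for the demonstration.

```python
import shelve


def add_with_setdefault(path, writeback):
    """Add one value via setdefault(), then report what ended up on disk."""
    with shelve.open(path, writeback=writeback) as db:
        # Without writeback, the empty set is pickled to disk right away and
        # the add() below only changes the in-memory object.
        db.setdefault("key", set()).add("value")
    # Reopen the shelf to see what was actually stored.
    with shelve.open(path) as db:
        return db["key"]
```

With writeback=False the function returns an empty set, because the 
added value never reaches the disk. With writeback=True it returns 
{"value"}, since the cached entries are written back when the shelf is 
closed.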