[Tutor] How to parse large files
Peter Otten
__peter__ at web.de
Wed Oct 28 05:02:17 EDT 2015
Danny Yoo wrote:
> There are several out there; one that comes standard in Python 3 is
> the "dbm" module:
>
> https://docs.python.org/3.5/library/dbm.html
>
> Instead of doing:
>
> diz5 = {}
> ...
>
> we'd do something like this:
>
> with dbm.open('diz5', 'c') as diz5:
> ...
>
> And otherwise, your code will look very similar! This dictionary-like
> object will store its data on disk, rather than in-memory, so that it
> can grow fairly large. The other nice thing is that you can do the
> dbm creation up front. If you run your program again, you might add a
> bit of logic to *reuse* the dbm that's already on disk, so that you
> don't have to process your input files all over again.
dbm stores both keys and values as byte strings, so a few changes are
needed. Fortunately, there's a wrapper around dbm called shelve that uses
string keys and accepts any object that can be pickled as a value:
https://docs.python.org/dev/library/shelve.html
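To illustrate why plain dbm needs those changes, here is a minimal sketch of what storing a set of strings would involve without shelve. The function name roundtrip() and the tab-joined encoding are my own assumptions for illustration, not part of the original code; the encoding only works because the values in this input never contain a tab.

```python
import dbm
import os
import tempfile


def roundtrip(path, key, values):
    """Store a set of strings under key in a dbm file and read it back.

    Hypothetical helper: the set is flattened to one tab-joined byte
    string by hand, because dbm only stores bytes.
    """
    with dbm.open(path, "c") as db:
        db[key.encode("utf-8")] = "\t".join(sorted(values)).encode("utf-8")
        raw = db[key.encode("utf-8")]  # what comes back is bytes
    return raw, set(raw.decode("utf-8").split("\t"))


with tempfile.TemporaryDirectory() as tmpdir:
    raw, decoded = roundtrip(os.path.join(tmpdir, "demo"), "gene1", {"a", "b"})
    print(type(raw).__name__, decoded)
```

shelve does the equivalent encoding and decoding for you via pickle, which is why the loop below can store sets directly.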
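With that, your code may become:

    import shelve

    with shelve.open("diz5") as db:
        with open("tmp1.txt") as instream:
            for line in instream:
                # every line must be exactly "key<TAB>value"
                assert line.count("\t") == 1
                key, _tab, value = line.rstrip("\n").partition("\t")
                # fetch the stored set (or start a new one), update it,
                # and write it back so the change reaches the disk
                values = db.get(key) or set()
                values.add(value)
                db[key] = values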
Note that while shelve has a setdefault() method, it only works as
expected when you set writeback=True. That in turn may require arbitrary
amounts of memory, because every entry you access is then cached in memory
until the shelf is synced or closed.
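The setdefault() caveat can be seen in a small experiment. This is only a sketch: the helper name add_with_setdefault() and the keys and values are made up for the demonstration.

```python
import os
import shelve
import tempfile


def add_with_setdefault(path, writeback):
    """Add two values to a set via setdefault(), then reopen and read it."""
    with shelve.open(path, writeback=writeback) as db:
        # Without writeback, setdefault() returns an in-memory object that
        # is no longer connected to the shelf, so add() changes nothing
        # on disk.
        db.setdefault("k", set()).add("v1")
        db.setdefault("k", set()).add("v2")
    with shelve.open(path) as db:
        return db["k"]


with tempfile.TemporaryDirectory() as tmpdir:
    print(add_with_setdefault(os.path.join(tmpdir, "plain"), False))
    print(add_with_setdefault(os.path.join(tmpdir, "cached"), True))
```

With writeback=False the stored set stays empty, while with writeback=True both values survive because the cached objects are written out when the shelf is closed. That is why the loop above assigns db[key] = values explicitly instead.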