[Tutor] How to parse large files

Danny Yoo dyoo at hashcollision.org
Tue Oct 27 22:32:06 EDT 2015


On Tue, Oct 27, 2015 at 2:32 PM, jarod_v6--- via Tutor <tutor at python.org> wrote:
> Hi!
> I want to read two files and create a simple dictionary.  Each input file contains more than 10000 rows
>
> diz5 = {}
> with open("tmp1.txt") as p:
>     for i in p:
>         lines = i.rstrip("\n").split("\t")
>         diz5.setdefault(lines[0],set()).add(lines[1])
>
> diz3 = {}
> with open("tmp2.txt") as p:
>     for i in p:
>         lines = i.rstrip("\n").split("\t")
>         diz3.setdefault(lines[0],set()).add(lines[1])


10000 rows is not a lot of data these days, since typical computer
memories have grown quite a bit.  I get the feeling your program
should be able to handle this all in-memory.
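
As an aside, the in-memory version can be written a touch more
compactly with collections.defaultdict; a minimal sketch, assuming
each line holds at least two tab-separated columns:

    from collections import defaultdict

    diz5 = defaultdict(set)
    with open("tmp1.txt") as p:
        for line in p:
            fields = line.rstrip("\n").split("\t")
            # map the first column to the set of values in the second
            diz5[fields[0]].add(fields[1])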

But let's assume, for the moment, that you do need to deal with a lot
of data, where you can't hold the whole thing in memory.  Ideally,
you'd like to have access to its contents in a key/value store,
because that feels most like a Python dict.  If that's the case, then
what you're looking for is an on-disk database.

There are several out there; one that comes standard in Python 3 is
the "dbm" module:

    https://docs.python.org/3.5/library/dbm.html
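
To give a feel for the basics, here's a quick sketch (the filename
'example' is just for illustration):

    import dbm

    with dbm.open('example', 'c') as db:  # 'c' creates the file if needed
        db['spam'] = 'eggs'               # keys and values are stored as bytes
        print(db['spam'])                 # prints b'eggs'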

Instead of doing:

    diz5 = {}
    ...

we'd do something like this:

    import dbm

    with dbm.open('diz5', 'c') as diz5:
        ...

And otherwise, your code will look quite similar.  One caveat: a dbm
object stores its keys and values as bytes, so you can't keep a
Python set as a value directly; you'd encode the collection of values
into a single string yourself.  This dictionary-like object stores
its data on disk, rather than in memory, so it can grow fairly large.
The other nice thing is that you can do the dbm creation up front: if
you run your program again, you might add a bit of logic to *reuse*
the dbm that's already on disk, so that you don't have to process
your input files all over again.
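
Putting it together, here's one rough sketch of your tmp1.txt loop on
top of dbm.  Since dbm values must be bytes or strings, this flattens
each set of values into a single tab-separated string; that encoding
is just one possible choice:

    import dbm

    with dbm.open('diz5', 'c') as diz5:
        with open('tmp1.txt') as p:
            for line in p:
                fields = line.rstrip('\n').split('\t')
                key, value = fields[0], fields[1]
                # dbm hands back bytes, so decode and rebuild the set by hand
                stored = diz5.get(key, b'').decode()
                values = set(stored.split('\t')) if stored else set()
                values.add(value)
                diz5[key] = '\t'.join(values)

And for the reuse idea: dbm.whichdb() can tell you whether a database
from a previous run is already on disk (build_diz5 below is a
hypothetical helper that wraps the loop above):

    import dbm

    if dbm.whichdb('diz5') is None:
        build_diz5()  # hypothetical: run the file-processing loop above
    with dbm.open('diz5', 'r') as diz5:
        ...  # look up keys without re-reading tmp1.txt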


Databases, too, have capacity limits, but you're unlikely to hit them
unless you're really doing something hefty.  And that's out of scope
for tutor at python.org.  :P

