[Tutor] A CSV field is a list of integers - how to read it as such?

Dave Angel davea at davea.name
Mon Mar 4 14:24:47 CET 2013


On 03/04/2013 01:48 AM, DoanVietTrungAtGmail wrote:
> Don, Dave - Thanks for your help!
>
> Don: Thanks! I've just browsed the AST documentation, much of it goes over
> my head, but the ast.literal_eval helper function works beautifully for me.
>
> Dave: Again, thanks! Also, you asked "More space efficient than what?" I
> meant .csv versus dict, list, and objects. Specifically, if I read a
> 10-million row .csv file into RAM, how is its RAM footprint compared to a
> list or dict containing 10M equivalent items, or to 10M equivalent class
> instances living in RAM.

Once a csv file has been read by a csv reader (such as DictReader), it's 
no longer a csv file.  The data in memory never exists as a copy of the 
file on disk.  The way you wrote the code, each row exists as a dict of 
strings, but more commonly, each row would exist as a list of strings.

The csv reader does not keep more than one row at a time, so if you want 
a big list to exist in memory at once, you'll be building it yourself, 
perhaps by calling append inside the loop instead of the print you're 
doing now.
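
Something like this (a rough sketch; 'data.csv' and the column name
'targets' are just placeholders for whatever your file actually uses):

import ast
import csv

rows = []                                 # every parsed row is kept here
with open('data.csv') as f:               # placeholder file name
    for row in csv.DictReader(f):
        # parse the field that holds a list of ints, e.g. "[1, 2, 3]"
        row['targets'] = ast.literal_eval(row['targets'])
        rows.append(row)                  # append instead of printing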

So the question is not how much RAM the csv data takes up, but how much 
RAM is used by whatever form you store it in.  There, you shouldn't worry 
about the overhead of the outer list, but about the overhead of however 
you store each individual row.  When a list overallocates, the unused 
slots each take up 4 or 8 bytes, as opposed to probably thousands of 
bytes for each row that is actually used.
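
A rough way to see the difference (sys.getsizeof reports only the shallow
size of each container, and the exact numbers vary with the Python version
and 32- versus 64-bit build):

import sys

print(sys.getsizeof([None] * 1000))       # outer list: roughly 8 bytes per slot
print(sys.getsizeof({'a': '1', 'b': '2', 'c': '3'}))   # one small dict: a few hundred bytes
print(sys.getsizeof(('1', '2', '3')))     # one small tuple: well under a hundred bytes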

> I've just tested and learned that a .csv file has
> very little overhead, in the order of bytes not KB. Presumably the same
> applies when the file is read into RAM.
>
> As to the RAM overheads of dict, list, and class instances, I've just found
> some stackoverflow discussions.
> One <http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory>
> says that for large lists in CPython, "the overallocation is 12.5 percent".
>

So the first question is whether you really need the data to all be 
instantly addressable in RAM at one time.  If you can do all your 
processing a row at a time, then the problem goes away.
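
For instance, if all you ultimately need is some aggregate, a loop like
this never holds more than one row at a time (same placeholder names as
the earlier sketch):

import ast
import csv

total = 0
with open('data.csv') as f:
    for row in csv.DictReader(f):
        total += sum(ast.literal_eval(row['targets']))   # process, then discard the row
print(total)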

Assuming you do need random access to the rows, the next thing to 
consider is whether a dict is the best way to describe the "columns". 
Since every dict has the same keys, and since they're presumably known 
to your source code, a custom class for the row is probably better, and 
a namedtuple is probably exactly what you want.  There is then no 
per-row overhead for the names of the columns, and the elements of the 
tuple are either ints or lists of ints.
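
A minimal sketch, with made-up field names:

from collections import namedtuple

# 'node_id' and 'targets' are hypothetical columns -- use whatever your header says
Row = namedtuple('Row', ['node_id', 'targets'])

r = Row(7, [2, 5, 11])
print(r.node_id)     # fields are accessed by name, like attributes
print(r.targets)     # but the row is stored as compactly as a plain tuple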

If that's not compact enough, then the next thing to consider is how you 
store those ints.  If there are lots of them, and especially if you can 
constrain how big the largest is, then you could use the array module. 
It assumes all the numeric items are limited to a particular size, and 
you can specify that size.  For example, if all the ints are nonnegative 
and less than 256, you could do:

import array

# typecode 'B' stores each value as one unsigned byte (0-255)
myarray = array.array('B', mylist)

An array is somewhat slower than a list, but it holds lots more integers 
in a given space.
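
A rough comparison (exact numbers vary by build, and getsizeof does not
count the int objects that the list merely points to):

import sys
from array import array

ints = list(range(200))                   # 200 small nonnegative ints
print(sys.getsizeof(ints))                # list: about 8 bytes per pointer slot
print(sys.getsizeof(array('B', ints)))    # array('B'): one byte per value, plus a header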

Since RAM size is your concern, the fact that you happen to serialize the 
data into a csv file is irrelevant.  That's a good choice if you want to be able 
to examine the data in a text editor, or import it into a spreadsheet. 
If you have other requirements, we can figure them out in a separate 
question.

-- 
DaveA

