[Tutor] A CSV field is a list of integers - how to read it as such?
Dave Angel
davea at davea.name
Mon Mar 4 14:24:47 CET 2013
On 03/04/2013 01:48 AM, DoanVietTrungAtGmail wrote:
> Don, Dave - Thanks for your help!
>
> Don: Thanks! I've just browsed the AST documentation, much of it goes over
> my head, but the ast.literal_eval helper function works beautifully for me.
>
> Dave: Again, thanks! Also, you asked "More space efficient than what?" I
> meant .csv versus dict, list, and objects. Specifically, if I read a
> 10-million row .csv file into RAM, how is its RAM footprint compared to a
> list or dict containing 10M equivalent items, or to 10M equivalent class
> instances living in RAM.
Once a csv file has been read by a csv reader (such as DictReader), it's
no longer a csv file. The data in memory never exists as a copy of the
file on disk. The way you wrote the code, each row exists as a dict of
strings, but more commonly, each row would exist as a list of strings.
The csv logic does not keep more than one row at a time, so if you want
a big list to exist all at once, you'll be building it yourself, perhaps
by using append inside the loop instead of the print you're doing now.
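Something like this, roughly (I'm guessing at the file name, and assuming
the last column is the one holding the list of ints as text, which is
where the ast.literal_eval you mentioned comes in):

import ast
import csv

rows = []
with open('data.csv') as f:
    for row in csv.reader(f):
        # turn text like "[1, 2, 3]" back into a real list of ints
        row[-1] = ast.literal_eval(row[-1])
        rows.append(row)   # build the big list instead of printing each row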
So the question is not how much RAM the csv data takes up, but how much
RAM is used by whatever form you store it in. There, you shouldn't worry
about the overhead of the list itself, but about the overhead of however
you store each individual row. When a list overallocates, the unused
slots each take up only 4 or 8 bytes (the size of a pointer), as opposed
to probably thousands of bytes for each row that is actually used.
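If you want to see those numbers for yourself, sys.getsizeof gives a
rough idea (it only counts the container itself, not the objects it
points to, and the field names below are made up):

import sys

row_as_dict = {'node': 1, 'neighbours': [10, 20, 30]}
row_as_list = [1, [10, 20, 30]]
row_as_tuple = (1, [10, 20, 30])

# compare the per-row overhead of each representation
for row in (row_as_dict, row_as_list, row_as_tuple):
    print(type(row).__name__, sys.getsizeof(row))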
> I've just tested and learned that a .csv file has
> very little overhead, in the order of bytes not KB. Presumably the same
> applies when the file is read into RAM.
>
> As to the RAM overheads of dict, list, and class instances, I've just found
> some stackoverflow discussions.
> One <http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory>
> says that for large lists in CPython, "the overallocation is 12.5 percent".
>
So the first question is whether you really need the data to all be
instantly addressable in RAM at one time. If you can do all your
processing a row at a time, then the problem goes away.
Assuming you do need random access to the rows, the next thing to
consider is whether a dict is the best way to describe the "columns".
Since every dict has the same keys, and since they're presumably known
to your source code, a custom class for the row is probably better, and
a namedtuple is probably exactly what you want. There is then no
overhead for the names of the columns, and the elements of the tuple are
either ints or lists of ints.
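Roughly like this (the column names are invented, and I'm assuming each
row has exactly two fields, an int followed by the list of ints as text):

import ast
import csv
from collections import namedtuple

Row = namedtuple('Row', ['node_id', 'neighbours'])

rows = []
with open('data.csv') as f:
    for node_id, neighbours in csv.reader(f):
        rows.append(Row(int(node_id), ast.literal_eval(neighbours)))

A namedtuple gives you row.node_id and row.neighbours for readability,
but each instance is stored as compactly as a plain tuple.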
If that's not compact enough, then the next thing to consider is how you
store those ints. If there's lots of them, and especially if you can
constrain how big the largest is, then you could use the array module.
It assumes all the numeric items are limited to a particular size, and
you can specify that size. For example, if all the ints are nonnegative
and less than 256, you could do:
import array
myarray = array.array('B', mylist)   # 'B' = unsigned byte, holds 0 through 255
An array is somewhat slower than a list, but it holds lots more integers
in a given space.
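You can see the difference with sys.getsizeof again (exact numbers vary
by Python version and platform, and it counts only the list's pointers,
not the int objects they refer to):

import array
import sys

mylist = list(range(200))            # 200 small ints in a plain list
myarray = array.array('B', mylist)   # same values packed one byte each

print(sys.getsizeof(mylist))
print(sys.getsizeof(myarray))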
Since RAM size is your concern, the fact that you happen to serialize it
into a csv is irrelevant. That's a good choice if you want to be able
to examine the data in a text editor, or import it into a spreadsheet.
If you have other requirements, we can figure them out in a separate
question.
--
DaveA