[Tutor] A CSV field is a list of integers - how to read it as such?

Steven D'Aprano steve at pearwood.info
Mon Mar 4 08:27:46 CET 2013


On 04/03/13 17:48, DoanVietTrungAtGmail wrote:
> Don, Dave - Thanks for your help!
>
> Don: Thanks! I've just browsed the AST documentation, much of it goes over
> my head, but the ast.literal_eval helper function works beautifully for me.
>
> Dave: Again, thanks! Also, you asked "More space efficient than what?" I
> meant .csv versus dict, list, and objects. Specifically, if I read a
> 10-million row .csv file into RAM, how is its RAM footprint compared to a
> list or dict containing 10M equivalent items, or to 10M equivalent class
> instances living in RAM. I've just tested and learned that a .csv file has
> very little overhead, in the order of bytes not KB. Presumably the same
> applies when the file is read into RAM.

How many items per row? How many characters per item?

CSV files are just text files. So they'll take as much memory as they have
characters, multiplied by the number of bytes per character, e.g.:

ASCII or Latin-1: 1 byte per character

UTF-16: 2 bytes per character (4 for characters outside the Basic
Multilingual Plane)

UTF-32: 4 bytes per character

UTF-8: variable, depends on the characters but typically close to 1 byte for
Western-European text.
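
As a rough illustration of those figures, here is a minimal sketch that
encodes one short string several ways and counts the bytes. The sample text
and the helper name encoded_size are made up for the example. The "-le"
codec variants are used so that Python does not prepend a byte-order mark,
which would otherwise add 2 or 4 bytes to the count.

```python
# Compare how many bytes the same text needs in different encodings.
text = "12345,hello"  # 11 characters, all ASCII (made-up sample)


def encoded_size(s, encoding):
    """Return the number of bytes s occupies once encoded."""
    return len(s.encode(encoding))


encodings = ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le")
sizes = {enc: encoded_size(text, enc) for enc in encodings}

for enc in encodings:
    print(enc, sizes[enc])

# A non-ASCII character shows why UTF-8 is "variable":
# one byte in Latin-1, two bytes in UTF-8.
e_acute_latin1 = encoded_size("\u00e9", "latin-1")
e_acute_utf8 = encoded_size("\u00e9", "utf-8")
print(e_acute_latin1, e_acute_utf8)
```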


Suppose you have a CSV file stored in UTF-16, with 10 million rows, one hundred
columns per row, and each column averaging 30 characters. That gives
approximately 6200 bytes per row (3000 characters plus the separators, at 2
bytes each), or 62 gigabytes in total. That's a pretty big file. Does your
computer have 62 GB of memory? If not, you're going to have a bit of trouble
reading in the entire file all at once...
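
One way to arrive at those numbers is sketched below. It assumes 99 commas
and one newline character per row, no quoting, and exactly 2 bytes per
character. Those details are my assumptions for the sake of the arithmetic,
so treat the result as an estimate.

```python
# Back-of-the-envelope size of the hypothetical UTF-16 CSV file.
rows = 10 ** 7          # 10 million rows
columns = 100
chars_per_column = 30   # average
bytes_per_char = 2      # UTF-16, ignoring any byte-order mark

# 100 fields, 99 commas between them, and 1 newline (assumed).
chars_per_row = columns * chars_per_column + (columns - 1) + 1
bytes_per_row = chars_per_row * bytes_per_char
total_bytes = rows * bytes_per_row

print(bytes_per_row)            # 6200
print(total_bytes / 10 ** 9)    # 62.0 (gigabytes, using 1 GB = 10**9 bytes)
```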

But if you process only one row at a time, you only have to handle about 6.2 KB
at a time. When that gets converted into a list of strings, that will take
about 24 KB.
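
Here is a minimal sketch of row-at-a-time processing, using the
ast.literal_eval helper mentioned above to turn the text of a field into a
list of integers. The sample data, the two-column layout, and the function
name iter_rows are invented for the example. An in-memory io.StringIO
stands in for a real file opened with open().

```python
import ast
import csv
import io

# Made-up sample: the second field is a list of integers stored as text.
sample = io.StringIO('alpha,"[1, 2, 3]"\nbeta,"[40, 50]"\n')


def iter_rows(f):
    """Yield (name, list_of_ints) pairs, reading one row at a time."""
    for name, raw in csv.reader(f):
        # literal_eval only accepts Python literals, so it is safer than eval.
        yield name, ast.literal_eval(raw)


# Only the current row is held in memory while the totals accumulate.
totals = {name: sum(values) for name, values in iter_rows(sample)}
print(totals)  # {'alpha': 6, 'beta': 90}
```

Because iter_rows is a generator, the whole file is never loaded at once;
memory use depends on the longest row, not on the number of rows.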



> As to the RAM overheads of dict, list, and class instances, I've just found
> some stackoverflow discussions.
> One <http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory> says
> that for large lists in CPython, "the overallocation is 12.5 percent".


Yes. Do you have a question about it?




-- 
Steven

