[Tutor] A CSV field is a list of integers - how to read it as such?
DoanVietTrungAtGmail
doanviettrung at gmail.com
Mon Mar 4 07:48:28 CET 2013
Don, Dave - Thanks for your help!
Don: Thanks! I've just browsed the ast module documentation. Much of it goes
over my head, but the ast.literal_eval helper function works beautifully for me.
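
For readers following along, here is a minimal sketch of how ast.literal_eval might be applied to the rows that csv.DictReader returns. It is written for Python 3 (the code in this thread is Python 2.7). The sample text mirrors the records quoted below, and the helper name read_records is a hypothetical one chosen for illustration.

```python
import ast
import csv
import io

# Sample text in the layout csv.DictWriter would produce for the thread's records:
# a list is written as its str() form and quoted when it contains commas.
CSV_TEXT = (
    "id,type,level,ListInRecord\r\n"
    '1,1,1,"[2, 9]"\r\n'
    '2,1,1,"[1, 9]"\r\n'
    "3,2,1,[2]\r\n"
    "9,3,0,[]\r\n"
)


def read_records(text):
    """Parse every field with ast.literal_eval: '1' -> 1, '[2, 9]' -> [2, 9]."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [
        {key: ast.literal_eval(value) for key, value in row.items()}
        for row in reader
    ]


records = read_records(CSV_TEXT)
for record in records:
    print(record["id"], record["ListInRecord"])
```

Note that ast.literal_eval only accepts Python literals, so a column holding free text would need to be skipped or handled separately.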
Dave: Again, thanks! Also, you asked "More space efficient than what?" I
meant .csv versus dict, list, and objects. Specifically, if I read a
10-million-row .csv file into RAM, how does its RAM footprint compare with
that of a list or dict containing 10M equivalent items, or of 10M equivalent
class instances living in RAM? I've just tested and learned that a .csv file
has very little overhead, on the order of bytes, not KB. Presumably the same
applies when the file is read into RAM.
As to the RAM overheads of dict, list, and class instances, I've just found
some stackoverflow discussions. One
<http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory>
says that for large lists in CPython, "the overallocation is 12.5 percent".
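
A rough way to compare the footprint of the text with that of the parsed objects is sys.getsizeof. The sketch below assumes CPython on Python 3, and the exact numbers vary by version and platform. It only illustrates that a list of int objects usually takes more RAM than the same data held as text, because each element costs a pointer in the list plus an int object.

```python
import sys

values = list(range(1000))

# Size of the same data as CSV-style text (one byte per ASCII character).
text_bytes = len(str(values))

# Rough in-memory size: the list's pointer array plus each int object.
# This is an estimate; it ignores that CPython shares small int objects.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print("as text:        ", text_bytes, "bytes")
print("as list of ints:", list_bytes, "bytes")
```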
Trung Doan
============
On Mon, Mar 4, 2013 at 2:12 PM, Dave Angel <davea at davea.name> wrote:
> On 03/03/2013 09:24 PM, DoanVietTrungAtGmail wrote:
>
>> Dear tutors
>>
>> I am checking out csv as a possible data structure for my records. In each
>> record, some fields are an integer and some are a list of integers of
>> variable length. I use csv.DictWriter to write data. When reading out
>> using
>> csv.DictReader, each row is read as a string, per the csv module's
>> standard
>> behaviour. To get these columns as lists of integers, I can think of only
>> a
>> multi-step process: first, remove the brackets enclosing the string;
>> second, split the string into a list containing substrings; third, convert
>> each substring into an integer. This process seems inelegant. Is there a
>> better way to get integers and lists of integers from a csv file?
>>
>> Or, is a csv file simply not the best data structure given the above
>> requirement?
>>
>
> Your terminology is very confusing. A csv is not a data structure, it's a
> method of serializing lists of strings. Or in this case dicts of strings.
> If a particular dict value isn't a string, it'll get converted to one
> implicitly. csv does not handle variable length records, so this is close
> to the best you're going to do.
>
>
>> Apart from csv, I considered using a dict or list, or using an
>> object to represent each row.
>>
>
> Objects don't exist in a file, so they don't persist between multiple runs
> of the program. Likewise dict and list. So no idea what you really meant.
>
>
>> I am being attracted to csv because csv means
>> serialisation is unnecessary, I just need to close and open the file to
>> stop and continue later (it's a simulation experiment).
>>
>
> Closing and opening don't do anything to persist data, but we can guess
> you must have meant to imply reading and writing as well. And you've
> nicely finessed the serialization in the write step, but as you discovered,
> you'll have to handle the deserialization to get back to ints and list.
>
>
>> Also, I am guessing
>> but haven't checked, csv is more space efficient.
>>
>
> More space efficient than what?
>
>
>> Each row contains a few
>> integers plus a few lists containing hundreds of integers, and there will
>> be up to hundreds of millions of rows.
>>
>> CODE: My Python 2.7 code is below. It doesn't have the third step
>> (substring -> int).
>>
>> import csv
>>
>> record1 = {'id':1, 'type':1, 'level':1, 'ListInRecord':[2, 9]}
>> record2 = {'id':2, 'type':1, 'level':1, 'ListInRecord':[1, 9]}
>> record3 = {'id':3, 'type':2, 'level':1, 'ListInRecord':[2]}
>> record9 = {'id':9, 'type':3, 'level':0, 'ListInRecord':[]}
>> rows = [record1, record2, record3, record9]
>> header = ['id', 'type', 'level', 'ListInRecord']
>>
>> with open('testCSV.csv', 'wb') as f:
>>     fCSV = csv.DictWriter(f, header)
>>     fCSV.writeheader()
>>     fCSV.writerows(rows)
>>
>> with open('testCSV.csv', 'r') as f:
>>     fCSV = csv.DictReader(f)
>>     for row in fCSV:
>>
>
> I'd add the deserialization here. For each item in row, if the value
> begins and ends with [ ] then make it into a list, and if a digit or
> minus-sign, make it into an int. Then for the lists, convert each element
> to an int. You can use Don Jennings' suggestion to save a lot of effort
> here.
>
> This should reconstruct the original recordn precisely. But it'll take
> some testing to be sure.
>
>
>>         print 'ID=', row['id'], 'ListInRecord=', \
>>             row['ListInRecord'][1:-1].split(', ')  # I want this to be a list of integers, NOT list of strings
>>
>> OUTPUT:
>>
>> ID= 1 ListInRecord= ['2', '9']
>> ID= 2 ListInRecord= ['1', '9']
>> ID= 3 ListInRecord= ['2']
>> ID= 9 ListInRecord= ['']
>>
>>
>
> --
> DaveA
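
The manual deserialization Dave describes in the quoted reply (bracketed values become lists, values starting with a digit or a minus sign become ints) could be sketched roughly as follows. This is only one possible reading of his description, written for Python 3. The function names deserialize_field and deserialize_row are hypothetical and not from the thread.

```python
def deserialize_field(text):
    """Turn one CSV field into a list of ints, an int, or leave it as a string."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1].strip()
        # An empty list is written as "[]"; splitting "" would give [''],
        # which is the odd last line in the quoted OUTPUT.
        if not inner:
            return []
        return [int(part) for part in inner.split(",")]
    if text and (text[0].isdigit() or text[0] == "-"):
        return int(text)
    return text


def deserialize_row(row):
    """Apply deserialize_field to every value of a csv.DictReader row."""
    return {key: deserialize_field(value) for key, value in row.items()}


row = {"id": "9", "type": "3", "level": "0", "ListInRecord": "[]"}
print(deserialize_row(row))
print(deserialize_field("[2, 9]"))
```

Compared with ast.literal_eval, this version leaves unrecognised text as a plain string, so a stray text column does not raise an error.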