Reading a large CSV file

Mag Gam magawake at gmail.com
Wed Jun 24 07:38:11 EDT 2009


Sorry for the delayed response. I was trying to figure this problem
out. The OS is Linux, BTW


Here is some code I have:
import gzip
import os
import sys

import h5py
import numpy as np

src = sys.argv[1]
filename = os.path.basename(src)   # expects a name like YYYYMMDD.csv.gz

# Get YYYY/MM/DD from the filename (YYYYMMDD.csv.gz style)
datestamp = filename.rsplit(".", 2)[0]
YYYY = datestamp[0:4]
MM = datestamp[4:6]
DD = datestamp[6:8]

# 'w' would truncate the file on every run, so the "already exists"
# branches below could never fire; 'a' creates the file or appends to it
f = h5py.File('/tmp/test_foo/FE.hdf5', 'a')

grp="/"+YYYY
try:
  f.create_group(grp)
except ValueError:
  print "Year group already exists"

grp=grp+"/"+MM
try:
  f.create_group(grp)
except ValueError:
  print "Month group already exists"

grp=grp+"/"+DD
try:
  f.create_group(grp)
except ValueError:
  print "Day group already exists"


str_type = h5py.new_vlen(str)   # unused below
mydescriptor = {'names': ('gender', 'age', 'weight'),
                'formats': ('S1', 'f4', 'f4')}
print "Filename is: ", src
fs = gzip.open(src)

# Empty, resizable 1-D dataset of (gender, age, weight) records;
# maxshape=(None,) is what makes resize() below legal
dset = f.create_dataset('Foo', shape=(0,), maxshape=(None,),
                        dtype=np.dtype(mydescriptor), compression='gzip')

s = 0

# Takes the longest here: one resize() call per row is very slow
for y in fs:
  # assuming rows look like: M,32,71.2
  gender, age, weight = y.strip().split(',')
  s = s + 1
  dset.resize(s, axis=0)
  dset[s - 1] = (gender, float(age), float(weight))
fs.close()

f.close()


This works but just takes a VERY long time.

Any way to optimize this?
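
One idea I have been thinking about (untested, and assuming the rows
really do look like 'M,32,71.2'): buffer the rows in a plain list and
resize/write the dataset once per chunk instead of once per row, so
resize() runs thousands of times less often:

CHUNK = 100000   # rows per write; tune for your data

rows = []
total = 0

def flush(rows, total):
  # convert the buffered tuples to a record array and append it in one go
  block = np.array(rows, dtype=np.dtype(mydescriptor))
  total = total + len(block)
  dset.resize(total, axis=0)
  dset[total - len(block):total] = block
  return total

for y in fs:
  gender, age, weight = y.strip().split(',')
  rows.append((gender, float(age), float(weight)))
  if len(rows) == CHUNK:
    total = flush(rows, total)
    rows = []
if rows:   # write whatever is left over
  total = flush(rows, total)
fs.close()

Writing whole blocks should also give the gzip filter chunk-sized
writes to compress instead of single records.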

TIA


On Wed, Jun 24, 2009 at 12:13 AM, Chris Withers<chris at simplistix.co.uk> wrote:
> Terry Reedy wrote:
>>
>> Mag Gam wrote:
>>>
>>> Yes, the system has 64Gig of physical memory.
>>
>> drool ;-).
>
> Well, except that, dependent on what OS he's using, the size of one process
> may well still be limited to 2GB...
>
> Chris
>
> --
> Simplistix - Content Management, Zope & Python Consulting
>           - http://www.simplistix.co.uk
> --
> http://mail.python.org/mailman/listinfo/python-list
>


