Reading a large csv file

Lie Ryan lie.1296 at gmail.com
Wed Jun 24 15:57:09 EDT 2009


Mag Gam wrote:
> Sorry for the delayed response. I was trying to figure this problem
> out. The OS is Linux, BTW

Maybe I'm just being pedantic, but saying your OS is Linux means little,
as there are hundreds of variants (distros) of Linux. (Not to mention
that Linux is a kernel, not a full-blown OS, and people at GNU will
insist on calling a Linux-based OS GNU/Linux.)

> Here is some code I have:
> import numpy as np
> from numpy import *

Why are you importing numpy twice, once as np and once with *?
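
If one of the two has to go, I would drop the star import: it copies
every public name of the module into your namespace and silently
replaces built-ins of the same name. The sketch below uses the standard
library's math module as a stand-in so that it runs without numpy;
numpy's star import does the same thing to names such as sum, any and
all.

```python
import builtins
import math

# Run the star import in a scratch namespace so its effect can be inspected.
namespace = {}
exec("from math import *", namespace)

# After the star import, the name `pow` refers to math.pow,
# not to the built-in pow.
shadowed = namespace["pow"] is math.pow and namespace["pow"] is not builtins.pow

# The two functions are not interchangeable: the built-in accepts a third
# (modulus) argument, while math.pow always returns a float.
builtin_result = pow(2, 10, 1000)       # built-in: modular exponentiation
star_result = namespace["pow"](2, 10)   # math.pow: float result
```

With only "import numpy as np", every numpy name is spelled np.something
and the built-ins stay untouched.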

> import gzip
> import h5py
> import re
> import sys, string, time, getopt
> import os
> 
> src=sys.argv[1]
> fs = gzip.open(src)
> x=src.split("/")
> filename=x[len(x)-1]
> 
> #Get YYYY/MM/DD format
> YYYY=(filename.rsplit(".",2)[0])[0:4]
> MM=(filename.rsplit(".",2)[0])[4:6]
> DD=(filename.rsplit(".",2)[0])[6:8]
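
You call filename.rsplit(".",2)[0] three times here. A minimal sketch of
a tidier version, assuming the file name always starts with an
eight-digit YYYYMMDD stamp (the function name split_date and the example
file name are mine, not from your code):

```python
import os.path


def split_date(path):
    """Return (YYYY, MM, DD) from a path whose file name starts with YYYYMMDD."""
    # Equivalent to src.split("/")[-1] on Linux.
    filename = os.path.basename(path)
    # Assumption: the date is a fixed-width prefix, so plain slicing is enough.
    stamp = filename[:8]
    return stamp[0:4], stamp[4:6], stamp[6:8]


# Hypothetical input path, for illustration only.
year, month, day = split_date("/data/20090624.csv.gz")
```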

> 
> f=h5py.File('/tmp/test_foo/FE.hdf5','w')

This particular line makes it impossible to have more than one
instance of the program running at once, since every instance would
write to the same file. That may not be your concern...
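
If it ever does become a concern, one option is to derive the output
path from the input file name, so that each instance writes to its own
file. This is only a sketch: output_path_for is a hypothetical helper,
and it gives up the single-file layout that your year/month/day groups
suggest you want.

```python
import os.path


def output_path_for(src, out_dir="/tmp/test_foo"):
    """Hypothetical helper: derive a distinct .hdf5 path from the input file name."""
    base = os.path.basename(src)
    # Keep everything before the first dot: "20090624.csv.gz" -> "20090624".
    stem = base.split(".", 1)[0]
    return os.path.join(out_dir, stem + ".hdf5")


# Two different inputs map to two different output files.
first = output_path_for("/data/20090624.csv.gz")
second = output_path_for("/data/20090625.csv.gz")
```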

> 
> grp="/"+YYYY
> try:
>   f.create_group(grp)
> except ValueError:
>   print "Year group already exists"
> 
> grp=grp+"/"+MM
> try:
>   f.create_group(grp)
> except ValueError:
>   print "Month group already exists"
> 
> grp=grp+"/"+DD
> try:
>   group=f.create_group(grp)
> except ValueError:
>   print "Day group already exists"
> 

> str_type=h5py.new_vlen(str)

> mydescriptor = {'names': ('gender','age','weight'), 'formats': ('S1',
> 'f4', 'f4')}
> print "Filename is: ",src
> fs = gzip.open(src)

> dset = f.create_dataset ('Foo',data=arr,compression='gzip')

What is `arr`?

> s=0
> 
> #Takes the longest here
> for y in fs:
>      continue
>   a=y.split(',')

>   s=s+1
>   dset.resize(s,axis=0)

You increment s by 1 and resize the dataset on every iteration; would
this copy the dataset each time? (I have never worked with h5py, so I
don't know how it works.)
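
Whatever resize does internally, calling it once per batch of rows
instead of once per row should cut the number of calls by a large
factor. A minimal sketch of the batching part, using only the standard
library; batched_rows and the batch size are my own invention, and the
code assumes the lines are text (in Python 3 that means opening the file
with gzip.open(src, "rt")):

```python
from itertools import islice


def batched_rows(lines, batch_size=10000):
    """Yield lists of up to batch_size parsed rows from an iterable of CSV lines."""
    it = iter(lines)
    while True:
        # Pull at most batch_size lines and split each one on commas.
        batch = [line.rstrip("\n").split(",") for line in islice(it, batch_size)]
        if not batch:
            return
        yield batch


# Sketch of the calling loop; `fs` and `dset` are the names from the quoted
# code, so this part is left as comments rather than runnable code.
# s = 0
# for batch in batched_rows(fs):
#     s += len(batch)
#     dset.resize(s, axis=0)   # one resize per batch instead of one per row

# Small in-memory demonstration with a hypothetical three-line input.
demo = list(batched_rows(["M,21,72\n", "F,35,54\n", "M,27,74\n"], batch_size=2))
```

Here demo holds two batches: the first with two rows, the second with
the remaining one.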


