[Tutor] Script to collect values from .csv

Dave Angel d at davea.name
Thu Jul 12 23:27:05 CEST 2012


On 07/12/2012 12:29 PM, Emma Knowles wrote:
> Hi all,
>
>
> I have a very large .csv (correlationfile, which is 16 million lines
> long) which I want to split into smaller .csvs. The smaller csvs
> should be created by searching for a value and printing any line
> which contains that value - all these values are contained in another
> .csv (vertexfile). I think that I have an indentation problem or have
> made a mistake with my loops because I only get data in one of the
> output .csvs (outputfile) which is for the first one of the values.
> The other .csvs are empty.
>
>
> Can somebody help me please?
>
>
> Thanks so much!
>
>
> Emma
>
>
> import os
>
> path = os.getcwd()
>
> vertexfile = open(os.path.join(path,'vertices1.csv'),'r')
>
> correlationfile = open(os.path.join(path,'practice.csv'),'r')
>
> x = ''
>
> for v in vertexfile:
>
> vs = v.replace('\n','')
>
> outputfile = open(os.path.join(path,vs+'.csv'),'w')
>
> for c in correlationfile:
>
> cs = c.replace('\n','').split(',')
>
> if vs == cs[0]: print vs
>
> outputfile.write(x)
>
> outputfile.close()
>
>

Welcome to the Python list.

There are a number of problems with your script.  I suggest writing
simpler functions, rather than trying to accomplish the whole thing in
one double loop.  I also suggest testing the pieces, instead of trying
to observe behavior of the whole thing.
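
As a rough illustration of testing the pieces separately, here is a minimal sketch. The helper names first_field and matching_lines are made up for this example, and it uses Python 3 syntax (the script above is Python 2). The sample lines are invented, since the real layout of the correlation file isn't shown.

```python
def first_field(line):
    """Return the text before the first comma, without the trailing newline."""
    return line.rstrip('\n').split(',')[0]


def matching_lines(key, lines):
    """Return the lines whose first field equals key, unchanged."""
    return [line for line in lines if first_field(line) == key]


# Try the pieces on a few in-memory lines before touching the real files.
sample = ['a,b,0.5\n', 'b,c,0.1\n', 'a,c,0.9\n']
print(first_field(sample[0]))
print(matching_lines('a', sample))
```

Because neither function opens a file, each can be checked at the interactive prompt with a handful of hand-written lines.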

It's a bit hard to observe indentation since you sent it as an html
message.  Please use plain text messages.  I can find out what you
really intended by looking at the raw message, but that shouldn't be
necessary, and not everyone can do that.

I'm amazed that you see any data in the output files, since you only
write empty strings to them.  Variable x is "" and doesn't change, as
far as I can see.

You say vertexfile is a csv, but I don't see any effort to parse it. 
You just assume that each line is the key value.
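
A possible way to parse the vertex file, sketched under the assumption that the key is the first column of each row (the actual layout of vertices1.csv isn't shown). The name read_keys is hypothetical, and the syntax is Python 3.

```python
import csv
import io


def read_keys(lines):
    """Return the first field of every non-empty row, stripped of spaces."""
    keys = []
    for row in csv.reader(lines):
        # csv.reader yields an empty list for a blank line; skip those.
        if row and row[0].strip():
            keys.append(row[0].strip())
    return keys


# io.StringIO stands in for the opened vertex file in this demo.
vertex_text = io.StringIO('v1,extra\nv2\n\n v3 ,x\n')
print(read_keys(vertex_text))
```

With a real file, the same function could be passed the object returned by open().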

In the loop  "for c in correlationfile"  you don't write to any file. 
All you do is print the key for each line which "matches" it.  Instead
of printing, you probably meant to do a write(), and instead of writing
vs you probably meant to write c.
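
A minimal sketch of that change, in Python 3 syntax. The function name write_matches and its parameters are made up for illustration. One extra point, offered as something to check rather than a certain diagnosis: a file object is an iterator, so once the inner loop has read correlationfile to the end, later passes over the same object yield nothing. Reopening the input for each key, as this sketch does, sidesteps that.

```python
def write_matches(key, correlation_path, output_path):
    """Copy every line whose first field equals key into output_path.

    Returns the number of lines written.
    """
    count = 0
    # Opening the input here gives each key a fresh pass over the data.
    with open(correlation_path, 'r') as correlationfile:
        with open(output_path, 'w') as outputfile:
            for c in correlationfile:
                cs = c.rstrip('\n').split(',')
                if cs[0] == key:
                    outputfile.write(c)   # write the whole line, not the key
                    count += 1
    return count
```

Rereading a 16-million-line file once per key may be slow; reading it a single time and routing each line to the right output file would be an alternative design.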

As Prasad said, your close is indented wrong.  But that's not causing
your symptoms, as CPython will close your file for you when you open
another one and bind it to outputfile, next time through the loop. 
Still, it's useful to get it right, because the problems can be subtle.
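
One way to avoid the question of where close() belongs is a with statement, which closes the file when its block ends. The sketch below is Python 3; the key names and the scratch directory are invented for the demo.

```python
import os
import tempfile

folder = tempfile.mkdtemp()          # scratch directory for the demo
keys = ['v1', 'v2']                  # hypothetical vertex values

handles = []
for vs in keys:
    # The with block closes outputfile when its body ends, once per key.
    with open(os.path.join(folder, vs + '.csv'), 'w') as outputfile:
        outputfile.write(vs + ',example\n')
    handles.append(outputfile)

# Every file object is already closed here, without an explicit close().
print([f.closed for f in handles])
```

This does not rely on CPython's reference counting, so it should behave the same on other Python implementations.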

I second the advice to use the csv module.  That's what it's there for. 
But if you know there are no embedded commas in the zeroth field, it'll
probably work okay without.
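
To show where the two approaches differ, here is a small Python 3 comparison on an invented line whose first field is quoted and contains a comma.

```python
import csv

# A made-up line: the first field contains a comma inside quotes.
line = '"Smith, J",v2,0.75\n'

naive = line.rstrip('\n').split(',')      # splits inside the quotes
parsed = next(csv.reader([line]))         # respects the quoting

print(naive)
print(parsed)
```

The plain split() breaks the quoted field into two pieces and keeps the quote characters, while csv.reader returns it as a single field.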


-- 

DaveA


