[Tutor] how to select a column from all files in one directory?

Kent Johnson kent37 at tds.net
Tue Nov 23 19:13:34 CET 2004


To do this you have to collect the column data from each file in a 
list, then combine the lists. The built-in function map(), called with 
None as the function, will combine corresponding list elements. (So 
will zip(), but it stops when it hits the end of the shortest list. 
map() continues until all the lists have been exhausted, padding the 
shorter ones with None.)
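
For anyone reading this with Python 3, where map(None, ...) no longer 
works: itertools.zip_longest() gives the same "keep going and pad" 
behavior. A small sketch of the difference, using made-up lists in 
place of real file data (the names col_a, col_b and col_c are only for 
illustration):

```python
from itertools import zip_longest

# Made-up column data standing in for three files of different lengths
col_a = ['1.4', '34.3', '7.0']
col_b = ['34.5', '21.3']
col_c = ['34.2']

# zip() stops as soon as the shortest input runs out
short_rows = list(zip(col_a, col_b, col_c))

# zip_longest() continues to the end of the longest input and pads
# the missing cells with fillvalue
long_rows = list(zip_longest(col_a, col_b, col_c, fillvalue=''))

for row in long_rows:
    print('\t'.join(row))
```

Padding with '' rather than None means every cell is a string, so 
'\t'.join() can be applied to each row directly.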

Here is a fairly straightforward way to do it. This requires that the 
data from column 4 of all the files will fit in memory at once.

#################################
import os

path = "C:\\Documents and Settings\\data\\"
files = os.listdir(path)

# This will contain one list for each file, with the contents of column 4
listOfColumns = []

for inFile in files:
     columnData = []
     f = file(path + inFile,'r')
     for line in f:
         b = line.rstrip('\n').split('\t')
         if len(b)>3:
             columnData.append(b[3])
         else:
             # Row data is too short
             columnData.append('Error')
     f.close()

     listOfColumns.append(columnData)

# This creates a list of row lists from the list of column lists
# If any of the column lists are too short they will be padded with None
rows = map(None, *listOfColumns)

out = file('output.txt','w')
for row in rows:
     # Replace the None padding with '' so that join() only sees strings
     out.write('\t'.join([cell or '' for cell in row]))
     out.write('\n')
out.close()
###########################
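
The same in-memory approach could be written for Python 3 roughly as 
follows. This is only a sketch: the function names read_column() and 
combine_columns() are made up for this example, and it assumes the 
directory contains nothing but the tab-delimited data files and that 
the output file is written somewhere else.

```python
import os
from itertools import zip_longest

def read_column(filename, index=3, missing='Error'):
    """Return column `index` (0-based) from every line of a tab-delimited file."""
    column = []
    with open(filename) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            # Rows that are too short get a placeholder instead of raising
            column.append(fields[index] if len(fields) > index else missing)
    return column

def combine_columns(directory, out_name, index=3):
    """Write column `index` of each file in `directory` side by side."""
    # Sorted so the order of the output columns is predictable
    names = sorted(os.listdir(directory))
    columns = [read_column(os.path.join(directory, n), index) for n in names]
    with open(out_name, 'w') as out:
        # Short columns are padded with '' so every row has one cell per file
        for row in zip_longest(*columns, fillvalue=''):
            out.write('\t'.join(row) + '\n')
```

It would be called as combine_columns(path, 'output.txt'), with 
output.txt kept outside the data directory so it is not picked up as 
input on the next run.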

Here is a version that uses itertools.izip() to iterate over all the 
files at once. This only requires enough memory to hold one row from 
each file at the same time (plus 40 open file objects). This will stop 
when it hits the end of the file with the fewest lines.

###################################
import os
from itertools import izip

def get3(line):
     ''' A function to extract the fourth column from a line of data '''
     data = line.rstrip('\n').split('\t')
     try:
         return data[3]
     except IndexError:
         return 'Error'

path = "C:\\Documents and Settings\\data\\"

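# This makes a list of all the open files. The loop variable is not
# called 'file' because that would rebind the built-in file() used below.
files = map(open, [path+name for name in os.listdir(path)])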

out = file('output.txt','w')

# This iterates all the files in parallel
for rows in izip(*files):
     # rows is now a tuple with one line of data from each open file
     data = [get3(row) for row in rows]
     out.write('\t'.join(data))
     out.write('\n')

out.close()
for f in files:
     f.close()
##############################
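
In Python 3 the built-in zip() is already lazy, so izip() is not 
needed. A rough sketch of the same streaming idea is below. The names 
fourth_field() and stream_columns() are invented for this example, 
contextlib.ExitStack is used only to make sure every file gets closed, 
and the directory is assumed to hold only the data files.

```python
import os
from contextlib import ExitStack

def fourth_field(line, missing='Error'):
    """Extract the fourth tab-separated field from one line of data."""
    fields = line.rstrip('\n').split('\t')
    return fields[3] if len(fields) > 3 else missing

def stream_columns(directory, out_name):
    """Write the fourth column of every file side by side, one row at a time."""
    names = sorted(os.listdir(directory))
    with ExitStack() as stack:
        # All files stay open at once; ExitStack closes them on exit
        files = [stack.enter_context(open(os.path.join(directory, n)))
                 for n in names]
        out = stack.enter_context(open(out_name, 'w'))
        # zip() pulls one line from each file per step and stops at the
        # end of the file with the fewest lines
        for lines in zip(*files):
            out.write('\t'.join(fourth_field(line) for line in lines) + '\n')
```

As with the izip() version, rows beyond the end of the shortest file 
are dropped; swapping zip() for itertools.zip_longest() with 
fillvalue='' would keep them, at the cost of 'Error' cells for the 
files that have run out.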

Warning: I haven't actually tested either of these!
Kent

kumar s wrote:
> Dear Kent and John, 
>   thank you very much for your solutions. It is
> working, however, there is a problem in printing the
> output in a proper way:
> 
>  
> 
> >>> path = "C:\\Documents and Settings\\data\\"
> >>> files = os.listdir(path)
> >>> out = file('output.txt','w')
> >>> for inFile in files:
> 	f = file(path + inFile,'r')
> 	for line in f:
> 		b = line.split('\t')
> 		if len(b)>3:
> 			out.write(b[3] + '\n' )
> 	f.close()
> 
> 
> 
> Here the output file has the 4th column from all the
> files in one single column.
> 
> I wanted to have 40 columns (the 4th column of each of
> the 40 files) instead of one column.
> 
> I do not know how to ask the loop to print the contents
> of the 4th column of the second file after a tab.
> 
> file1    file2    file3    file4    file5
> 1.4      34.5     34.2     567.3    344.2
> 34.3     21.3     76.2     24.4     34.4
> ...      ...      ...      ...      ...
> 
> 
> could you please suggest. 
> 
> thanks
> kumar.
>  
> 
> 
> 
> --- Kent Johnson <kent37 at tds.net> wrote:
> 
> 
>>This will only output the data from the first line of each file in the
>>source directory. If the files have multiple lines, you need another
>>loop. See below.
>>
>>Also you can make a string with a tab in it with '\t', maybe more
>>readable than this: '	'
>>
>>Kent
>>
>>John Purser - Gmail wrote:
>>
>>>Morning Kumar,
>>>
>>>This worked for me:
>>>###Code Start###
>>>import os
>>>path = 'C:\\Documents and Settings\\WHOEVER\\My Documents\\temp\\'
>>>files = os.listdir(path)
>>>out = file('output.txt', 'w')
>>>
>>>for inFile in files:
>>>	f = file(path + inFile, 'r')
>>	for line in f:
>>	 	b = line.split('	')	#There's a tab in there, not spaces
>>	 	if len(b) > 3:
>>	 		out.write(b[3] + '\n')
>>>	f.close()
>>>
>>>out.close()
>>>
>>>###Code End###
>>>
>>>
>>>I'd give the output file an output directory of its own to be sure I
>>>didn't clobber anything, and of course those are Windows directory
>>>separators.
>>>
>>>John Purser
>>>
>>>-----Original Message-----
>>>From: tutor-bounces at python.org [mailto:tutor-bounces at python.org]On
>>>Behalf Of kumar s
>>>Sent: Tuesday, November 23, 2004 05:38
>>>To: tutor at python.org
>>>Subject: [Tutor] how to select a column from all files in one directory?
>>>
>>>
>>>Dear group,
>>> I have ~40 tab-delimited text files in one
>>>directory. I have to select the 4th column from all these
>>>files and write it into a single tab-delimited text
>>>file. How can I do this in Python?
>>>
>>>I read a prev. post on tutor and tried the following:
>>>
>>>>>files = listdir('path')
>>>
>>>
>>>This returns the files (file names) in the directory as
>>>a list. However, I need the 4th column from all the
>>>files in the directory.
>>>
>>>Can you please help me.
>>>
>>>Thank you.
>>>
>>>-kumar
>>>
>>>_______________________________________________
>>>Tutor maillist  -  Tutor at python.org
>>>http://mail.python.org/mailman/listinfo/tutor

