joining files

Mon May 17 03:16:25 EDT 2010

mannu jha wrote:
> On Sun, 16 May 2010 13:52:31 +0530  wrote
>   
>> mannu jha wrote:
>> Hi,
>>
>> I have few files like this:
>> file1:
>> 22 110.1 
>> 33 331.5 22.7 
>> 5 271.9 17.2 33.4
>> 4 55.1 
>>
>> file1 has total 4 column but some of them are missing in few row.
>>
>> file2:
>> 5 H
>> 22 0
>>
>> file3:
>> 4 T
>> 5 B
>> 22 C
>> 121 S
>>
>> in all these files first column is the main source of matching their entries. So What I want in the output is only those entries which is coming in all three files.
>>
>> output required:
>>
>> 5 271.9 17.2 33.4 5 H 5 T
>> 22 110.1     22 0 22 C
>
> I am trying with this :
>
> from collections import defaultdict
>
> def merge(sources):
>     blanks = [blank for items, blank, keyfunc in sources]
>     d = defaultdict(lambda: blanks[:])
>     for index, (items, blank, keyfunc) in enumerate(sources):
>         for item in items:
>             d[keyfunc(item)][index] = item
>     for key in sorted(d):
>         yield d[key]
>
> if __name__ == "__main__":
>     a = open("input1.txt")
>     
>     c = open("input2.txt")
>
>     def key(line):
>         return line[:2]
>     def source(stream, blank="", key=key):
>         return (line.strip() for line in stream), blank, key
>     for m in merge([source(x) for x in [a,c]]):
>         print "|".join(c.ljust(10) for c in m)
>
> but with input1.txt:
> 187    7.79   122.27   54.37   4.26   179.75
> 194    8.00   121.23   54.79   4.12   180.06
> 15    8.45   119.04   55.02   4.08   178.89
> 176    7.78   118.68   54.57   4.20   181.06
> 180    7.50   119.21   53.93      179.80
> 190    7.58   120.44   54.62   4.25   180.02
> 152    8.39   120.63   55.10   4.15   179.10
> 154    7.79   119.62   54.47   4.22   180.46
> 175    8.42   120.50   55.31   4.04   180.33
> and input2.txt:
>  15   H 
>  37   H 
>  95   T
> 124   H 
> 130   H 
> 152   H 
> 154   H 
> 158   H 
> 164   H
> 175   H 
> 176   H 
> 180   H
> 187   H 
> 190   T
> 194   C
> 196   H 
> 207   H 
> 210   H 
> 232   H 
> it is giving output as:
>           |
>           |124   H
>           |130   H
> 154    7.79   119.62   54.47   4.22   180.46|158   H
>           |164   H
> 175    8.42   120.50   55.31   4.04   180.33|176   H
> 180    7.50   119.21   53.93      179.80|187   H
> 190    7.58   120.44   54.62   4.25   180.02|196   H
>           |207   H
>           |210   H
>           |232   H
>           |37   H
>           |95   T
> so it not matching it properly, can anyone please suggest where I am doing mistake.
>
There are several mistakes here.  Some just make the code unnecessarily 
complex, but I'll concentrate on the ones that actually break it.

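Your key() function returns the first two characters of the line.  So 
you're keying not on the whole number, but only on its first two 
digits: 187 and 180 both become "18", and 152 and 154 both become 
"15", so a later line overwrites an earlier one in the same slot.  To 
find out what's going on, you need to decompose the complex line from: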

            d[keyfunc(item)][index] = item

to some things you can actually examine:

            key = keyfunc(item)
            d[key][index] = item
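
To make the collision visible, here is a minimal sketch that compares the 
key() from the original post with one possible replacement.  The names 
key_two_chars and key_first_field are hypothetical, and the replacement 
assumes every line is non-empty and starts with the number to match on; 
it is one way to key on the whole number, not the only one.

```python
def key_two_chars(line):
    # Same logic as key() in the original post: first two characters only.
    return line[:2]

def key_first_field(line):
    # Hypothetical replacement: the whole first whitespace-separated field.
    # Assumes the line is not blank; a blank line would raise IndexError.
    return line.split()[0]

# Three lines taken from input1.txt in the original post.
lines = [
    "187    7.79   122.27   54.37   4.26   179.75",
    "180    7.50   119.21   53.93      179.80",
    "15    8.45   119.04   55.02   4.08   178.89",
]

for line in lines:
    # Show both keys side by side so the collision on "18" is obvious.
    print("%r %r" % (key_two_chars(line), key_first_field(line)))
```

With the two-character key, the first two lines land in the same 
dictionary entry.  Note that string keys sort as text ("15" before "5"); 
wrapping the result in int() would give numeric order instead.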

You also never check whether a particular item is in all the 
files.  For your particular data structure, that means checking that a 
value in the dictionary (a list with one slot per file, so two items 
here) contains only non-blank strings.  To do this, you could call 
all() on the list, since a blank string counts as false.
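
As a rough sketch of that check, the version below keeps the defaultdict 
approach from the original post but only yields rows where all() is true. 
The names merge_common and first_field are hypothetical, the key is the 
whole first field converted to int (an assumption, not something the 
original code does), and in-memory lists stand in for the open files.

```python
from collections import defaultdict

def merge_common(sources, blank=""):
    """Yield one row per key that appears in every source."""
    # One slot per source, pre-filled with the blank marker.
    d = defaultdict(lambda: [blank] * len(sources))
    for index, (items, keyfunc) in enumerate(sources):
        for item in items:
            key = keyfunc(item)
            d[key][index] = item
    for key in sorted(d):
        row = d[key]
        # A blank string is false, so all() is true only if every slot is filled.
        if all(row):
            yield row

def first_field(line):
    # Assumed key: the whole first whitespace-separated field, as a number.
    return int(line.split()[0])

# Sample data from the first message in the thread.
file1 = ["22 110.1", "33 331.5 22.7", "5 271.9 17.2 33.4", "4 55.1"]
file2 = ["5 H", "22 0"]
file3 = ["4 T", "5 B", "22 C", "121 S"]

rows = list(merge_common([(f, first_field) for f in (file1, file2, file3)]))
for row in rows:
    print(" ".join(row))
```

Only keys 5 and 22 occur in all three lists, so only those two rows are 
printed.  If a key appears more than once in one file, the last line 
wins, just as in the original merge().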

DaveA