please help with optimisation of this code - update of given table according to another table

Wed Nov 8 08:07:07 EST 2006

In <1162981094.820174.65440 at b28g2000cwb.googlegroups.com>, Farraige wrote:

> Let's say we have a table T1:
> 
> A B C D E
> ---------------
> 1 4  5  7 7
> 3 4  0  0 0
> 
> and we call a method mergeTable(T1, T2, [0,1], [2,4])
> 
> It means that we would like to update columns C and E of table T1 with
> data from table T2 but only in case the key columns A and B are equal
> in both tables.... I grant that the given key is unique in both tables
> so if I find a row with the same key in table T2 I do merging, stop and
> go to next row in table T1...
> 
> Let's say T2 looks following:
> 
> A B C D E
> ---------------
> 2 2  8 8 8
> 1 4  9 9 9
> 
> So after execution of our   mergeTable method, the table T1 should look
> like :
> 
> A B C D E
> 1 4  9  7 9
> 3 4  0 0  0
> 
> The 2nd row ['3', '4',  '0' ,'0',  '0'] didn't change because there was
> no row in table T2 with key = 3 ,4
> 
> The main part of my algorithm now looks something like ...
> 
> merge(t1, t2, keyColumns, columnsToBeUpdated)
> 
> .......
> 
>         for row_t1 in t1:
>             for  row_t2 in t2:
>                 if [row_t1[i] for i in keyColumns] == [row_t2[j] for j
> in keyColumns]:
>                     # the keys are the same
>                     for colName in columnsToBeUpdated:
>                         row_t1[colName] = row_t2[colName]
> 
>                     # go outside the inner loop - we found a row with
>                     # the same key in the table
>                     break
> 
> In my algorithm I have 2 for loops and I have no idea how to optimise
> it (maybe with map? )
> I call this method for very large data and the performance is a
> critical issue for me :(

Just go through the first table once and build a mapping key->row and then
go through the second table once and look for each row if the key is in
the mapping.  If yes: update columns.  This runs in O(2*rows) instead if
O(rows**2).

def update_table(table_a, table_b, key_columns, columns_to_be_updated):
    def get_key(row):
        return tuple(row[x] for x in key_columns)

    key2row = dict((get_key(row), row) for row in table_a)
    for row in table_b:
        row_to_be_updated = key2row.get(get_key(row))
        if row_to_be_updated is not None:
            for column in columns_to_be_updated:
                row_to_be_updated[column] = row[column]

def main():
    table_a = [[1, 4, 5, 7, 7],
               [3, 4, 0, 0, 0]]
    table_b = [[2, 2, 8, 8, 8],
               [1, 4, 9, 9, 9]]
    update_table(table_a, table_b, (0, 1), (2, 4))
    for row in table_a:
        print row

Ciao,
	Marc 'BlackJack' Rintsch