difflib and intelligent file differences

Thu Mar 26 11:35:26 EDT 2009

If the lines are really sorted, all you need is a merge: read one line 
from each source, and if the two are equal, read another from each.  If 
one line is less than the other, output the lesser line with the 
appropriate tag, and refresh that one from its source.  Stop when either 
source has run out, then flush the rest of the other source to the 
output, with the appropriate tag.

Time is linear in the total number of lines, and memory use is negligible, 
since only one line from each file is held at a time.
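
A minimal sketch of that merge, in case it helps.  The function name 
merge_diff and the "< " / "> " tags are my own choices for illustration, 
not anything from difflib.  It assumes both inputs are sorted in the same 
order that Python's string comparison uses (for example, sorted with 
LC_ALL=C), and it strips the trailing newline before comparing.

```python
def merge_diff(src_a, src_b, tag_a="< ", tag_b="> "):
    """Yield tagged lines that occur in only one of two sorted line sources.

    src_a and src_b are any iterables of lines (open files work).
    The tags are hypothetical markers chosen for this sketch.
    """
    # Strip newlines so a final line without one still compares equal.
    iter_a = (line.rstrip("\n") for line in src_a)
    iter_b = (line.rstrip("\n") for line in src_b)
    a = next(iter_a, None)
    b = next(iter_b, None)

    # Advance whichever side holds the lesser line; advance both on a match.
    while a is not None and b is not None:
        if a == b:
            a = next(iter_a, None)
            b = next(iter_b, None)
        elif a < b:
            yield tag_a + a
            a = next(iter_a, None)
        else:
            yield tag_b + b
            b = next(iter_b, None)

    # One source has run out: flush whatever remains of the other.
    while a is not None:
        yield tag_a + a
        a = next(iter_a, None)
    while b is not None:
        yield tag_b + b
        b = next(iter_b, None)
```

Called with two open file objects, it holds only the current line from 
each file, so it can be looped over and printed without building any 
sets or lists.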

Marco Mariani wrote:
>
>
> You can adapt and use this, provided the files are already sorted. 
> Memory usage scales linearly with the size of the file difference, and 
> time scales linearly with file sizes.
>
>
>> #!/usr/bin/env python
>>
>> import sys
>>
>>
>> def run(fname_a, fname_b):
>>     # open() replaces the Python 2-only file() builtin
>>     filea = open(fname_a)
>>     fileb = open(fname_b)
>>     a_lines = set()
>>     b_lines = set()
>>
>>     while True:
>>         a = filea.readline()
>>         b = fileb.readline()
>>         if not (a or b):
>>             break
>>
>>         if a == b:
>>             continue
>>
>>         if a in b_lines:
>>             b_lines.remove(a)
>>         elif a:
>>             a_lines.add(a)
>>
>>         if b in a_lines:
>>             a_lines.remove(b)
>>         elif b:
>>             b_lines.add(b)
>>
>>
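>>     # Sets are unordered, so sort for stable output; strip the newline
>>     # each line still carries to avoid double spacing.
>>     for line in sorted(a_lines):
>>         print(line.rstrip('\n'))
>>
>>     if a_lines or b_lines:
>>         print('')
>>         print('***************')
>>         print('')
>>
>>     for line in sorted(b_lines):
>>         print(line.rstrip('\n'))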
>>
>>
>> if __name__ == '__main__':
>>     run(sys.argv[1], sys.argv[2])
>>
>
>