How to compute a delta: the difference between lists of strings

Chris Angelico rosuav at gmail.com
Sat May 5 08:39:59 EDT 2012


On Sat, May 5, 2012 at 10:12 PM, J. Mwebaze <jmwebaze at gmail.com> wrote:
> This is out of curiosity, i know this can be done with python diffllib
> module, but been figuring out how to compute the delta, Consider two lists
> below.
>
> s1 = ['e', 'f', 'g', 'A', 'B', 'C', 'D', 'C']
> s2 =['e', 'A', 'B', 'f', 'g', 'C', 'D', 'z']
>
> This is the result should be
>
> ['  e', '+ A', '+ B', '  f', '  g', '- A', '- B', '  C', '  D', '- C', '+
> z']
>
> ideas on how to approach this.. ?

Here's a simple algorithm that produces sorta-mostly-reasonable
results most of the time:

Set your current-position-index to the beginning of each lists. (Call
them 'pos1' and 'pos2'.)
If s1[pos1] and s2[pos2] are identical, match - increment and iterate.
Otherwise, increment pos2 until either it matches pos1 or you reach
the end of s2.
If you find a match, report the insertion into s2, increment both
pointers past the match, and carry on.
If you hit the end of s2, increment pos1 once and report an insertion
into s1, then go back to looking for a match.

It helps to append a sentinel to each list; that way, you don't need
to separately check for additional content at the end of either list
(as the sentinel won't match any of the strings).

This is the algorithm I used for writing a simple file diff tool ages
ago. It's not as good as diff(1), but it was fun to do the exercise.
It's quite inefficient at handling large insertions into s1, and needs
to be modified for most file handling (for instance, requiring a
2-line or 3-line rematch after a difference, to avoid matching on
blank lines), but it's a basis.

Producing beautiful or minimal diffs is a more complex task. :)

ChrisA



More information about the Python-list mailing list