[Tutor] longest common substring

Andreas Perstinger andreas.perstinger at gmx.net
Sun Nov 13 23:31:32 CET 2011


On 2011-11-11 16:53, Jerry Hill wrote:
> There's nothing wrong with writing your own code to find the longest common
> substring, but are you aware that python has a module in the standard
> library that already does this?  In the difflib module, the SequenceMatcher
> class can compare two sequences and extract the longest common sequence of
> elements from it, like this:

Thanks for the tip. I've played around with it, but I think it doesn't 
help in the OP's situation. "SequenceMatcher.find_longest_match()" just 
finds the first common block:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import difflib
 >>> first = [0, 1, 2, 3, 0, 4, 5, 6, 0]
 >>> second = [1, 2, 3, 4, 5, 6]
 >>> match = difflib.SequenceMatcher(None, first, second)
 >>> match.find_longest_match(0, len(first), 0, len(second))
Match(a=1, b=0, size=3)

Here it returns just [1, 2, 3] but misses [4, 5, 6]. So you would have 
to adjust the lower limits to get it. 
"SequenceMatcher.get_matching_blocks()" seems to be a better choice:

 >>> match.get_matching_blocks()
[Match(a=1, b=0, size=3), Match(a=5, b=3, size=3), Match(a=9, b=6, size=0)]

Now you get [1, 2, 3] and [4, 5, 6]. But if the two blocks are in the 
reversed order, there is no longest common subsequence [1, 2, 3, 4, 5, 
6] any more and "SequenceMatcher" only finds one part (apparently it 
chooses the first it comes across in the first list if both have the 
same length):

 >>> first = [0, 1, 2, 3, 0, 4, 5, 6, 0]
 >>> second = [4, 5, 6, 1, 2, 3]
 >>> match = difflib.SequenceMatcher(None, first, second)
 >>> match.find_longest_match(0, len(first), 0, len(second))
Match(a=1, b=3, size=3)
 >>> match.get_matching_blocks()
[Match(a=1, b=3, size=3), Match(a=9, b=6, size=0)]

 From both methods you get [1, 2, 3].

As I've learnt during this tests, there is a difference between 
subsequences and substrings:
http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence

If I've understood the OP right, he/she wants to find all common 
substrings with a minimum length regardless of their order in the strings.

Bye, Andreas


More information about the Tutor mailing list