[Tutor] longest common substring
Andreas Perstinger
andreas.perstinger at gmx.net
Sun Nov 13 23:31:32 CET 2011
On 2011-11-11 16:53, Jerry Hill wrote:
> There's nothing wrong with writing your own code to find the longest common
> substring, but are you aware that python has a module in the standard
> library that already does this? In the difflib module, the SequenceMatcher
> class can compare two sequences and extract the longest common sequence of
> elements from it, like this:
Thanks for the tip. I've played around with it, but I think it doesn't
help in the OP's situation. "SequenceMatcher.find_longest_match()" just
finds the first common block:
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> first = [0, 1, 2, 3, 0, 4, 5, 6, 0]
>>> second = [1, 2, 3, 4, 5, 6]
>>> match = difflib.SequenceMatcher(None, first, second)
>>> match.find_longest_match(0, len(first), 0, len(second))
Match(a=1, b=0, size=3)
Here it returns just [1, 2, 3] but misses [4, 5, 6]. So you would have
to adjust the lower limits to get it.
"SequenceMatcher.get_matching_blocks()" seems to be a better choice:
>>> match.get_matching_blocks()
[Match(a=1, b=0, size=3), Match(a=5, b=3, size=3), Match(a=9, b=6, size=0)]
Now you get [1, 2, 3] and [4, 5, 6]. But if the two blocks are in the
reversed order, there is no longest common subsequence [1, 2, 3, 4, 5,
6] any more and "SequenceMatcher" only finds one part (apparently it
chooses the first it comes across in the first list if both have the
same length):
>>> first = [0, 1, 2, 3, 0, 4, 5, 6, 0]
>>> second = [4, 5, 6, 1, 2, 3]
>>> match = difflib.SequenceMatcher(None, first, second)
>>> match.find_longest_match(0, len(first), 0, len(second))
Match(a=1, b=3, size=3)
>>> match.get_matching_blocks()
[Match(a=1, b=3, size=3), Match(a=9, b=6, size=0)]
From both methods you get [1, 2, 3].
As I've learnt during this tests, there is a difference between
subsequences and substrings:
http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence
If I've understood the OP right, he/she wants to find all common
substrings with a minimum length regardless of their order in the strings.
Bye, Andreas
More information about the Tutor
mailing list