[Tutor] longest common substring

Fri Nov 11 14:10:37 CET 2011

On 2011-11-11 05:14, lina wrote:
> def LongestCommonSubstring(S1, S2):
>      M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))] ## creat 4*5 matrix
>      longest, x_longest = 0, 0
>      for x in xrange(1,1+len(S1)):                 ## read each row
>          for y in xrange(1,1+len(S2)):             ## read each coloumn
>              if S1[x-1] == S2[y-1]:
>                  M[x][y] = M[x-1][y-1]+1
>                  if M[x][y]>  longest:
>                      longest = M[x][y]
>                      x_longest = x
>                  else:
>                      M[x][y] = 0
>      return S1[x_longest-longest:x_longest]

That's still not the right version.

If you compare your version to the one at wikibooks ( 
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python 
), you'll see that the else-branch is wrongly indented (one level too 
deep). It belongs to the first if-comparison:

if S1 ...
      M[x][y] ...
      if M[x][y] ...
	...
else: ...

> if __name__=="__main__":
>
>      a=open("atom-pair_4.txt","r").readline().strip()
>
>      b=open("atom-pair_8.txt","r").readline().strip()
>
>      print(LongestCommonSubstring(LongestCommonSubstring(a,a),LongestCommonSubstring(b,b)))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
??? What do you try to accomplish here ???
You call "LongestCommonSubstring" with identical strings, thus the 
result must be the same string.
Why not

print(LongestCommonSubstring(a, b))

as you did the line below with "c" and "d"?

Further more I think that your problems start with bad data files. In 
every file there is just one very long line which looks like a string 
representation of a list of two-digits strings. This complicates further 
processing because you have to deal with all the unnecessary commas, 
blanks and single quotes between your numbers and the square brackets at 
the beginning and the end of the line.

>   $ python3 LongestCommonSubstring.py
> 2189
> ['
> ['82']
>
> The results are wrong.
> c, d are the string from file atom-pair_4,txt, exactly the same as a,
> d is the same as b.
>
> and even for (c,d) results are not correct, visually we can see some
> similar groups, not mention the longest groups.

And even if you use the correct function from wikibooks I can anticipate 
another problem :-)
The implementation from wikibooks just returns the first common 
substring which it finds in the first string:

 >>> WikibooksLongestCommonSubstring("ABAB","BABA")
'ABA'
 >>> WikibooksLongestCommonSubstring("BABA", "ABAB")
'BAB'

If there are more possible substrings with the same length (as in the 
example above) only the first one is returned.

But in your example there are at least two different pathways (I've 
found three) which have the same length, as changing the order of the 
parameters will show you:

 >>> WikibooksLongestCommonSubstring(c, d)
['61', '70', '61']
 >>> WikibooksLongestCommonSubstring(d, c)
['83', '61', '83']

Bye, Andreas