[Tutor] longest common substring
Andreas Perstinger
andreas.perstinger at gmx.net
Fri Nov 11 14:10:37 CET 2011
On 2011-11-11 05:14, lina wrote:
> def LongestCommonSubstring(S1, S2):
> M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))] ## creat 4*5 matrix
> longest, x_longest = 0, 0
> for x in xrange(1,1+len(S1)): ## read each row
> for y in xrange(1,1+len(S2)): ## read each coloumn
> if S1[x-1] == S2[y-1]:
> M[x][y] = M[x-1][y-1]+1
> if M[x][y]> longest:
> longest = M[x][y]
> x_longest = x
> else:
> M[x][y] = 0
> return S1[x_longest-longest:x_longest]
That's still not the right version.
If you compare your version to the one at wikibooks (
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python
), you'll see that the else-branch is wrongly indented (one level too
deep). It belongs to the first if-comparison:
if S1 ...
M[x][y] ...
if M[x][y] ...
...
else: ...
> if __name__=="__main__":
>
> a=open("atom-pair_4.txt","r").readline().strip()
>
> b=open("atom-pair_8.txt","r").readline().strip()
>
> print(LongestCommonSubstring(LongestCommonSubstring(a,a),LongestCommonSubstring(b,b)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
??? What do you try to accomplish here ???
You call "LongestCommonSubstring" with identical strings, thus the
result must be the same string.
Why not
print(LongestCommonSubstring(a, b))
as you did the line below with "c" and "d"?
Further more I think that your problems start with bad data files. In
every file there is just one very long line which looks like a string
representation of a list of two-digits strings. This complicates further
processing because you have to deal with all the unnecessary commas,
blanks and single quotes between your numbers and the square brackets at
the beginning and the end of the line.
> $ python3 LongestCommonSubstring.py
> 2189
> ['
> ['82']
>
> The results are wrong.
> c, d are the string from file atom-pair_4,txt, exactly the same as a,
> d is the same as b.
>
> and even for (c,d) results are not correct, visually we can see some
> similar groups, not mention the longest groups.
And even if you use the correct function from wikibooks I can anticipate
another problem :-)
The implementation from wikibooks just returns the first common
substring which it finds in the first string:
>>> WikibooksLongestCommonSubstring("ABAB","BABA")
'ABA'
>>> WikibooksLongestCommonSubstring("BABA", "ABAB")
'BAB'
If there are more possible substrings with the same length (as in the
example above) only the first one is returned.
But in your example there are at least two different pathways (I've
found three) which have the same length, as changing the order of the
parameters will show you:
>>> WikibooksLongestCommonSubstring(c, d)
['61', '70', '61']
>>> WikibooksLongestCommonSubstring(d, c)
['83', '61', '83']
Bye, Andreas
More information about the Tutor
mailing list