how to find difference in number of characters

Peter Otten __peter__ at web.de
Sat Oct 9 05:45:30 EDT 2010


harryos wrote:

> I am trying to write a compare method which takes two strings and find
> how many characters have changed.
> 
> 
> def compare_strings(s1,s2):
>     pass
> 
> 
> text1="goat milk"
> text2="cow milk"
> print compare_strings(text1,text2)
> 
> This must give 3 ,since 3 characters are changed between strings.I was
> advised to use levenshtein algorithm ..but then the matrix ops take a
> long time for nontrivial strings of say 20000 characters ..Can this

I tried it with a string of 20000 chars and the python-levenshtein package 
that comes with ubuntu. It took about one second to calculate the distance:

import functools
import random
import string
import time

from Levenshtein import distance

def make_strings(n, delete, insert, swap, replace, 
charset=string.ascii_letters):
    def gen_chars():
        while True:
            yield random.choice(charset)
    chars = gen_chars()
    a = [next(chars) for i in xrange(n)]
    s = "".join(a)
    for i in range(delete):
        del a[random.randrange(len(a))]
    for i in range(insert):
        a.insert(random.randrange(len(a)+1), next(chars))
    for i in range(swap):
        p = random.randrange(len(a)-1)
        a[p], a[p+1] = a[p+1], a[p]
    for i in range(replace):
        a[random.randrange(len(a))] = next(chars)
    t = "".join(a)
    return s, t

N = 20000
M = 100
ms = functools.partial(make_strings, N, M, M, M, M)

def measure(f, *args):
    start = time.time()
    try:
        return f(*args)
    finally:
        print time.time() - start

if __name__ == "__main__":
    import sys
    args = sys.argv[1:]
    if args:
        N, M = map(int, args)

    s, t = make_strings(N, M, M, M, M)
    print measure(distance, s, t)

$ python levenshtein_demo.py 10000 1000
0.225363969803
3644
$ python levenshtein_demo.py 20000 1000
1.05217313766
4197
$ python levenshtein_demo.py 30000 1000
2.38736391068
4390
$ python levenshtein_demo.py 40000 1000
4.1686527729
4558

What would be an acceptable time?

Peter



More information about the Python-list mailing list