Help with Pyrex: how to declare char arrays?

John Machin sjmachin at lexicon.net
Mon Dec 30 17:17:59 EST 2002


Tim Churches <tchur at optushome.com.au> wrote in message news:<mailman.1041202716.17134.python-list at python.org>...
[snip]
> the following Pyrex programme? As it is, Pyrex converts the code to a C
> extension module successfully, but it runs slightly slower than the
> equivalent Python code, 

You have an algorithm that is O(n**2) overall and O(n) in calls to
basic Python functions -- this result is not unexpected.

> 
> #########################################
> def jaro(str1, str2):
[snip] 
>   len1 = len(str1)
>   len2 = len(str2)
>   halflen = max(len1, len2) / 2 + 1
> 
1. As far as I can tell from the papers that I have read that describe
Jaro's string comparator, you need min(), not max() ... e.g. Porter &
Winkler, about 2/3 down page 2: "The definition of common is that the
agreeing character must be within 1/2 the length of the shorter
string".

AND what they mean is (I believe) that the range searched should be
half the length of the shorter string -- i.e. start/end below should
be O(minus/plus a QUARTER of the shorter length). In this way, the
commonality of 't' in 'alphabet' and 'triumph' would be ignored. You
wouldn't want 'foobar' and 'raboof' to crack a score of 1.0, would
you? Have you seen actual source code from the U.S. Census folk? (I
haven't).

2. You may like to consider // instead of / in the above.

[snip]
>   # Analyse the first string 
>   #
>   for i in range(len1):
>     start = max(0,i-halflen)
>     end   = min(i+halflen+1,len2)
>     index = workstr2.find(str1[i],start,end)
>     if (index > -1):  # Found common character
>       common1 = common1 + 1
>       ass1 = ass1+str1[i]
>       workstr2 = workstr2[:index]+'*'+workstr2[index+1:]

This isn't helping speed much. Consider implementing workstr2 as an
array.array('c', ...). Then you can say workstr2[index] = '*'. However
there is no find method, you would have to use the index method and
trap the exception when the sought character is not found. Might be
slower.

You might like to define
NO_SUCH_CHAR = '*'.
You might like to solve the problem of all chars being possible by
using an already_found vector instead of the mutable workstr2 vector
approach. Might be faster.

HTH,
John



More information about the Python-list mailing list