[Tutor] Understanding a linear runtime implementation of anagram detection
Peter Otten
__peter__ at web.de
Thu Dec 10 08:45:13 EST 2015
Spyros Charonis wrote:
> Dear All,
>
> I am learning about analysis of algorithms (python 2.7.6). I am reading a
> book (Problem solving with Algorithms and Data Structures) where Python is
> the language used for implementations. The author introduces algorithm
> analysis in a clear and understandable way, and uses an anagram detection
> program as a template to compare different runtime implementations
> (quadratic, log linear, linear). In the linear, and most efficient
> implementation, the code is as follows (comments added by me):
>
> def anagram_test2(s1,s2):""" Checks if two strings are anagrams of each
> other
> Runs with O(n) linear complexity """
> if (not s1) or (not s2):
> raise TypeError, "Invalid input: input must be string"
> return None
> # Initialize two lists of counters
> c1 = [0] * 26
> c2 = [0] * 26
> # Iterate over each string# When a char is encountered, # increment
> the counter at # its correspoding position for i in range(len(s1)):
> pos = ord(s1[i]) - ord("a")
> c1[pos] += 1
> for i in range(len(s2)):
> pos = ord(s2[i]) - ord("a")
> c2[pos] += 1
>
> j = 0
> hit = Truewhile j < 26 and hit:
> if c1[j] == c2[j]:
> j += 1
> else:
> hit = False
> return hit
The code above will make any Python programmer who is past the absolute
beginner level cry. Here's a somewhat cleaned-up version:
OFFSET = ord("a")
def anagram_test3(s1, s2):
if len(s1) != len(s2):
return False
freq1 = [0] * 26
for c in s1:
freq1[ord(c) - OFFSET] += 1
freq2 = [0] * 26
for c in s2:
freq2[ord(c) - OFFSET] += 1
return freq1 == freq2
When you look a the code you see that almost that the same code appears
twice. Let's move it into a separate function:
OFFSET = ord("a")
def freq(s):
freq_table = [0] * 26
for c in s:
freq_table[ord(c) - OFFSET] += 1
return freq_table
def anagram_test3(s, t):
if len(s) != len(t):
return False
return freq(s) == freq(t)
> My questions are:
>
> 1)
> Is it computationally more/less/equally efficient to use an explicit while
> loop as it is to just do "return c1 === c2" (replacing the final code
> block following the two for loops). I realize that this single line of
> code performs an implicit for loop over each index to test for equality.
> My guess is that because in other languages you may not be able to do this
> simple test, the author wanted to present an example that could be adapted
> for other languages, unless the explicit while loop is less expensive
> computationally.
Yes, while the simple c1 == c2 has the same complexity it should be
significantly faster than the loop. The equality check is also easy to read
and understand.
> 2)
> How could I go about adapting this algorithm for multiple strings (say I
> had 20 strings and wanted to check if they are anagrams of one another).
With the above freq() function you can calculate the frequency table for the
first string and then loop over the rest and stop when you encounter one
that produces a different frequency table.
Somewhat more involved: You might calculate frequency tables for every
string and then collect them in a dict. For that to work you have to convert
the lists into something hashable -- a tuple will be the obvious choice. The
final table will look like this
{
(1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0): ['jeans'],
(1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0): ['stingray', 'straying'],
(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
0): ['parleys', 'parsley', 'players', 'replays', 'sparely']}
and to find the anagrams you have to pick the lists containing more than one
word.
More information about the Tutor
mailing list