[Tutor] Understanding a linear runtime implementation of anagram detection

Thu Dec 10 08:45:13 EST 2015

Spyros Charonis wrote:

> Dear All,
> 
> I am learning about analysis of algorithms (python 2.7.6). I am reading a
> book (Problem solving with Algorithms and Data Structures) where Python is
> the language used for implementations. The author introduces algorithm
> analysis in a clear and understandable way, and uses an anagram detection
> program as a template to compare different runtime implementations
> (quadratic, log linear, linear). In the linear, and most efficient
> implementation, the code is as follows (comments added by me):
> 
> def anagram_test2(s1,s2):""" Checks if two strings are anagrams of each
> other
>     Runs with O(n) linear complexity """
> if (not s1) or (not s2):
>     raise TypeError, "Invalid input: input must be string"
>     return None
> # Initialize two lists of counters
> c1 = [0] * 26
> c2 = [0] * 26
> # Iterate over each string# When a char is encountered, # increment
> the counter at # its correspoding position   for i in range(len(s1)):
>     pos = ord(s1[i]) - ord("a")
>     c1[pos] += 1
> for i in range(len(s2)):
>     pos = ord(s2[i]) - ord("a")
>     c2[pos] += 1
> 
> j = 0
> hit = Truewhile j < 26 and hit:
>     if c1[j] == c2[j]:
>         j += 1
>     else:
>         hit = False
> return hit

The code above will make any Python programmer who is past the absolute 
beginner level cry. Here's a somewhat cleaned-up version:

OFFSET = ord("a")

def anagram_test3(s1, s2):
    if len(s1) != len(s2):
           return False
    freq1 = [0] * 26
    for c in s1:
        freq1[ord(c) - OFFSET] += 1
    freq2 = [0] * 26
    for c in s2:
        freq2[ord(c) - OFFSET] += 1
    return freq1 == freq2

When you look a the code you see that almost that the same code appears 
twice. Let's move it into a separate function:

OFFSET = ord("a")

def freq(s):
    freq_table = [0] * 26
    for c in s:
        freq_table[ord(c) - OFFSET] += 1
    return freq_table

def anagram_test3(s, t):
    if len(s) != len(t):
           return False
    return freq(s) == freq(t)

> My questions are:
> 
> 1)
> Is it computationally more/less/equally efficient to use an explicit while
> loop as it is to just do "return c1 === c2" (replacing the final code
> block following the two for loops). I realize that this single line of
> code performs an implicit for loop over each index to test for equality.
> My guess is that because in other languages you may not be able to do this
> simple test, the author wanted to present an example that could be adapted
> for other languages, unless the explicit while loop is less expensive
> computationally.

Yes, while the simple c1 == c2 has the same complexity it should be 
significantly faster than the loop. The equality check is also easy to read 
and understand. 

> 2)
> How could I go about adapting this algorithm for multiple strings (say I
> had 20 strings and wanted to check if they are anagrams of one another).

With the above freq() function you can calculate the frequency table for the 
first string and then loop over the rest and stop when you encounter one 
that produces a different frequency table.

Somewhat more involved: You might calculate frequency tables for every 
string and then collect them in a dict. For that to work you have to convert 
the lists into something hashable -- a tuple will be the obvious choice. The 
final table will look like this

{
(1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
0): ['jeans'], 
(1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 
0): ['stingray', 'straying'], 
(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 
0): ['parleys', 'parsley', 'players', 'replays', 'sparely']}

and to find the anagrams you have to pick the lists containing more than one 
word.