[Tutor] List and dictionary comprehensions

Mon Sep 29 16:14:51 CEST 2014

Armindo Rodrigues wrote:

> Hi everyone,
> 
> This is my first post so I don't know if I am asking the correct way so
> let me know if I messed anything up.
> 
> ***Please note. My code contains a list of quotes that has many lines. I
> have noted the beginning and end of the quotes list so you can easily skip
> and go straight to the code section. ***
> 
> 
> This is technically NOT a homework assignment. I am the teaching assistant
> for the Python course at my school. The teacher doesn't give me the
> homework ahead of time so I typically try the homework myself so I can
> help anyone else out that may be confused. The assignment has come and
> gone but I want to challenge myself with making this thing more efficient
> and learn comprehensions along the way.
> 
> The assignment was as follows:
> The teacher provided the class with a list of quotes called data_list. We
> were to implement a very simple search algorithm that took in a user's
> query and we searched based on those words. If the user entered an OR then
> we searched for quotes that contained either of the words. Any other
> combination of AND OR will be an AND search.
> 
> Once we completed the search algorithm, the assignment called for
> pre-processing that would make the search more efficient and faster. I
> created a dictionary based on each word in the quotes list as the key and
> then searched against those words. I greatly increased the search time.

That is very unlikely. Note that you include the print() calls into your 
measurement. These will dominate the time you measure, i. e. searches with 
more matching text will appear to take longer and searches with little or no 
text will appear to be fast.

> MY QUESTION:
> Can anyone look at this and explain how I could create a list
> comprehension and a dictionary comprehension if possible? 

I tried to solve the problem myself below -- and did not find a place where 
list comprehensions would fit in naturally.

> Also any
> suggestions with making the search faster would be appreciated.

As hinted above, your code does not have speed issues; you can simplify and 
clean it a bit (I stopped halfway), put every separate step into a function 
with an informative name, avoid global variables, follow the naming 
conventions of PEP 8 <http://python.org/dev/peps/pep-0008/> -- and that's 
it.

My most significant modifications:

- Use the strings directly instead of indices into the list
- Build the word-to-quote lookup directly without the intermediate list of
  all words.

#!/usr/bin/env python3
import re
import time
from itertools import zip_longest
from sys import exit

data_list = [...] # won't repeat your data

RED = "\033[31m"
BLUE = "\033[34m"

EMPTY = frozenset()

def colored(s, color):
    # if the coloring attempt messes up your screen
    # change this to
    # return s
    return "%s%s\033[0m" % (color, s)

def find(term):
    return quotes_dict.get(term.lower(), EMPTY)

def mark_terms(phrase, terms):
    def mark(m):
        return colored(m.group(), RED)

    expr = r"\b(%s)\b" % "|".join(terms)
    return re.compile(expr).sub(mark, phrase)

def clean(word):
    return re.sub('^[^a-zA-z]*|[^a-zA-Z]*$', '', word)

quotes_dict = {}
for item in data_list:
    for word in item.split():
        quotes_dict.setdefault(clean(word), set()).add(item)

query = input("query: ")

start_time = time.time()

query = query.split()
if not query:
    exit("no search term")

terms = query[::2]
found_quotes = find(query[0])
for operator, term in zip_longest(query[1::2], query[2::2]):
    if term is None:
        exit("dangling operator " + operator)
    operator = operator.lower()
    if operator == "or":
        found_quotes |= find(term)
    elif operator == "and":
        found_quotes &= find(term)
    else:
        print("unknown operator", operator)
        exit(1)

print(
    "\n--- Execution ---\n",
    (time.time() - start_time) * 1000, "microseconds\n")

for index, quote in enumerate(found_quotes):
    print(colored("{} >>>".format(index), BLUE), mark_terms(quote, terms))

Note that my handling of the operators does not follow your teacher's spec. 
It simply `or`s or `and`s the next match to what is in the current set of 
matches.