[Tutor] Iterating through a list of replacement regex patterns

Sat Sep 4 03:12:27 CEST 2010

On Fri, 3 Sep 2010 12:24:00 pm David Hutto wrote:
> In the below function I'm trying to iterate over the lines in a
> textfile, and try to match with a regex pattern that iterates over
> the lines in a dictionary(not {}, but turns a text list of
> alphabetical words into a list using readlines()).
>
> def regexfiles(filename):
[...]

Your function does too much. It:

1 manages a file opened for input and output;
2 manages a dictionary file;
3 handles multiple searches at once (findall and search);
4 handles print output;

and you want it to do more:

5 handle multiple "select" terms.

Your function is also poorly defined -- I've read your description 
repeatedly, and studied the code you give, and your three follow up 
posts, and I still don't have the foggiest clue of what it's supposed 
to accomplish! You do two searches on each iteration, but other than 
print the results, you don't do anything with it.

What is the *meaning* of the function? "regexfiles" is a meaningless 
name, and your description "I'm trying to iterate over the lines in a 
textfile, and try to match with a regex pattern that iterates over the 
lines in a dictionary" is confusing -- the first part is fine, but what 
do you mean by a regex that iterates over the lines in a dictionary?

What is the purpose of a numeric variable called "search"? It looks like 
a counter to me, not a search, it is incremented each time through the 
loop. The right way to do that is with enumerate(), not a manual loop 
variable.

Why do you call readlines() instead of read()? This makes no sense to 
me. You then convert a list of strings into a single string like this:

readlines() returns ['a\n', 'b\n', 'c\n']
calling str() returns "['a\n', 'b\n', 'c\n']"

but the result includes a lot of noise that weren't in the original 
file: open and close square brackets, commas, quotation marks.

I think perhaps you want ''.join(readlines()) instead, but even that is 
silly, because you should just call read() and get 'a\nb\nc\n'.

You should split this up into multiple functions. It will make 
comprehension, readability, debugging and testing all much, much 
easier. Code should be self-documenting -- ideally you will never need 
to write a comment, because what the code does should be obvious from 
your choice of function and variable names.

I don't understand why you're using regex here. If I'm guessing 
correctly, the entries in the dictionary are all ordinary (non-regex) 
strings, like:

ape
cat
dog

only many, many more words :)

Given that, using the re module is like using a bulldozer to crack a 
peanut, and about as efficient. Instead of re.search(target, text) you 
should just use text.find(target). There's no built-in equivalent to 
re.findall, but it's not hard to write one:

def findall(text, target):
    results = []
    start = 0
    p = text.find(target)
    while p != -1:
        results.append(p)
        p = text.find(target)
    return results

(This returns a list of starting positions rather than words. It's easy 
to modify to do the other -- just append target instead of p.)

Anyway, here's my attempt to re-write your function. I've stuck to 
regexes just in case, and there's lots of guess-work here, because I 
don't understand what you're trying to accomplish, but here goes 
nothing:

# Change this to use real English  *wink*
DEFAULT_DICT =  '/var/www/main/american-english'

def get_dict_words(filename=''):
    """Return a list of words from the given dictionary file.

    If not given, a default dictionary is used.
    The format of the file should be one word per line.
    """
    if not filename:
        filename = DEFAULT_DICT
    # Don't use file, use open.
    dictionary = open(filename)
    words = dictionary.readlines()
    # Best practice is to explicitly close files when done.
    dictionary.close()
    return words

def search(target, text):
    """Return the result of two different searches for target in text.

    target should be a regular expression or regex object; text should
    be a string.

    Result returned is a two-tuple:
    (list of matches, regex match object or None)
    """
    if isinstance(target, str):
        target = re.compile(target)
    a = target.findall(text)
    b = target.search(text)
    return (a, b)

def print_stuff(i, target, text, a, b):
    # Boring helper function to do printing.
    print "counter = %d" % i
    print "target = %s" % target
    print "text = %s" % text
    print "findall = %s" % a
    print "search = %s" % b
    print

def main(text):
    """Print the results of many regex searches on text."""
    for i, word in get_dict_words():
        a, b, = search(word, text)
        print_stuff(i, word, text, a, b)

# Now call it:
fp = open(filename)
text = fp.read()
fp.close()
main(text)

I *think* this should give the same results you got.

-- 
Steven D'Aprano