[Tutor] Iterating through a list of replacement regex patterns
Steven D'Aprano
steve at pearwood.info
Sat Sep 4 03:12:27 CEST 2010
On Fri, 3 Sep 2010 12:24:00 pm David Hutto wrote:
> In the below function I'm trying to iterate over the lines in a
> textfile, and try to match with a regex pattern that iterates over
> the lines in a dictionary(not {}, but turns a text list of
> alphabetical words into a list using readlines()).
>
> def regexfiles(filename):
[...]
Your function does too much. It:
1 manages a file opened for input and output;
2 manages a dictionary file;
3 handles multiple searches at once (findall and search);
4 handles print output;
and you want it to do more:
5 handle multiple "select" terms.
Your function is also poorly defined -- I've read your description
repeatedly, and studied the code you give, and your three follow up
posts, and I still don't have the foggiest clue of what it's supposed
to accomplish! You do two searches on each iteration, but other than
print the results, you don't do anything with it.
What is the *meaning* of the function? "regexfiles" is a meaningless
name, and your description "I'm trying to iterate over the lines in a
textfile, and try to match with a regex pattern that iterates over the
lines in a dictionary" is confusing -- the first part is fine, but what
do you mean by a regex that iterates over the lines in a dictionary?
What is the purpose of a numeric variable called "search"? It looks like
a counter to me, not a search, it is incremented each time through the
loop. The right way to do that is with enumerate(), not a manual loop
variable.
Why do you call readlines() instead of read()? This makes no sense to
me. You then convert a list of strings into a single string like this:
readlines() returns ['a\n', 'b\n', 'c\n']
calling str() returns "['a\n', 'b\n', 'c\n']"
but the result includes a lot of noise that weren't in the original
file: open and close square brackets, commas, quotation marks.
I think perhaps you want ''.join(readlines()) instead, but even that is
silly, because you should just call read() and get 'a\nb\nc\n'.
You should split this up into multiple functions. It will make
comprehension, readability, debugging and testing all much, much
easier. Code should be self-documenting -- ideally you will never need
to write a comment, because what the code does should be obvious from
your choice of function and variable names.
I don't understand why you're using regex here. If I'm guessing
correctly, the entries in the dictionary are all ordinary (non-regex)
strings, like:
ape
cat
dog
only many, many more words :)
Given that, using the re module is like using a bulldozer to crack a
peanut, and about as efficient. Instead of re.search(target, text) you
should just use text.find(target). There's no built-in equivalent to
re.findall, but it's not hard to write one:
def findall(text, target):
results = []
start = 0
p = text.find(target)
while p != -1:
results.append(p)
p = text.find(target)
return results
(This returns a list of starting positions rather than words. It's easy
to modify to do the other -- just append target instead of p.)
Anyway, here's my attempt to re-write your function. I've stuck to
regexes just in case, and there's lots of guess-work here, because I
don't understand what you're trying to accomplish, but here goes
nothing:
# Change this to use real English *wink*
DEFAULT_DICT = '/var/www/main/american-english'
def get_dict_words(filename=''):
"""Return a list of words from the given dictionary file.
If not given, a default dictionary is used.
The format of the file should be one word per line.
"""
if not filename:
filename = DEFAULT_DICT
# Don't use file, use open.
dictionary = open(filename)
words = dictionary.readlines()
# Best practice is to explicitly close files when done.
dictionary.close()
return words
def search(target, text):
"""Return the result of two different searches for target in text.
target should be a regular expression or regex object; text should
be a string.
Result returned is a two-tuple:
(list of matches, regex match object or None)
"""
if isinstance(target, str):
target = re.compile(target)
a = target.findall(text)
b = target.search(text)
return (a, b)
def print_stuff(i, target, text, a, b):
# Boring helper function to do printing.
print "counter = %d" % i
print "target = %s" % target
print "text = %s" % text
print "findall = %s" % a
print "search = %s" % b
print
def main(text):
"""Print the results of many regex searches on text."""
for i, word in get_dict_words():
a, b, = search(word, text)
print_stuff(i, word, text, a, b)
# Now call it:
fp = open(filename)
text = fp.read()
fp.close()
main(text)
I *think* this should give the same results you got.
--
Steven D'Aprano
More information about the Tutor
mailing list