[Tutor] Using Python and Regex
Peter Otten
__peter__ at web.de
Mon Aug 11 14:03:24 CEST 2014
Bill wrote:
> Thanks for your reply. This was my first attempt; when running it through
> IDLE I get the following error:-
>
>
> Traceback (most recent call last):
>   File "C:\Users\Bill\Desktop\TXT_Output\email_extraction_script.py", line 27, in <module>
>     traverse_dirs(working_dir)
>   File "C:\Users\Bill\Desktop\TXT_Output\email_extraction_script.py", line 20, in traverse_dirs
>     if match:
> UnboundLocalError: local variable 'match' referenced before assignment
>
> My code is as follows:
>     for line in lines:
>         match = re.search(r"\b[^\<][A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}[^\>]\b", line)
>     if match:
>         print(match.group(0))
>         otext = match.group(0) + ",\n"
>         output_file.write(otext)
The indentation of 'if match' is wrong; the way you wrote it, the line will
be executed once after the for loop has finished, but you want it inside the
loop.
You are lucky that the first file you encountered was empty and thus the
match variable was never set ;) Otherwise the script would have run without
complaint, checking only the last line of each file, and the error would
have been harder to find.
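To make the difference concrete, here is a minimal sketch of both variants. The EMAIL pattern is deliberately simplified and the function names are made up for the illustration; neither is taken from your script:

```python
import re

# Simplified pattern for illustration only; not the one from the original script.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def find_emails(lines):
    """Check every line: 'if match' is indented under the for loop."""
    found = []
    for line in lines:
        match = EMAIL.search(line)
        if match:  # runs once per line
            found.append(match.group(0))
    return found


def buggy_last_only(lines):
    """Same loop with 'if match' dedented: it runs once, after the loop."""
    for line in lines:
        match = EMAIL.search(line)
    if match:  # UnboundLocalError if lines is empty; else sees only the last line
        return [match.group(0)]
    return []
```

With an empty list of lines the second function raises the UnboundLocalError you saw; with a non-empty list it quietly looks at the last line only.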
Random remarks:
> def traverse_dirs(wdir):
>     grabline = 0
>     for f in os.listdir('.'):
The listdir() argument should probably be wdir instead of '.'.
>         if os.path.isfile(f) == True:
The idiomatic way to spell this is
    if os.path.isfile(f):
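One thing to watch once you pass wdir to listdir(): it returns bare names, so they have to be joined with the directory before isfile() or open() can find them. A rough sketch, where list_files is a hypothetical helper name and not part of your script:

```python
import os


def list_files(wdir):
    """Return the full paths of the regular files directly inside wdir."""
    paths = []
    for name in os.listdir(wdir):
        # os.listdir() yields bare names, so join them with the directory
        f = os.path.join(wdir, name)
        if os.path.isfile(f):  # isfile() already returns a bool
            paths.append(f)
    return sorted(paths)
```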
>             content = open(f)
>             lines = content.readlines()
>             for line in lines:
The readlines() call will put the whole file into a potentially huge list.
You don't need to do this for your application. Instead iterate over the
file directly:
    content = open(f)
    for line in content:
That keeps memory consumption low and the data processing can start
immediately.
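Put together, that could look roughly like the sketch below. The with statement is my addition (it closes the file for you when the block ends), and the simplified EMAIL pattern and the name iter_emails are assumptions made for the example:

```python
import re

# Simplified pattern for illustration only; not the one from the original script.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def iter_emails(path):
    """Yield the first address on each matching line, reading one line at a time."""
    with open(path) as content:  # file is closed when the block ends
        for line in content:  # no readlines(): only one line is held in memory
            match = EMAIL.search(line)
            if match:
                yield match.group(0)
```

Because this is a generator, the caller can start writing results while the file is still being read.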
PS: The way you wrote it, your program will process only a single directory.
If you want to look into subdirectories, you should read up on os.walk() as
already suggested. You will end up with something like
    for path, dirs, files in os.walk(wdir):
        for name in files:
            f = os.path.join(path, name)
            content = open(f)
            ...
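If you prefer to keep the directory traversal separate from the matching, one possible arrangement is a small generator that only produces the file paths. The name iter_files is hypothetical:

```python
import os


def iter_files(wdir):
    """Yield the path of every file below wdir, including subdirectories."""
    for path, dirs, files in os.walk(wdir):
        for name in files:
            # path is the directory currently being visited by os.walk()
            yield os.path.join(path, name)
```

You could then loop over iter_files(working_dir) and open each path in turn.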