[Tutor] Using Python and Regex

Mon Aug 11 14:03:24 CEST 2014

Bill wrote:

> Thanks for your reply. This was my first attempt; when running through
> IDLE I get the following error:-
> 
> 
> Traceback (most recent call last):
>   File "C:\Users\Bill\Desktop\TXT_Output\email_extraction_script.py", line 27, in <module>
>     traverse_dirs(working_dir)
>   File "C:\Users\Bill\Desktop\TXT_Output\email_extraction_script.py", line 20, in traverse_dirs
>     if match:
> UnboundLocalError: local variable 'match' referenced before assignment
> 
> My code is as follows:

>                 for line in lines:
>                     match = re.search(r"\b[^\<][A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}[^\>]\b",line)
>                 if match:
>                         print(match.group(0))
>                         otext = match.group(0) + ",\n"
>                         output_file.write(otext)

The indentation of 'if match' is wrong; the way you wrote it, the line will 
be executed once, after the for loop has finished, but you want it inside 
the loop so that it runs for every line.

You are lucky that the first file you encountered was empty and thus the 
match variable was never set ;) Otherwise the error would have been harder 
to find.
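
For illustration, here is a minimal sketch of the loop with the check moved 
inside it. The name find_addresses and the simplified pattern are made up 
for this example; they are not taken from your script.

```python
import re

# Simplified pattern for illustration only; not the regex from the script.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

def find_addresses(lines):
    """Return the first address found on each line that contains one."""
    found = []
    for line in lines:
        match = EMAIL.search(line)
        # Indented to the same level as the assignment, so the check runs
        # once per line and 'match' is always bound when it is tested.
        if match:
            found.append(match.group(0))
    return found
```

With an empty list of lines the loop body never runs, so nothing refers to 
match and the function simply returns an empty list.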

Random remarks:

> def traverse_dirs(wdir):
>     grabline = 0
>     for f in os.listdir('.'):

The listdir() argument should probably be wdir instead of '.'.

>         if os.path.isfile(f) == True:

The idiomatic way to spell this is

          if os.path.isfile(f):
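
One thing to watch once you pass wdir to listdir(): it returns bare names, 
not full paths, so isfile() would look for them in the current directory. 
A small sketch of how the two remarks might fit together (list_files is a 
hypothetical helper name, not something from your script):

```python
import os

def list_files(wdir):
    """Return the full paths of the regular files directly inside wdir."""
    paths = []
    for name in os.listdir(wdir):
        # listdir() yields bare names, so join each one with the directory
        # before asking whether it is a file.
        f = os.path.join(wdir, name)
        if os.path.isfile(f):
            paths.append(f)
    return sorted(paths)
```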

>                 content = open(f)
>                 lines = content.readlines()
>                 for line in lines:

The readlines() call will put the whole file into a potentially huge list. 
You don't need to do this for your application. Instead iterate over the 
file directly:

                content = open(f)
                for line in content:

That keeps memory consumption low and the data processing can start 
immediately.
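
As a rough sketch of the same idea, you could also open the file in a with 
statement so that it is closed again when you are done. The function 
count_lines is only an invented example to have something to run:

```python
def count_lines(filename):
    """Count the lines of a file, holding only one line in memory at a time."""
    total = 0
    # 'with' closes the file when the block ends, even if an error occurs.
    with open(filename) as content:
        for line in content:
            total += 1
    return total
```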

PS: The way you wrote it, your program will process only a single 
directory. If you want to look into subdirectories, you should read up on 
os.walk(), as already suggested. You will end up with something like

for path, dirs, files in os.walk(wdir):
    for name in files:
        f = os.path.join(path, name)
        content = open(f)
        ...
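
Filled in, the walk might look roughly like the sketch below. This assumes 
Python 3; extract_addresses, the simplified pattern and the 
errors="replace" choice are my own assumptions for the example, not part 
of your script, and writing to the output file is left out.

```python
import os
import re

# Simplified pattern for illustration only; not the regex from the script.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

def extract_addresses(wdir):
    """Collect the first address on each matching line of every file below wdir."""
    found = []
    for path, dirs, files in os.walk(wdir):
        for name in files:
            f = os.path.join(path, name)
            # errors="replace" keeps undecodable bytes from aborting the scan.
            with open(f, errors="replace") as content:
                for line in content:
                    match = EMAIL.search(line)
                    if match:
                        found.append(match.group(0))
    return found
```

The order of the results follows the order in which os.walk() visits the 
files, so sort the list if you need it to be stable.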