Parsing for email addresses

Tim Chase python.list at tim.thechases.com
Mon Feb 15 19:35:21 EST 2010


Jonathan Gardner wrote:
> On Feb 15, 3:34 pm, galileo228 <mattbar... at gmail.com> wrote:
>> I'm trying to write python code that will open a textfile and find the
>> email addresses inside it. I then want the code to take just the
>> characters to the left of the "@" symbol, and place them in a list.
>> (So if galileo... at gmail.com was in the file, 'galileo228' would be
>> added to the list.)
>>
>> Any suggestions would be much appreciated!
>>
> 
> You may want to use regexes for this. For every match, split on '@'
> and take the first bit.
> 
> Note that the actual specification for email addresses is far more
> than a single regex can handle. However, for almost every single case
> out there nowadays, a regex will get what you need.
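
For reference, the match-then-split approach described above might look
roughly like the sketch below. The pattern is deliberately loose and is
an assumption made for illustration, not an RFC-compliant matcher; the
names ADDRESS and local_parts_by_split and the sample address are
likewise made up for this example.

```python
import re

# Deliberately loose pattern (an assumption, not RFC-compliant):
# a run of non-space, non-"@" characters on each side of a single "@".
ADDRESS = re.compile(r'[^\s@]+@[^\s@]+')

def local_parts_by_split(text):
    """Return the part left of '@' for every address-like match in text."""
    return [m.group(0).split('@')[0] for m in ADDRESS.finditer(text)]

sample = "Write to galileo228@example.com or jim@example.com today."
print(local_parts_by_split(sample))
```

With that sample the list should come out as ['galileo228', 'jim'].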

You can even capture the local part directly as the regex matches,
so no separate split is needed.  As Jonathan mentions, finding
RFC-compliant email addresses can be a hairy/intractable problem.
But you can get a pretty close approximation:

   import re

   r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
   #                                        ^
   # if you want to allow local domains like
   #   user@localhost
   # then change the "+" marked with the "^"
   # to a "*" and the "{2,5}" to "+" to lift
   # the limit on the TLD.  This will change the
   # outcome of the last test "jim@com" to True

   for test, expected in (
       ('jim@example.com', True),
       ('jim@sub.example.com', True),
       ('@example.com', False),
       ('@sub.example.com', False),
       ('@com', False),
       ('jim@com', False),
       ):
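     m = r.match(test)
     # report any case where "matched" disagrees with the expectation
     if bool(m) != expected:
       print("Failed: %r should be %s" % (test, expected))

   # collect the unique local parts (group 1) from every line of the file
   emails = set()
   with open('test.txt') as f:
     for line in f:
       for match in r.finditer(line):
         emails.add(match.group(1))
   print("All the emails:", ', '.join(emails))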
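
If you want to flip between the strict pattern and the relaxed one
described in the comment above, something like the following sketch
could work. The names STRICT, RELAXED and local_parts and the
allow_local_domains flag are invented for this example; the relaxed
pattern is just the "+" to "*" and "{2,5}" to "+" substitution spelled
out, so treat it as an approximation rather than a validator.

```python
import re

# Strict pattern from the post: needs a dot in the domain and a 2-5 character TLD.
STRICT = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})')
# Relaxed variant: "+" changed to "*" and "{2,5}" changed to "+",
# so dotless hosts such as "localhost" are accepted too.
RELAXED = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)*(?:\w+)')

def local_parts(text, allow_local_domains=False):
    """Return the sorted, de-duplicated local parts found in text."""
    pattern = RELAXED if allow_local_domains else STRICT
    return sorted({m.group(1) for m in pattern.finditer(text)})

sample = "jim@example.com, root@localhost, jim@sub.example.com"
print(local_parts(sample))
print(local_parts(sample, allow_local_domains=True))
```

On that sample the strict call should give ['jim'] and the relaxed one
['jim', 'root'], since only the relaxed pattern accepts "root@localhost".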

-tkc








