Parsing for email addresses
Tim Chase
python.list at tim.thechases.com
Mon Feb 15 19:35:21 EST 2010
Jonathan Gardner wrote:
> On Feb 15, 3:34 pm, galileo228 <mattbar... at gmail.com> wrote:
>> I'm trying to write python code that will open a textfile and find the
>> email addresses inside it. I then want the code to take just the
>> characters to the left of the "@" symbol, and place them in a list.
>> (So if galileo... at gmail.com was in the file, 'galileo228' would be
>> added to the list.)
>>
>> Any suggestions would be much appreciated!
>>
>
> You may want to use regexes for this. For every match, split on '@'
> and take the first bit.
>
> Note that the actual specification for email addresses is far more
> than a single regex can handle. However, for almost every single case
> out there nowadays, a regex will get what you need.
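
Jonathan's suggestion might look roughly like the sketch below. The
pattern name `loose_email`, the deliberately loose pattern itself, and the
sample text are made up for illustration and were not in his message.

```python
import re

# Hypothetical, deliberately loose pattern: runs of characters that are
# neither whitespace nor "@" on both sides of a single "@".
loose_email = re.compile(r'[^\s@]+@[^\s@]+')

# Made-up sample text standing in for the contents of the file.
text = "Contact alice@example.com or jim@example.com for details."

# For every match, split on '@' and keep the first bit.
local_parts = [m.group(0).split('@')[0] for m in loose_email.finditer(text)]
print(local_parts)
```
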
You can even capture the local part (everything before the "@")
directly as the regex matches, with no separate split step. As
Jonathan mentions, finding RFC-compliant email addresses can be a
hairy/intractable problem. But you can get a pretty close
approximation:
import re

r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
#                                        ^
# if you want to allow local domains like
#   user@localhost
# then change the "+" marked with the "^"
# (the one right after the "(?:[-\w]+\.)" group)
# to a "*" and the "{2,5}" to "+" to unlimit
# the TLD.  This will change the outcome
# of the last test "jim@com" to True

# quick self-test; match() only matches at the start of the string
for test, expected in (
        ('jim@example.com', True),
        ('jim@sub.example.com', True),
        ('@example.com', False),
        ('@sub.example.com', False),
        ('@com', False),
        ('jim@com', False),
        ):
    m = r.match(test)
    if bool(m) ^ expected:
        print("Failed: %r should be %s" % (test, expected))

# collect the unique local parts (group 1) found on each line
emails = set()
for line in open('test.txt'):
    for match in r.finditer(line):
        emails.add(match.group(1))
print("All the emails: " + ', '.join(emails))
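
For reference, here is a sketch of the relaxed variant described in the
comment under the regex, assuming the two changes are applied exactly as
described. The name `r_local` and the extra "jim@localhost" test are
only illustrative.

```python
import re

# Relaxed variant: the "+" after the domain-label group becomes "*",
# and the "{2,5}" on the final label becomes "+".
r_local = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)*(?:\w+)', re.I)

for test, expected in (
        ('jim@example.com', True),
        ('jim@localhost', True),
        ('jim@com', True),    # now matches, as the comment says
        ('@com', False),      # still needs a non-empty local part
        ):
    m = r_local.match(test)
    if bool(m) ^ expected:
        print("Failed: %r should be %s" % (test, expected))

# group(1) is still the part to the left of the "@"
local = r_local.match('user@localhost').group(1)
print(local)
```

The trade-off is that the relaxed pattern accepts any single word after
the "@", so it will pick up more false positives in free-form text.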
-tkc