What is wrong with this regex for matching emails?
Ben Finney
ben+python at benfinney.id.au
Sun Dec 17 15:57:27 EST 2017
Peng Yu <pengyu.ut at gmail.com> writes:
> Hi,
>
> I would like to extract "abc@efg.hij.xyz". But it only shows ".hij".
Others have addressed this question. I'll answer a separate one:
> Does anybody see what is wrong with it? Thanks.
One thing that's wrong with it is that it is far too restrictive.
> email_regex = re.compile('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')
This excludes a great many email addresses that are valid. Please don't
write a pattern for email addresses that rejects actual, working email
addresses.
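To illustrate, here is a small sketch that runs the quoted pattern against
a few sample addresses. The sample addresses and the helper name
`matches_whole` are my own, chosen only for illustration; the pattern is
the one quoted above, written as a raw string.

```python
import re

# The pattern from the quoted message, as a raw string.
email_regex = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')


def matches_whole(address):
    """Return True only if the pattern matches the entire address.

    Hypothetical helper for this illustration, not part of the original code.
    """
    return email_regex.fullmatch(address) is not None


# Illustrative sample addresses (assumed, not taken from the thread).
samples = [
    "abc@efg.hij",            # domain with exactly two labels
    "abc@efg.hij.xyz",        # domain with three labels
    "o'reilly@example.com",   # apostrophe in the local part
    "用户@example.com",        # non-ASCII local part
]

for address in samples:
    print(address, matches_whole(address))
```

Only the first sample matches in full. The others are cut short or
rejected outright because the pattern allows exactly one dot-separated
label after the first domain label and only a narrow set of ASCII
characters in the local part.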
For an authoritative guide to matching email addresses, see RFC 3696 §3
<URL:https://tools.ietf.org/html/rfc3696#section-3>.
A more correct match would boil down to:
* Match any printable Unicode characters (not just ASCII).
* Locate the *last* ‘@’ character. (An email address may contain more
  than one ‘@’ character; you should allow any printable ASCII character
  in the local part.)
* Match the domain part as the text after the last ‘@’ character. Match
  the local part as anything before that character. Reject an address
  that has either of these empty.
* Validate the domain by DNS request. Your program is not an authority
  for what domains are valid; the only authority for that is the DNS.
* Don't validate the local part at all. Your program is not an authority
  for what local parts are accepted to the destination host; the only
  authority for that is the destination mail host.
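
The steps above could be sketched roughly as follows. This is a minimal
illustration, not a definitive implementation: the function names
`split_address` and `domain_resolves` are hypothetical, `str.isprintable`
is my reading of "printable", and `socket.getaddrinfo` only checks that the
name resolves to an address. A proper mail check would look up MX records,
which the standard library does not do; a third-party resolver would be
needed for that.

```python
import socket


def split_address(address):
    """Split an address at the last '@'.

    Returns (local_part, domain), or None if the address contains
    non-printable characters, has no '@', or has an empty part.
    The local part is deliberately not validated any further.
    """
    if not address.isprintable():
        return None
    # rpartition splits on the *last* occurrence of the separator.
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign or not local_part or not domain:
        return None
    return local_part, domain


def domain_resolves(domain):
    """Ask the resolver whether the domain has any address record.

    Assumption: address lookup stands in for a real MX query here.
    Requires network access, so the result depends on the environment.
    """
    try:
        socket.getaddrinfo(domain, None)
    except (OSError, UnicodeError):
        return False
    return True


print(split_address("abc@efg.hij.xyz"))
print(split_address("a@b@example.com"))
print(split_address("no-at-sign"))
```

Note that the split itself needs no regular expression at all; the only
real validation is delegated to the DNS and, ultimately, to the
destination mail host.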
--
\ “Jealousy: The theory that some other fellow has just as little |
`\ taste.” —Henry L. Mencken |
_o__) |
Ben Finney