Ann: Validating Emails and HTTP URLs in Python
Philip Semanchuk
philip at semanchuk.com
Mon May 3 10:28:54 EDT 2010
On May 3, 2010, at 10:13 AM, andrew cooke wrote:
>> FYI, Fourthought's PyXML has a module called uri.py that contains
>> regexes for URL validation. I've run over a million URLs (harvested
>> from the Internet) through their code. I can't say I checked each and
>> every result, but I never saw anything that would lead me to believe
>> it was misbehaving.
>>
>> It might be interesting to compare the results of running a large
>> list of URLs through your code and theirs.
>>
>> Good luck
>> Philip
>
> It's getting a set of URLs that's the main problem. I've tested it
> with URL examples in RFC 3696, and with a few extra ones that test
> particular issues, but when I looked around I couldn't find any
> public, obvious list of URLs for general testing. Could I use your
> list?
>
> Also, same for emails...

If I still had a list of URLs, you'd be welcome to it. The list was
generated as part of a spidering project that's long gone.

If all you want to do is generate a list of URLs and email addresses,
you could cobble together a robots.txt-respectful spider without too
much trouble. As with so many things, it's just an SMOP [1]. =)

[1] - http://en.wikipedia.org/wiki/Small_matter_of_programming
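
For what it's worth, here is a rough sketch of what such a spider might
look like using only the standard library. It is a starting point under
some loud assumptions, not a finished tool: the names USER_AGENT,
harvest() and fetch() are made up for illustration, the email regex is
deliberately loose (it only collects candidates for testing, it does not
validate anything), and a robots.txt that can't be fetched is treated as
"everything allowed", which is a simplification. A real spider would
also want a delay between requests.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-harvester"  # hypothetical name; pick your own
# Loose pattern for collecting candidate addresses, not for validating them.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch(url):
    """Default fetcher: return the body as text, or "" on any failure."""
    try:
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def harvest(seed, max_pages=50, fetch=fetch):
    """Breadth-first crawl from seed; return (urls, emails) as sets."""
    robots = {}  # one parsed robots.txt per scheme://host
    urls, emails = set(), set()
    queue, seen = deque([seed]), {seed}
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        parts = urlsplit(url)
        root = f"{parts.scheme}://{parts.netloc}"
        if root not in robots:
            rp = RobotFileParser()
            # Assumption: an unfetchable robots.txt ("" here) allows everything.
            rp.parse(fetch(root + "/robots.txt").splitlines())
            robots[root] = rp
        if not robots[root].can_fetch(USER_AGENT, url):
            continue  # disallowed: record the URL but never request it
        pages += 1
        page = fetch(url)
        emails.update(EMAIL_RE.findall(page))
        parser = LinkParser()
        parser.feed(page)
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlsplit(link).scheme in ("http", "https"):
                urls.add(link)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return urls, emails
```

Calling harvest("http://example.com/") would return two sets you can
dump to files. The fetch argument is there so you can swap in a fake
fetcher and try the thing out without touching the network.
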
bye
Philip