Ann: Validating Emails and HTTP URLs in Python

Philip Semanchuk philip at semanchuk.com
Mon May 3 10:28:54 EDT 2010


On May 3, 2010, at 10:13 AM, andrew cooke wrote:

>> FYI, Fourthought's PyXML has a module called uri.py that contains
>> regexes for URL validation. I've run over a million URLs (harvested
>> from the Internet) through their code. I can't say I checked each and
>> every result, but I never saw anything that would lead me to believe
>> it was misbehaving.
>>
>> It might be interesting to compare the results of running a large
>> list of URLs through your code and theirs.
>>
>> Good luck
>> Philip
>
> It's getting a set of URLs that's the main problem.  I've tested it
> with URL examples in RFC 3696, and with a few extra ones that test
> particular issues, but when I looked around I couldn't find any
> public, obvious list of URLs for general testing.  Could I use your
> list?
>
> Also, same for emails...

If I still had a list of URLs, you'd be welcome to it. The list was
generated as part of a spidering project that's long gone.

If all you want to do is generate a list of URLs and email addresses,
you could cobble together a robots.txt-respectful spider without too
much trouble. As with so many things, it's just an SMOP [1]. =)

[1] - http://en.wikipedia.org/wiki/Small_matter_of_programming
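
For what it's worth, here is a rough sketch of what such a spider might
look like, using only the standard library of a current Python 3. Every
name in it (USER_AGENT, EMAIL_RE, extract, allowed, crawl) is made up
for illustration, and the email regex is deliberately loose: it only
gathers candidates to feed to a real validator, it does not validate
anything itself.

```python
import re
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-harvester"  # hypothetical agent name
# Loose pattern for harvesting candidate addresses, NOT for validation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract(base_url, html):
    """Return (http_urls, emails) found in one page of HTML."""
    parser = LinkParser()
    parser.feed(html)
    emails = set(EMAIL_RE.findall(html))
    urls = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if urlsplit(absolute).scheme in ("http", "https"):
            urls.add(absolute)
    return urls, emails


def allowed(url, robots_cache):
    """Ask the host's robots.txt, fetching it at most once per host."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    if root not in robots_cache:
        rp = RobotFileParser(root + "/robots.txt")
        try:
            rp.read()
        except OSError:
            return False  # assumption: skip hosts we cannot check
        robots_cache[root] = rp
    return robots_cache[root].can_fetch(USER_AGENT, url)


def crawl(start_url, max_pages=50, delay=1.0):
    """Breadth-first crawl; returns (all_urls, all_emails)."""
    seen, queue = set(), [start_url]
    all_urls, all_emails, robots_cache = set(), set(), {}
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not allowed(url, robots_cache):
            continue
        seen.add(url)
        try:
            req = Request(url, headers={"User-Agent": USER_AGENT})
            with urlopen(req, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        urls, emails = extract(url, html)
        all_urls |= urls
        all_emails |= emails
        queue.extend(urls - seen)
        time.sleep(delay)  # be polite between requests
    return all_urls, all_emails
```

Calling crawl() on a start page of your choosing should give back two
sets you can dump to files and run through both validators. It ignores
content types, redirects and Crawl-delay, so treat it as a starting
point only.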

bye
Philip




