Ann: Validating Emails and HTTP URLs in Python
livibetter
livibetter at gmail.com
Tue May 4 08:02:51 EDT 2010
First, it's good to see a library that has URL and email validators.
But I think there might be a problem in your validator. The URLs I
tested are:
http://example.com/path
http://example.com/path)
http://example.com/path]
http://example.com/path}
By my understanding of the RFCs, only the first two are valid.
>>> from lepl.apps.rfc3696 import *
>>> v = HttpUrl()
>>> v('http://example.com/')
True
>>> v('http://example.com/path')
True
>>> v('http://example.com/path)')
True
>>> v('http://example.com/path]')
True
>>> v('http://example.com/path}')
True
You used RFC 3696 [1] to write your code (I read your source code,
lepl.apps.rfc3696._HttpUrl()). I think your code should return True
only for the first three calls, but all five return True. Maybe I am
using it incorrectly?
I also think there is a slight issue, because RFC 3696 was written
based on RFC 2396 [2], which is obsoleted by RFC 3986 [3]. I have not
read RFC 3696 closely, so I am not sure whether this is a problem.
But RFC 3696 says:
The following characters are reserved in many URIs -- they must be
used for either their URI-intended purpose or must be encoded. Some
particular schemes may either broaden or relax these restrictions
(see the following sections for URLs applicable to "web pages" and
electronic mail), or apply them only to particular URI component
parts.
; / ? : @ & = + $ , ?
However, RFC 2396 (the obsoleted RFC), section "3.3. Path Component", says:
The path component contains data, specific to the authority (or the
scheme if there is no authority component), identifying the resource
within the scope of that scheme and authority.
path          = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment       = *pchar *( ";" param )
param         = *pchar
pchar         = unreserved | escaped |
                ":" | "@" | "&" | "=" | "+" | "$" | ","
Here is the definition of unreserved, which pchar refers to:
unreserved = alphanum | mark
mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
In RFC 3986 the definitions are a bit different, but my point here is
that "(" and ")" are allowed in a path, while "]" and "}" are not.
The Uri module from 4Suite returns the results I expect:
>>> import Ft.Lib.Uri as U
>>> U.MatchesUriSyntax('http://example.com/path')
True
>>> U.MatchesUriSyntax('http://example.com/path)')
True
>>> U.MatchesUriSyntax('http://example.com/path}')
False
>>> U.MatchesUriSyntax('http://example.com/path]')
False
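The same answers fall out of the RFC 2396 grammar quoted above. Here is
a rough sketch of my own (not code from LEPL or 4Suite) that checks only
an absolute path against those rules. The names ABS_PATH and
is_rfc2396_abs_path are made up for illustration, and the scheme, host,
query and fragment are ignored.

```python
import re

# Character classes taken from the RFC 2396 excerpts quoted in this mail.
# unreserved = alphanum | mark
UNRESERVED = r"A-Za-z0-9\-_.!~*'()"
# pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
PCHAR = r"(?:[" + UNRESERVED + r":@&=+$,]|%[0-9A-Fa-f]{2})"
# segment = *pchar *( ";" param ), where param = *pchar
SEGMENT = PCHAR + r"*(?:;" + PCHAR + r"*)*"
# abs_path = "/" path_segments, i.e. one or more "/" segment groups
ABS_PATH = re.compile(r"(?:/" + SEGMENT + r")+\Z")


def is_rfc2396_abs_path(path):
    """Return True if path matches abs_path from RFC 2396 (sketch only)."""
    return ABS_PATH.match(path) is not None
```

With this, '/path' and '/path)' are accepted, while '/path]' and
'/path}' are rejected, because "]" and "}" appear in neither pchar nor
mark.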
I think you should use (read) RFC 3986, not RFC 3696, for URL
validation.
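To show what I mean, here is a rough sketch of my own that checks the
path of an HTTP URL against the RFC 3986 pchar rule (unreserved,
pct-encoded, sub-delims, ":" and "@"). The name http_path_ok is
hypothetical, the URL splitting is deliberately crude, and the host,
query and fragment are not validated at all.

```python
import re

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"A-Za-z0-9\-._~"
# RFC 3986: sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS = r"!$&'()*+,;="
# RFC 3986: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[" + UNRESERVED + SUB_DELIMS + r":@]|%[0-9A-Fa-f]{2})"
# RFC 3986: path-abempty = *( "/" segment ), where segment = *pchar
PATH_ABEMPTY = re.compile(r"(?:/" + PCHAR + r"*)*\Z")


def http_path_ok(url):
    """Check only the path part of an http(s) URL (illustrative sketch)."""
    scheme, sep, rest = url.partition("://")
    if not sep or scheme.lower() not in ("http", "https"):
        return False
    # Drop the fragment and the query, then everything before the first "/".
    rest = rest.split("#", 1)[0].split("?", 1)[0]
    slash = rest.find("/")
    path = rest[slash:] if slash != -1 else ""
    return PATH_ABEMPTY.match(path) is not None
```

For the four URLs at the top of this mail, this accepts the first two
and rejects the ones ending in "]" and "}".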
One more thing, HttpUrl()'s docstring should s/email/url/.
[1]: http://tools.ietf.org/html/rfc3696
[2]: http://tools.ietf.org/html/rfc2396
[3]: http://tools.ietf.org/html/rfc3986