[Patches] [ python-Patches-1500504 ] Alternate RFC 3986 compliant URI parsing module

Wed Feb 14 10:11:24 CET 2007

Patches item #1500504, was opened at 2006-06-05 00:50
Message generated for change (Comment added) made by ncoghlan
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1500504&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nick Coghlan (ncoghlan)
Assigned to: Nobody/Anonymous (nobody)
Summary: Alternate RFC 3986 compliant URI parsing module

Initial Comment:
Inspired by (and based on) Paul Jimenez's uriparse
module (http://python.org/sf/1462525), urischemes tries
to put a cleaner interface in front of the URI parsing
engine.

Most of the module works with a URI subclass of tuple
that is always a 5-tuple (scheme, authority, path,
query, fragment).

The authority component is either None, or a
URIAuthority subclass of tuple that is always a 4-tuple
(user, password, host, port).
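
As a rough illustration of the two tuple shapes described
above (not the code from the patch), the sketch below builds
them with collections.namedtuple. The field names follow the
description; the __str__ formatting of the authority is an
assumption.

```python
from collections import namedtuple


class URIAuthority(namedtuple("URIAuthority", "user password host port")):
    """Hypothetical stand-in: always a 4-tuple (user, password, host, port)."""
    __slots__ = ()

    def __str__(self):
        # Assumed formatting: rebuild "user:password@host:port",
        # skipping any part that is None.
        result = self.host or ""
        if self.user is not None:
            userinfo = self.user
            if self.password is not None:
                userinfo += ":" + self.password
            result = userinfo + "@" + result
        if self.port is not None:
            result += ":" + str(self.port)
        return result


class URI(namedtuple("URI", "scheme authority path query fragment")):
    """Hypothetical stand-in: always a 5-tuple of URI components."""
    __slots__ = ()


auth = URIAuthority("alice", None, "example.org", 8080)
uri = URI("http", auth, "/index.html", None, None)
```

Because both classes are tuple subclasses, they still unpack
and index like plain tuples.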

The function make_uri will create a URI string from the
5 constituent components of a URI. The components do
not need to be strings - if they are not strings, str()
will be invoked on them (this allows the URIAuthority
tuple subclass to be used transparently instead of a
string for the authority component). The result is
checked to ensure it is an RFC-compliant URI.
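
A minimal sketch of that recombination step, following the
component recomposition rules in RFC 3986 section 5.3. It is
an assumption about how make_uri could work, not the code in
the patch, and it omits the final compliance check. HostPort
is a hypothetical class that exists only to show a non-string
component being passed through str().

```python
def make_uri(scheme, authority, path, query, fragment):
    """Recombine components per RFC 3986 section 5.3 (sketch, no validation)."""
    parts = []
    if scheme is not None:
        parts.append(str(scheme) + ":")
    if authority is not None:
        parts.append("//" + str(authority))
    parts.append(str(path))
    if query is not None:
        parts.append("?" + str(query))
    if fragment is not None:
        parts.append("#" + str(fragment))
    return "".join(parts)


class HostPort(tuple):
    """Hypothetical non-string authority; only its str() form matters here."""

    def __str__(self):
        return "%s:%s" % (self[0], self[1])


print(make_uri("http", HostPort(("example.org", 80)), "/a", "x=1", "top"))
```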

The function split_uri accepts a string and returns a
URI object with strings as the individual elements.
Invoking str() on this object will recreate a URI
string using make_uri(). The regex underlying this
operation is now broken out and available as module-level
attributes such as URI_PATTERN.
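
For reference, the splitting step can be sketched with the
regular expression given in RFC 3986 Appendix B. The pattern
actually stored in the module's URI_PATTERN may differ, and
this split_uri returns a plain tuple rather than a URI object.

```python
import re

# Regular expression from RFC 3986 Appendix B. The module's own
# URI_PATTERN is not reproduced here and may be stricter.
URI_PATTERN = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
_uri_re = re.compile(URI_PATTERN)


def split_uri(text):
    """Return (scheme, authority, path, query, fragment) as a plain tuple.

    Components that are absent from the input come back as None;
    the path is always a string (possibly empty).
    """
    match = _uri_re.match(text)
    return match.group(2, 4, 5, 7, 9)


print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```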

The functions split_authority and make_authority are
similar, but operate only on the authority component
rather than on the whole URI.
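
A simplified sketch of the authority pair, assuming the
"user:password@host:port" layout. It performs no validation,
keeps the port as a string, and the signatures are
assumptions rather than the ones in the patch.

```python
def split_authority(text):
    """Split an authority string into (user, password, host, port)."""
    user = password = port = None
    userinfo, at_sign, hostport = text.rpartition("@")
    if at_sign:
        user, colon, rest = userinfo.partition(":")
        if colon:
            password = rest
    # IPv6 literals are bracketed and contain colons, so a port is
    # only recognised when nothing after the last ":" is a "]".
    host, colon, tail = hostport.rpartition(":")
    if colon and "]" not in tail:
        port = tail
    else:
        host = hostport
    return (user, password, host, port)


def make_authority(user, password, host, port):
    """Inverse of split_authority for the cases handled above."""
    result = host
    if user is not None:
        userinfo = user if password is None else user + ":" + password
        result = userinfo + "@" + result
    if port is not None:
        result += ":" + port
    return result


text = "alice:secret@example.org:8080"
print(split_authority(text))
print(make_authority(*split_authority(text)) == text)
```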

The function parse_uri digs into the internal structure
of a URI, also parsing the components. This will
replace a non-empty URI authority component string with
a URIAuthority tuple subclass. Depending on the scheme,
it may also replace other components (e.g. for mailto
links, the path is replaced with a (user, host) tuple
subclass).
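
To illustrate the mailto case, here is a hypothetical
(user, host) tuple subclass whose str() form recreates the
original path. The names MailtoPath and split_mailto_path
are invented for this sketch, which handles only a single
address.

```python
from collections import namedtuple


class MailtoPath(namedtuple("MailtoPath", "user host")):
    """Hypothetical (user, host) tuple subclass for a mailto path."""
    __slots__ = ()

    def __str__(self):
        # Converting back to a string recreates the original path.
        return self.user + "@" + self.host


def split_mailto_path(path):
    """Split "user@host" at the last "@" (single address only)."""
    user, _, host = path.rpartition("@")
    return MailtoPath(user, host)


path = split_mailto_path("nick@example.org")
print(tuple(path), str(path))
```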

The main parsing engine is still URIParser (much the
same as Paul's), but the root of the internal parser
hierarchy is now SchemeParser. This has two subclasses,
URLParser and MailtoParser. The various URL flavours
are now different instances of URLParser rather than
subclasses. All of the actual parsers are available as
module level attributes with the same name as the
scheme they parse.  Additionally, each parser knows the
name of the scheme it is intended to parse.

The parse() methods of the individual parsers are now
expected to return a URI object (SchemeParser actually
takes care of this). The parse() method also takes a
dictionary of defaults, which can override the defaults
supplied by the parser instance. The unparse() method
is gone - instead, the scheme parser should ensure that
all components returned are either strings or produce
the right thing when __str__ is invoked (e.g. see
_MailtoURIPath).

The module-level 'schemes' attribute is a mapping from
scheme names to parsers. It is automatically populated
with all instances of SchemeParser found in the module's
globals().
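
One way such a registry could be populated is sketched
below. The SchemeParser class here is a bare placeholder,
collect_schemes is a hypothetical helper, and an explicit
dictionary stands in for the module's globals() so that the
example is self-contained.

```python
class SchemeParser:
    """Bare placeholder: each parser only knows its scheme name."""

    def __init__(self, scheme):
        self.scheme = scheme


def collect_schemes(namespace):
    """Map scheme names to every SchemeParser found in a namespace."""
    return {
        obj.scheme: obj
        for obj in list(namespace.values())
        if isinstance(obj, SchemeParser)
    }


# Stand-in for the module namespace; inside a real module this
# would be the dictionary returned by globals().
namespace = {
    "http": SchemeParser("http"),
    "ftp": SchemeParser("ftp"),
    "mailto": SchemeParser("mailto"),
    "URI_PATTERN": "not a parser, so it is skipped",
}
schemes = collect_schemes(namespace)
print(sorted(schemes))
```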

urljoin has been renamed to join_uri to match the style
of the other names in the module.

----------------------------------------------------------------------

>Comment By: Nick Coghlan (ncoghlan)
Date: 2007-02-14 19:11

Message:
Logged In: YES 
user_id=1038590
Originator: YES

Removed all versions prior to 0.4

----------------------------------------------------------------------

Comment By: Nick Coghlan (ncoghlan)
Date: 2006-06-08 22:11

Message:
Logged In: YES 
user_id=1038590

Uploaded version 0.4

This version cleans up the logic in resolve_uripath a bit:
it uses a separate loop to strip the leading dot segments,
and adds comments explaining the meaning of the if
statements that deal with dot segments.
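
For readers unfamiliar with dot-segment handling, the sketch
below follows the remove_dot_segments algorithm from RFC 3986
section 5.2.4. It is not the resolve_uripath code from the
patch, which is structured differently (it strips leading dot
segments in a separate loop).

```python
def remove_dot_segments(path):
    """Apply the RFC 3986 section 5.2.4 algorithm to a path string."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]          # drop a leading "../"
        elif path.startswith("./"):
            path = path[2:]          # drop a leading "./"
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" becomes "/x" ...
            if output:
                output.pop()         # ... and the previous segment goes
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any)
            # from the input to the output.
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
```

Both inputs checked here come from the examples in that
section of the RFC.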

It also exposes EmailPath (along with split_emailpath and
join_emailpath) as public objects, rather than treating them
as internal to the MailtoSchemeParser.

----------------------------------------------------------------------

Comment By: Nick Coghlan (ncoghlan)
Date: 2006-06-07 01:46

Message:
Logged In: YES 
user_id=1038590

Uploaded version 0.3, which passes all the RFC tests as well
as the 4Suite tests Mike sent me that failed against
versions 0.1 and 0.2.

The last 4Suite failure went away when I realised those
tests expected to operate in strict mode :)

----------------------------------------------------------------------

Comment By: Nick Coghlan (ncoghlan)
Date: 2006-06-05 23:53

Message:
Logged In: YES 
user_id=1038590

Updated version attached which addresses some issues raised
by Mike Brown in private mail (the difference between a URI
and a URI reference and some major differences between URI
paths and posix paths).

Also settled on split/join for the component separation and
recombination operations and made the join methods all take
a tuple so that join_x(split_x(uri)) round trips.

Based on the terminology in the RFC, the function to combine
a URI reference with a base URI is now called "resolve_uriref".
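
The behaviour being described is reference resolution as
defined in RFC 3986 section 5.2. Since the patch's
resolve_uriref is not reproduced here, the sketch below uses
the standard library's urllib.parse.urljoin as a stand-in,
with a base URI and references taken from the examples in
section 5.4 of the RFC.

```python
from urllib.parse import urljoin

# Base URI used by the examples in RFC 3986 section 5.4.
BASE = "http://a/b/c/d;p?q"

# urljoin is only a stand-in for resolve_uriref; the two are
# not claimed to agree on every input.
for reference in ("g", "../g", "#s"):
    print(reference, "->", urljoin(BASE, reference))
```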

----------------------------------------------------------------------
