
On Fri, May 27, 2011 at 12:59 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
> I wonder if it would be possible to generalize Nick's work on urllib.parse to a more general class.

I thought about that when I was implementing it, and I don't really
think so. The decode/encode cycle in urllib.parse is based on a few
key elements:

1. The URL standard itself mandates a 7-bit ASCII bytestream. The
   implicit conversion accordingly uses the ascii codec with strict
   error handling, so if you want to handle malformed URLs, you still
   have to do your own decoding and pass in already decoded text
   strings rather than the raw bytes (as there is no way for the
   library to guess an appropriate encoding for any non-ASCII bytes it
   encounters).

2. The affected urllib.parse APIs are all stateless - the output is
   determined by the inputs. Accordingly, it was fairly
   straightforward to coerce all of the arguments to strings and also
   create a "coerce result" callable that is either a no-op that just
   returns its argument (string inputs) or calls .encode() on its
   input and returns that (bytes/bytearray inputs).

3. All of the operations that returned tuples were updated to return
   namedtuple subclasses with an encode() method that passed the
   encoding command down to the individual tuple elements. These
   subclasses all came in matched pairs (one that held only strings,
   another that held only bytes).

The argument coercion function could probably be extracted and placed
in the string module, but it isn't all that useful on its own - it's
adequate if you're only returning single strings, but needs to be
matched with an appropriately designed class hierarchy if you're
returning anything more complicated.

I believe RDM used a similar design pattern of parallel bytes and
string based return types to get the email package into a more usable
state for 3.2.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
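
For readers who want to see the shape of points 2 and 3 above, the
following is a minimal sketch of the pattern and not the actual
urllib.parse code. The names coerce_args, SplitText, SplitBytes and
split_scheme are hypothetical, chosen only for illustration; the
sketch assumes the ascii codec with strict error handling, as
described in point 1.

```python
from collections import namedtuple


def coerce_args(*args):
    """Return (text_args, coerce_result) for all-str or all-bytes input.

    Hypothetical helper: coerce_result is a no-op for str inputs and
    calls .encode() on the result for bytes/bytearray inputs.
    """
    if all(isinstance(a, str) for a in args):
        return args, lambda result: result
    if all(isinstance(a, (bytes, bytearray)) for a in args):
        # Strict ASCII decoding: non-ASCII bytes raise UnicodeDecodeError,
        # so callers with malformed input must decode it themselves.
        decoded = tuple(a.decode("ascii", "strict") for a in args)
        return decoded, lambda result: result.encode("ascii", "strict")
    raise TypeError("cannot mix str and bytes arguments")


class SplitText(namedtuple("SplitText", "head tail")):
    """Result type holding only str elements."""
    __slots__ = ()

    def encode(self, encoding="ascii", errors="strict"):
        # Pass the encoding request down to each tuple element.
        return SplitBytes(*(x.encode(encoding, errors) for x in self))


class SplitBytes(namedtuple("SplitBytes", "head tail")):
    """Matching result type holding only bytes elements."""
    __slots__ = ()

    def decode(self, encoding="ascii", errors="strict"):
        return SplitText(*(x.decode(encoding, errors) for x in self))


def split_scheme(url):
    """Stateless toy API: split at the first ':' in str or bytes input."""
    (url,), coerce_result = coerce_args(url)
    head, _, tail = url.partition(":")
    # All real work happens on str; the result is converted back at the end.
    return coerce_result(SplitText(head, tail))
```

Calling split_scheme("http://example.com/a") returns a SplitText,
while split_scheme(b"http://example.com/a") returns a SplitBytes with
the same fields encoded. The single-string case only needs
coerce_args; the paired classes are what make a tuple-valued result
round-trip cleanly.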