[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

Fri May 27 07:11:41 CEST 2011

On Fri, May 27, 2011 at 12:59 PM, Stephen J. Turnbull
<stephen at xemacs.org> wrote:
> I wonder if it would be possible to generalize Nick's work on
> urllib.parse to a more general class.

I thought about that when I was implementing it, and I don't really
think so. The decode/encode cycle in urllib.parse is based on a few
key elements:

1. The URL standard itself mandates a 7-bit ASCII bytestream. The
implicit conversion accordingly uses the ascii codec with strict error
handling, so if you want to handle malformed URLs, you still have to
do your own decoding and pass in already decoded text strings rather
than the raw bytes (as there is no way for the library to guess an
appropriate encoding for any non-ASCII bytes it encounters).
2. The affected urllib.parse APIs are all stateless - the output is
determined entirely by the inputs. Accordingly, it was fairly
straightforward to coerce all of the arguments to strings and also
create a "coerce result" callable that is either a no-op that just
returns its argument (string inputs) or calls .encode() on its input
and returns that (bytes/bytearray inputs).
3. All of the operations that returned tuples were updated to return
namedtuple subclasses with an encode() method that passed the encoding
command down to the individual tuple elements. These subclasses all
came in matched pairs (one that held only strings, another that held
only bytes).
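
As a quick illustration of the three points above, the following
snippet exercises the behaviour as it ships in urllib.parse in Python
3.2 and later. The example URLs are made up for the demonstration;
everything else is the public API.

```python
from urllib.parse import urljoin, urlsplit

# String in, string-based result out; bytes in, bytes-based result out.
text_result = urlsplit("http://example.com/path?q=1")
bytes_result = urlsplit(b"http://example.com/path?q=1")
print(type(text_result).__name__)   # SplitResult
print(type(bytes_result).__name__)  # SplitResultBytes

# The matched pair converts in both directions (ascii/strict by default).
round_trip_ok = (text_result.encode() == bytes_result
                 and bytes_result.decode() == text_result)

# Point 1: non-ASCII bytes are rejected rather than guessed at.
try:
    urlsplit(b"http://example.com/caf\xe9")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Mixing str and bytes arguments in a single call is an error.
try:
    urljoin("http://example.com/", b"path")
    mixed_failed = False
except TypeError:
    mixed_failed = True
```

For a malformed URL with non-ASCII bytes, the caller decodes it with
whatever encoding is appropriate and passes in the resulting str.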

The argument coercion function could probably be extracted and placed
in the string module, but it isn't all that useful on its own - it's
adequate if you're only returning single strings, but needs to be
matched with an appropriately designed class hierarchy if you're
returning anything more complicated.
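
To make that concrete, here is a minimal sketch of the pattern outside
urllib.parse. It is not the actual implementation: coerce_args, Pair,
PairBytes and split_pair are all hypothetical names invented for this
example, and the real code handles more cases (such as empty
arguments).

```python
from collections import namedtuple


def _noop(obj):
    return obj


def coerce_args(*args, encoding="ascii", errors="strict"):
    """Hypothetical helper: return the args as str plus a result coercer."""
    str_input = isinstance(args[0], str)
    for arg in args[1:]:
        if isinstance(arg, str) != str_input:
            raise TypeError("cannot mix str and non-str arguments")
    if str_input:
        return args + (_noop,)
    decoded = tuple(arg.decode(encoding, errors) for arg in args)
    # Works on anything with an encode() method: a str or a result class.
    return decoded + (lambda obj: obj.encode(encoding, errors),)


_PairBase = namedtuple("_PairBase", "key value")


class PairBytes(_PairBase):
    """Bytes-only half of the matched pair."""
    __slots__ = ()

    def decode(self, encoding="ascii", errors="strict"):
        return Pair(*(x.decode(encoding, errors) for x in self))


class Pair(_PairBase):
    """String-only half of the matched pair."""
    __slots__ = ()

    def encode(self, encoding="ascii", errors="strict"):
        # Pass the encoding command down to each tuple element.
        return PairBytes(*(x.encode(encoding, errors) for x in self))


def split_pair(data):
    """Stateless example API: split 'key=value' for str or bytes input."""
    data, coerce_result = coerce_args(data)
    key, _, value = data.partition("=")
    return coerce_result(Pair(key, value))
```

With this, split_pair("a=b") returns a Pair of strings, while
split_pair(b"a=b") returns a PairBytes of bytes. The coercer alone
would be enough if split_pair returned a single string; it is the
tuple result that forces the paired classes.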

I believe RDM (R. David Murray) used a similar design pattern of
parallel bytes and string based return types to get the email package
into a more usable state for 3.2.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia