[Python-Dev] email package status in 3.X

P.J. Eby pje at telecommunity.com
Sun Jun 20 20:40:56 CEST 2010


At 10:57 AM 6/20/2010 -0700, Guido van Rossum wrote:
>The problem comes exactly where you find it: when *porting* existing
>code that uses aforementioned ways to alleviate the pain, you find
>that the hacks no longer work and a properly layered design is needed
>that clearly distinguishes between which variables contain bytes and
>which text.

Actually, I would say that it's more that (in the network protocol 
case) we *have* bytes, some of which we would like to *treat* as 
text, yet do not wish to constantly convert back and forth to 
full-blown unicode -- especially since the protocols themselves 
designate ASCII or latin-1 at the transport layer (sometimes with 
odder encodings above, but these already have to be explicitly dealt 
with by existing code).

While reading over this thread, I'm wondering whether at least my 
(WSGI-related) problems in this area would be solved by the 
availability of a type (say "bstr") that was simply a wrapper 
providing string-like behavior over an underlying bytes, bytearray, 
or memoryview, and that would produce objects of a compatible type 
when combined with strings (by encoding them to match).

Then, I could wrap bytes with it to pass them to string operations, 
and then feed them back into everything else.  The bstr type ideally 
would be directly compatible with bytes I/O, or at least have a 
.bytes attribute that would be.

It seems like that would reduce WSGI porting issues quite a bit, 
since it would mostly consist of throwing extra bstr() calls in where 
things are breaking, and maybe grabbing the .bytes attribute for I/O.
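
As a rough illustration of the wrapper idea, here is a minimal sketch. 
The class is hypothetical: the name "bstr" and the .bytes attribute 
come from the text above, but the methods shown, the latin-1 default, 
and the _coerce helper are assumptions made only for this example, and 
only a few string operations are covered.

```python
class bstr:
    """Hypothetical sketch: string-like operations over wrapped bytes."""

    def __init__(self, data, encoding="latin-1"):
        if isinstance(data, bstr):
            data = data.bytes
        elif isinstance(data, str):
            data = data.encode(encoding)
        # bytes() accepts bytes, bytearray, and memoryview alike.
        self.bytes = bytes(data)
        self.encoding = encoding

    def _coerce(self, other):
        # Encode str operands to match the wrapped bytes (assumed rule).
        if isinstance(other, bstr):
            return other.bytes
        if isinstance(other, str):
            return other.encode(self.encoding)
        return bytes(other)

    def __add__(self, other):
        return bstr(self.bytes + self._coerce(other), self.encoding)

    def __radd__(self, other):
        return bstr(self._coerce(other) + self.bytes, self.encoding)

    def __eq__(self, other):
        # Defining __eq__ alone leaves the class unhashable, which avoids
        # promising a hash that agrees with both str and bytes.
        if isinstance(other, (bstr, str, bytes, bytearray, memoryview)):
            return self.bytes == self._coerce(other)
        return NotImplemented

    def startswith(self, prefix):
        return self.bytes.startswith(self._coerce(prefix))

    def __repr__(self):
        return "bstr(%r)" % (self.bytes,)


# Combining with a str yields another bstr; .bytes is ready for I/O.
path = bstr(b"/base/") + "subdir"
```

A real version would need the rest of the str API, and would have to 
decide what happens when a str operand cannot be encoded; in this 
sketch that simply raises UnicodeEncodeError.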

This approach would still be explicit as to what types you're working 
with, but would not require O(n) *conversions* at every interaction 
boundary.  It would be limited, of course, to single-byte encodings 
in which every byte value (0-255) is a valid character.
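
The "all byte values valid" restriction can be checked directly. The 
sketch below only demonstrates which codec round-trips arbitrary 
bytes; the variable names are made up for the example, and the decode 
shown here is still an ordinary O(n) conversion, not the wrapping 
being proposed.

```python
# Under latin-1 every byte value maps to exactly one code point, so the
# round trip is lossless for arbitrary binary data.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip_ok = as_text.encode("latin-1") == all_bytes

# ASCII only defines 0-127, so the same data cannot be decoded strictly.
try:
    all_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```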

OTOH, maybe there should just be a bytestrings module with 
bytestrings.ascii and bytestrings.latin1, and between the two that 
should cover the network protocol needs quite well.

Actually, if the Python 3 str() constructor could do O(1) conversion 
for the latin-1 case (i.e., just wrapped the underlying bytes), I 
would just put, "bstr = lambda x: str(x,'latin-1')" at the top of my 
programs and have roughly the same effect.

This idea is still a bit half-baked, but a more baked version might 
be just the ticket for porting stuff that used str to work with bytes 
in 2.x, if only because writing, e.g.:

      newurl = bstr(urljoin(bstr(base), 'subdir'))

seems so much saner than writing *this* everywhere:

      newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
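
For concreteness, the decode/join/encode round trip can be sketched 
with the standard urllib.parse.urljoin. The helper names to_text and 
to_wire and the example URL are hypothetical, chosen only for this 
illustration.

```python
from urllib.parse import urljoin


def to_text(data):
    # Hypothetical helper: decode wire bytes as latin-1, pass str through.
    return data if isinstance(data, str) else str(data, "latin-1")


def to_wire(text):
    # Hypothetical helper: encode back to bytes for I/O.
    return text.encode("latin-1")


base = b"http://example.com/app/"
newurl = to_wire(urljoin(to_text(base), "subdir"))
```

Each call site pays for one decode and one encode, which is the 
"conversions both coming and going" cost that a wrapper type would be 
meant to hide.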

It is perhaps a bit late to propose this idea, since ideally we would 
also want to use it in 2.x to aid porting.  But I'm curious whether 
other people here who are experiencing bytes/unicode woes with 
network protocols would find this a solution to their chief 
frustration (i.e., that the stdlib now often insists on strings 
where bytes were effectively usable before, so that one must do 
conversions both coming and going).


