[Python-Dev] email package status in 3.X
P.J. Eby
pje at telecommunity.com
Sun Jun 20 20:40:56 CEST 2010
At 10:57 AM 6/20/2010 -0700, Guido van Rossum wrote:
>The problem comes exactly where you find it: when *porting* existing
>code that uses aforementioned ways to alleviate the pain, you find
>that the hacks no longer work and a properly layered design is needed
>that clearly distinguishes between which variables contain bytes and
>which text.
Actually, I would say that it's more that (in the network protocol
case) we *have* bytes, some of which we would like to *treat* as
text, yet do not wish to constantly convert back and forth to
full-blown unicode -- especially since the protocols themselves
designate ASCII or latin-1 at the transport layer (sometimes with
odder encodings above, but these already have to be explicitly dealt
with by existing code).
While reading over this thread, I'm wondering whether at least my
(WSGI-related) problems in this area would be solved by the
availability of a type (say "bstr") that was simply a wrapper
providing string-like behavior over an underlying bytes, bytearray,
or memoryview, that would produce objects of compatible type when
combined with strings (by encoding them to match).
Then, I could wrap bytes with it to pass them to string operations,
and then feed them back into everything else. The bstr type ideally
would be directly compatible with bytes I/O, or at least have a
.bytes attribute that would be.
It seems like that would reduce WSGI porting issues quite a bit,
since it would mostly consist of throwing extra bstr() calls in where
things are breaking, and maybe grabbing the .bytes attribute for I/O.
This approach would still be explicit as to what types you're working
with, but would not require O(n) *conversions* at every interaction
boundary. It would be limited, of course, to single-byte encodings
with all characters (0-255) valid.
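
To make the wrapper idea concrete, here is a minimal sketch of what such a type might look like. Everything in it is hypothetical: the class name, the .bytes attribute, and the handful of methods are only illustrations of the proposal, not an existing API. It also copies bytearray and memoryview input, so it does not achieve the O(1) wrapping described above.

```python
class bstr:
    """Hypothetical latin-1 'byte string': string-like operations over bytes."""

    __slots__ = ("bytes",)

    def __init__(self, data=b""):
        if isinstance(data, bstr):
            data = data.bytes
        elif isinstance(data, str):
            # Encode text to match; raises UnicodeEncodeError above U+00FF.
            data = data.encode("latin-1")
        # bytes() copies a bytearray or memoryview; a real O(1) wrapper
        # would keep a reference to the underlying buffer instead.
        self.bytes = bytes(data)

    def __add__(self, other):
        return bstr(self.bytes + bstr(other).bytes)

    def __radd__(self, other):
        return bstr(bstr(other).bytes + self.bytes)

    def __eq__(self, other):
        if isinstance(other, (bstr, str, bytes, bytearray)):
            return self.bytes == bstr(other).bytes
        return NotImplemented

    def __hash__(self):
        return hash(self.bytes)

    def startswith(self, prefix):
        return self.bytes.startswith(bstr(prefix).bytes)

    def __str__(self):
        return self.bytes.decode("latin-1")

    def __repr__(self):
        return "bstr(%r)" % (self.bytes,)


# Combining with a str yields another bstr; .bytes is ready for byte I/O.
path = bstr(b"/app") + "/subdir"
```

A fuller version would need the rest of the str method set (split, join, find, slicing, and so on), which is where most of the design work would lie.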
OTOH, maybe there should just be a bytestrings module with
bytestrings.ascii and bytestrings.latin1, and between the two that
should cover the network protocol needs quite well.
Actually, if the Python 3 str() constructor could do O(1) conversion
for the latin-1 case (i.e., just wrapped the underlying bytes), I
would just put "bstr = lambda x: str(x, 'latin-1')" at the top of my
programs and have roughly the same effect.
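
As a quick check of why latin-1 in particular suits this role, the sketch below runs every byte value through that one-liner. It only demonstrates the semantics (byte N becomes code point N and back); the conversion itself still copies the data, so it says nothing about the O(1) wish.

```python
bstr = lambda x: str(x, "latin-1")   # the one-liner from the text

all_bytes = bytes(range(256))
text = bstr(all_bytes)

# latin-1 maps byte N to code point N, so the round trip is lossless.
assert [ord(c) for c in text] == list(range(256))
assert text.encode("latin-1") == all_bytes

# ASCII cannot play the same role for arbitrary bytes.
try:
    str(b"\x80", "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```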
This idea is still a bit half-baked, but a more baked version might
be just the ticket for porting stuff that used str to work with bytes
in 2.x, if only because writing, e.g.:
    newurl = bstr(urljoin(bstr(base), 'subdir'))
seems so much saner than writing *this* everywhere:
    newurl = str(urljoin(str(base, 'latin-1'), 'subdir'), 'latin-1')
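
For reference, a runnable spelling of that round trip might look like the following, assuming base arrives as bytes (the URL here is a made-up example). Since str() with an encoding argument accepts only bytes-like input, the outbound conversion is written with .encode().

```python
from urllib.parse import urljoin

base = b"http://example.com/app/"   # hypothetical value, as read off the wire

# Decode on the way in, encode on the way out.
newurl = urljoin(base.decode("latin-1"), "subdir").encode("latin-1")
```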
It is perhaps a bit late to propose this idea, since ideally we would
also want to use it in 2.x to aid porting. But I'm curious if any
other people here experiencing byte/unicode woes in relation to
network protocols would find this a solution to their chief
frustration (i.e., that the stdlib now often insists on strings
where bytes were effectively usable before, so one must do
conversions both coming and going).