[Python-Dev] email package status in 3.X

Mon Jun 21 14:20:13 CEST 2010

On Mon, Jun 21, 2010 at 11:58 AM, P.J. Eby <pje at telecommunity.com> wrote:
> At 08:08 AM 6/21/2010 +1000, Nick Coghlan wrote:
>>
>> Perhaps if people could identify which specific string methods are
>> causing problems?
>
> __getitem__(int) returns an integer rather than a bytestring, so anything
> that manipulates individual characters can't be given bytes and have it
> work.

It can if you use length one slices rather than simple indexing.
Depending on the details, such algorithms may still fail for
multi-byte codecs though.

> That was one of the key differences I had in mind for a bstr type, apart
> from  designing it to coerce normal strings to bstrs in cross-type
> operations, and to allow O(1) "conversion" to/from bytes.

Erk, that just sounds like a recipe for recreating the problems 2.x
has in a new form.

> Another randomly chosen byte/string incompatibility (Python 3.1; I don't
> have 3.2 handy at the moment):
>
>>>> os.path.join(b'x','y')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "c:\Python31\lib\ntpath.py", line 161, in join
>    if b[:1] in seps:
> TypeError: Type str doesn't support the buffer API
>
>>>> os.path.join('x',b'y')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "c:\Python31\lib\ntpath.py", line 161, in join
>    if b[:1] in seps:
> TypeError: 'in <string>' requires string as left operand, not bytes
>
> Ironically, it seems to me that in trying to make the type distinction more
> rigid, Py3K fails in this area precisely because it is not a rigidly typed
> language in the Java or Haskell sense: i.e., os.path.join doesn't say, "I
> need two stringlike objects of the *same type*", not even in its docstring.

I believe it actually needs the objects to be compatible with the type
of os.sep, rather than just with each other (i.e. the type
restrictions on os.path.join are the same as those on os.sep.join,
even though the join algorithm itself is slightly different). This
restriction should be mentioned in the Py3k docstring and docs for
os.path.join - if it isn't, that would be a doc bug.

> At least in Java, you would either implement a "path" type with coercions
> from bytes and strings, or you'd have a class with overloaded methods for
> handling join operations on bytes and strings, respectively, thereby
> avoiding this whole mess.
>
> (Alas, this little example on the 'in' operator also shows that my bstr
> effort would probably fail anyway, because there's no '__rcontains__'
> (__lcontains__?) to allow it to override the str type's __contains__.)

OK, these examples convince me that the incompatibility problem is
real. However, I don't think a bstr type can solve them even without
the __rcontains__ problem - it would just recreate the pain that we
already have in the 2.x world.

Something that may make sense to ease the porting process is for some
of these "on the boundary" I/O related string manipulation functions
(such as os.path.join) to grow "encoding" keyword-only arguments. The
recommended approach would be to provide all strings, but bytes could
also be accepted if an encoding was specified. (If you want to mix
encodings - tough, do the decoding yourself).

For the idea of avoiding excess copying of bytes through multiple
encoding/decoding calls... isn't that meant to be handled at an
architectural level (i.e. decode once on the way in, encode once on
the way out)? Optimising the single-byte codec case by minimising data
copying (possibly through creative use of PEP 3118) may be something
that we want to look at eventually, but it strikes me as something of
a premature optimisation at this point in time (i.e. the old adage
"first get it working, then get it working fast").

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia