Re: [Python-Dev] email package status in 3.X

June 21, 2010

Barry Warsaw wrote:
...
On Jun 21, 2010, at 12:34 PM, Toshio Kuratomi wrote:
...
I like the idea of having encoding information carried with the data.
I don't think that an ebytes type that can *optionally* have an encoding
attribute makes the situation less confusing, though.
Agreed.  I think the attribute should always be there, but there probably
needs to be a magic value (perhaps None) that indicates an unknown, manual,
garbage, error, broken encoding.
Examples: you read bytes off a socket and don't know what the encoding is; you
concatenate two ebytes that have incompatible encodings.
Such extra information tends to be lost whenever you pass the
bytes data through a C level API or some other function that
doesn't know about the special nature of those objects, treating
them just like any bytes object.

It may sound nice in theory, but in practice it doesn't work out.
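
To illustrate how the extra information gets lost, here is a minimal
sketch. The EBytes class below is hypothetical: it is not the ebytes type
proposed in this thread, just a plain bytes subclass with an encoding
attribute, written for Python 3.

```python
class EBytes(bytes):
    """Hypothetical bytes subclass that carries an encoding label."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding  # None stands for "unknown"
        return self


tagged = EBytes(b"caf\xc3\xa9", encoding="utf-8")

# Ordinary bytes operations know nothing about the subclass and
# return plain bytes objects, so the label silently disappears.
joined = tagged + b" au lait"
sliced = tagged[:3]

print(type(tagged).__name__, tagged.encoding)
print(type(joined).__name__, getattr(joined, "encoding", None))
print(type(sliced).__name__, getattr(sliced, "encoding", None))
```

Keeping the label alive would mean overriding every such operation in the
subclass, and code at the C level would still hand back plain bytes.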

Besides, if you do know the encoding, you can easily carry the
data around in a Unicode str object.

The problem lies elsewhere: What to do with a piece of text for
which you don't know the encoding and how to combine that piece
of text with other pieces of text for which you do know the
encoding.

There are a few options at hand:

 * you keep working on the bytes data and only convert things
   to Unicode when needed and where the encoding is known

 * you decode the bytes data for which you don't have the encoding
   information into some special Unicode form (eg. using the
   surrogateescape error handler) and hope that when the time
   comes to encode the Unicode data back into bytes, the codec
   supports reversing the conversion

 * you manage the data as a list of Unicode str and
   bytes objects and don't even try to be clever about encodings
   of text with unknown encoding

Which of these options fits best depends a lot on the use case.
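
The second and third options can be sketched as follows in Python 3. The
sample data and the variable names are made up for illustration; the
sketch assumes the same codec and error handler are used in both
directions.

```python
# Bytes from an unknown source; they happen to be Latin-1, not UTF-8.
raw = b"caf\xe9 au lait"

# Option 2: decode anyway. Bytes that are not valid UTF-8 are mapped
# to lone surrogates instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same codec and handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the lone surrogate cannot be encoded.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Option 3: keep known text and unknown bytes side by side and only
# turn the str parts into bytes at output time.
parts = ["Order: ", raw]
serialized = b"".join(
    p.encode("utf-8") if isinstance(p, str) else p for p in parts
)
```

The round trip in option 2 only works if whatever encodes the data later
also uses surrogateescape, which is the "hope" mentioned above.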
...
...
To me the biggest
problem with python-2.x's unicode/bytes handling was not that it threw
exceptions but that it didn't always throw exceptions.  You might test this
in python2::
   t = u'cafe'
   function(t)
And say, ah my code works.  Then a user gives it this::
   t = u'café'
   function(t)
And get a unicode error because the function only works with unicode in the
ascii range.
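
The same data-dependent failure can be imitated in Python 3. This is only
a sketch: function() is a hypothetical stand-in, and the explicit
.encode("ascii") plays the role of the conversion that Python 2 performed
implicitly when unicode and str objects were mixed.

```python
def function(t):
    """Hypothetical stand-in for code that quietly assumes ASCII."""
    # Explicit here; Python 2 did the equivalent behind the scenes.
    return b"value: " + t.encode("ascii")


# Passes with the data used during testing ...
ok = function("cafe")

# ... and fails only once non-ASCII data shows up.
try:
    function("caf\u00e9")
    failed = False
except UnicodeEncodeError:
    failed = True

print(ok, failed)
```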
That's an excellent point.
Here's a little-known fact: by changing the Python 2 default
encoding to 'undefined' (yes, that's a real codec!), you can disable
all automatic string coercion in Python 2.
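
The 'undefined' codec still ships with Python 3, so its behaviour can be
checked directly. The sketch below only shows that any attempt to use the
codec raises an error; it does not change a default encoding. In Python 2
the default was set at startup through sys.setdefaultencoding(), usually
from site.py or sitecustomize.py. The helper raises_unicode_error() is
made up for this example.

```python
import codecs

# The codec is registered like any other one.
info = codecs.lookup("undefined")


def raises_unicode_error(action):
    """Return True if calling action() raises UnicodeError."""
    try:
        action()
    except UnicodeError:
        return True
    return False


# Every conversion through the codec fails, even for pure ASCII data.
encode_fails = raises_unicode_error(lambda: "cafe".encode("undefined"))
decode_fails = raises_unicode_error(lambda: b"cafe".decode("undefined"))

print(info.name, encode_fails, decode_fails)
```

With this codec as the default encoding, every implicit conversion fails
right away instead of only failing for non-ASCII data.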

-- 
Marc-Andre Lemburg
eGenix.com

