
Barry Warsaw wrote:
On Jun 21, 2010, at 12:34 PM, Toshio Kuratomi wrote:
I like the idea of having encoding information carried with the data. I don't think that an ebytes type that can *optionally* have an encoding attribute makes the situation less confusing, though.
Agreed. I think the attribute should always be there, but there probably needs to be a magic value (perhaps None) that indicates an unknown, manual, garbage, error, broken encoding.
Examples: you read bytes off a socket and don't know what the encoding is; you concatenate two ebytes that have incompatible encodings.
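
To make the two examples concrete, here is a minimal sketch of what such a type might look like. Nothing here is an existing API: the class name EBytes, its attributes, and the rule that any mismatch degrades to None are all assumptions made for illustration.

```python
import codecs


class EBytes:
    """Hypothetical 'ebytes': bytes plus an always-present encoding attribute."""

    def __init__(self, data, encoding=None):
        self.data = bytes(data)
        # None is the "magic value" meaning unknown or broken encoding.
        # Known encodings are normalized so that aliases compare equal.
        self.encoding = None if encoding is None else codecs.lookup(encoding).name

    def __add__(self, other):
        if not isinstance(other, EBytes):
            return NotImplemented
        # Assumption: mixing different (or unknown) encodings gives "unknown".
        if self.encoding is not None and self.encoding == other.encoding:
            encoding = self.encoding
        else:
            encoding = None
        return EBytes(self.data + other.data, encoding)

    def decode(self):
        if self.encoding is None:
            raise ValueError("encoding is unknown; cannot decode")
        return self.data.decode(self.encoding)


from_socket = EBytes(b"caf\xe9")                      # read off a socket
latin = EBytes("café".encode("latin-1"), "latin-1")
utf8 = EBytes("café".encode("utf-8"), "utf-8")

mixed = latin + utf8        # incompatible encodings
same = latin + latin        # compatible encodings
```

In this sketch mixed.encoding ends up as None, while same keeps its encoding and can still be decoded.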
Such extra information tends to be lost whenever you pass the bytes data through a C-level API or some other function that doesn't know about the special nature of those objects and treats them just like any other bytes object. It may sound nice in theory, but in practice it doesn't work out.

Besides, if you do know the encoding, you can easily carry the data around in a Unicode str object. The problem lies elsewhere: what to do with a piece of text for which you don't know the encoding, and how to combine that piece of text with other pieces of text for which you do know the encoding.

There are a few options at hand:

* you keep working on the bytes data and only convert things to Unicode when needed and where the encoding is known

* you decode the bytes data for which you don't have the encoding information into some special Unicode form (e.g. using the surrogateescape error handler) and hope that when the time comes to encode the Unicode data back into bytes, the codec supports reversing the conversion

* you manage the data as a list of Unicode str and bytes objects and don't even try to be clever about the encoding of text whose encoding is unknown

Which of these options fits best depends a lot on the use case.
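
As a rough illustration of the second and third options, the sketch below uses Python 3's surrogateescape error handler with the UTF-8 codec. The sample bytes and variable names are made up for the example.

```python
# Bytes of unknown encoding; they are not valid UTF-8.
raw = b"caf\xe9 \xff"

# Option 2: decode into a "special" Unicode form. Each undecodable byte
# becomes a lone surrogate in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and error handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogates, which is the "hope the codec
# supports reversing the conversion" caveat.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Option 3: keep a list of str and bytes pieces and leave the bytes alone.
parts = ["Name: ", raw]
known_text = "".join(p for p in parts if isinstance(p, str))
```

The round trip only works when the same codec and error handler are used in both directions.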
To me the biggest problem with python-2.x's unicode/bytes handling was not that it threw exceptions but that it didn't always throw exceptions. You might test this in python2::

    t = u'cafe'
    function(t)

And say, ah my code works. Then a user gives it this::

    t = u'café'
    function(t)

And get a unicode error because the function only works with unicode in the ascii range.
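
Python 3 no longer performs this coercion, so the failure cannot be reproduced there directly. The sketch below spells out the ASCII encode that Python 2 applied implicitly when a unicode object was used where a byte string was expected. The body of function is an assumption, since the original does not show it.

```python
def function(t):
    # Stand-in for Python 2's implicit unicode -> str coercion, which used
    # the default encoding (ASCII unless reconfigured).
    return t.encode("ascii")


ok = function("cafe")          # pure ASCII input passes the test

try:
    function("café")           # non-ASCII input from a user
    error = None
except UnicodeEncodeError as exc:
    error = exc                # the failure only shows up with this data
```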
That's an excellent point.
Here's a little-known fact: by changing the Python2 default encoding to 'undefined' (yes, that's a real codec!), you can disable all automatic string coercion in Python2.

-- 
Marc-Andre Lemburg
eGenix.com (Jun 21 2010)
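
The 'undefined' codec mentioned above still ships with Python 3, where its behaviour can be checked directly: every attempt to encode or decode with it raises UnicodeError. In Python 2 the default encoding would typically be changed with sys.setdefaultencoding('undefined') from a sitecustomize.py file; that part is stated from memory and is not exercised by this Python 3 sketch.

```python
import codecs

# The codec can be looked up like any other.
info = codecs.lookup("undefined")


def try_codec(value):
    """Return the exception raised when converting value with 'undefined'."""
    try:
        if isinstance(value, str):
            value.encode("undefined")
        else:
            value.decode("undefined")
    except UnicodeError as exc:
        return exc
    return None


encode_error = try_codec("cafe")     # even pure ASCII text is rejected
decode_error = try_codec(b"cafe")
```

Because the codec rejects everything, using it as the default encoding turns every implicit conversion into an immediate error.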