[Python-Dev] bytes / unicode

Michael Foord fuzzyman at voidspace.org.uk
Wed Jun 23 01:18:29 CEST 2010

On 22/06/2010 19:07, James Y Knight wrote:
> On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote:
>> Similarly I'd expect (from experience) a programmer using Python 
>> to want to take the same approach, sticking with unencoded data in 
>> nearly all situations.
> Yeah. This is a real issue I have with the direction Python3 went: it 
> pushes you into decoding everything to unicode early,

Well, both .NET and Java take this approach as well. I wonder how they 
cope with the particular issues that have been mentioned for web 
applications - both platforms are used extensively for web apps.

Having used IronPython, which has .NET unicode strings (although it does 
a lot of magic to *allow* you to store binary data in strings for 
compatibility with CPython),  I have to say that this approach makes a 
lot of programming *so* much more pleasant.

We did a lot of I/O (can you do useful programming without I/O?) 
including working with databases, but I didn't work *much* with wire 
protocols (fetching a fair bit of data from the web though now I think 
about it). I think wire protocols can present particular problems; 
sometimes, it seems, they have mixed encodings in the same data. Where 
you don't have these problems, keeping bytes data and Unicode text data 
separate and encoding / decoding at the boundaries is really much more 
sane and pleasant.
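
A minimal sketch of what "encoding / decoding at the boundaries" can look 
like in Python 3. The helper names (read_text, write_text) and the sample 
data are hypothetical, chosen only for illustration, and UTF-8 is assumed 
as the wire encoding:

```python
# Illustrative only: read_text / write_text are hypothetical helpers,
# and UTF-8 is an assumed encoding for the incoming and outgoing bytes.

def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode once, at the input boundary."""
    return raw.decode(encoding)

def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Encode once, at the output boundary."""
    return text.encode(encoding)

incoming = "café".encode("utf-8")   # bytes as they might arrive from I/O
text = read_text(incoming)          # from here on, only str is handled
shouted = text.upper()              # text operations work on characters
outgoing = write_text(shouted)      # back to bytes only when leaving

print(len(incoming), len(text))     # byte count and character count differ
```

Everything between the two helper calls works on str only, so the 
encoding question is asked once on the way in and once on the way out.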

It would be a real shame if we decided that the way forward for Python 3 
was to try and move closer to how bytes/text was handled in Python 2.

All the best,


> even when you don't care -- all you really wanted to do is pass it 
> from one API to another, with some well-defined transformations, which 
> don't actually depend on it having been decoded properly. (For 
> example, extracting the path from the URL and attempting to open it as 
> a file on the filesystem.)
> This means that Python3 programs can become *more* fragile in the face 
> of random data you encounter out in the real world, rather than less 
> fragile, which was the goal of the whole exercise.
> The surrogateescape method is a nice workaround for this, but I can't 
> help thinking that it might've been better to just treat stuff as 
> possibly-invalid-but-probably-utf8 byte-strings from input, through 
> processing, to output. It seems kinda too late for that, though: next 
> time someone designs a language, they can try that. :)
> James
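
For reference, a small sketch of the surrogateescape error handler 
mentioned above, using the standard bytes.decode / str.encode calls. The 
sample path is made up; it contains a Latin-1 byte that is not valid 
UTF-8:

```python
# Illustrative only: the path below is invented sample data.
raw = b"/files/caf\xe9.txt"   # 0xE9 is Latin-1 "é", invalid as UTF-8 here

# A strict decode rejects the data outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate code point (0xE9 becomes U+DCE9), so nothing is lost.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_tripped == raw)
```

The resulting str can be passed between APIs and written back out 
unchanged, but it is not valid Unicode text, so encoding it without the 
same error handler raises UnicodeEncodeError.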


