[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Mon Oct 3 16:49:44 CEST 2005

Martin Blais wrote:
> Hi.
> 
> Like a lot of people (or so I hear in the blogosphere...), I've been
> experiencing some friction in my code with unicode conversion
> problems.  Even when being super extra careful about whether my
> variables contain str or unicode objects, there is always some
> case or oversight where something unexpected happens which results in
> a conversion which triggers a decode error.  str.join() of a list of
> strs, where one unicode object appears unexpectedly, and voila!
> exception galore.  Sometimes the problem shows up late because your
> test code doesn't always contain accented characters.  I'm sure many
> of you experienced that or some variant at some point.
> 
> I came to realize recently that this problem shares a strong similarity
> with the problem of implicit type conversions in C++, or at least it
> feels the same:  Stuff just happens implicitly, and it's hard to track
> down where and when it happens by just looking at the code.  Part of
> the problem is that the unicode object acts a lot like a str, which is
> convenient, but...

I agree.  I think it was a mistake to implicitly convert mixed string
expressions to unicode.
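
To make the failure mode concrete, here is a minimal sketch.  It is written
for Python 3, where bytes and text never mix implicitly, so the hypothetical
helper py2_style_join() only emulates the behaviour under discussion: once a
text object shows up in the list, every byte string is silently decoded as
ASCII.  The helper name and the sample data are assumptions for illustration,
not code from any real project.

```python
def py2_style_join(sep, parts):
    """Emulate joining a mixed list with implicit conversion (hypothetical).

    If every item is bytes, the result is bytes.  As soon as one item is
    text, all byte strings are implicitly decoded as ASCII, which is where
    the surprise UnicodeDecodeError comes from.
    """
    if all(isinstance(p, bytes) for p in parts):
        return sep.join(parts)
    decoded = [p.decode("ascii") if isinstance(p, bytes) else p for p in parts]
    return sep.decode("ascii").join(decoded)


# Test data without accents: the implicit conversion goes unnoticed.
ok = py2_style_join(b", ", [b"Martin", "Jim"])

# Real data with an accented character encoded as UTF-8: it fails late,
# far away from the place where the text object entered the list.
try:
    py2_style_join(b", ", ["Martin", "Andr\u00e9".encode("utf-8")])
    failed = False
except UnicodeDecodeError:
    failed = True
```

The first call succeeds only because the bytes happen to be pure ASCII, which
is why tests without accented characters tend to miss the problem.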

> What if we could completely disable the implicit conversions between
> unicode and str?  In other words, if you would ALWAYS be forced to
> call either .encode() or .decode() to convert between one and the
> other... wouldn't that help a lot in dealing with that issue?

Perhaps.

> How hard would that be to implement?

Not hard. We considered doing it for Zope 3, but ...

> Would it break a lot of code?

Yes.

> Would some people want that?

No, I wouldn't want lots of code to break. ;)

> (I know I would, at least for some of my
> code.)  It seems to me that this would make the code more explicit and
> force the programmer to become more aware of those conversions.  Any
> opinions welcome.

I think it's too late to change this.  I wish it had been done
differently.  (OTOH, I'm very happy we have Unicode support, so
I'm not really complaining. :)

I'll note that this hasn't been that much of a problem for us in Zope.
We follow the strategy that Antoine described:

Antoine Pitrou wrote:
...
> A good rule of thumb is to convert to unicode everything that is
> semantically textual, and to only use str for what is to be semantically
> treated as a string of bytes (network packets, identifiers...). This is
> also, AFAIU, the semantic model which is favoured for a hypothetical
> future version of Python.

This approach has worked pretty well for us.  Still, when there is a problem,
it's a real pain to debug because the error occurs too late, as you point
out.
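
A rough sketch of that strategy, with type checks added so that a mistake
fails at the boundary rather than somewhere downstream.  It is written for
Python 3 (text is str, byte strings are bytes); the helper names to_text(),
to_bytes() and greet() are hypothetical, and UTF-8 is an assumed wire
encoding, not something Zope prescribes.

```python
def to_text(raw, encoding="utf-8"):
    """Decode bytes arriving from outside (hypothetical input boundary)."""
    if not isinstance(raw, bytes):
        raise TypeError("expected bytes at the input boundary, got %s"
                        % type(raw).__name__)
    return raw.decode(encoding)


def to_bytes(text, encoding="utf-8"):
    """Encode text leaving the program (hypothetical output boundary)."""
    if not isinstance(text, str):
        raise TypeError("expected text at the output boundary, got %s"
                        % type(text).__name__)
    return text.encode(encoding)


def greet(name):
    """Internal code only ever handles text and rejects bytes right away."""
    if not isinstance(name, str):
        raise TypeError("greet() expects text, got %s" % type(name).__name__)
    return "Hello, " + name


# Decode once on the way in, work with text, encode once on the way out.
wire_in = "Andr\u00e9".encode("utf-8")
wire_out = to_bytes(greet(to_text(wire_in)))
```

The explicit isinstance() checks are the design choice here: they trade a
little noise for an error that points at the call site that passed the wrong
type.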

Jim

-- 
Jim Fulton           mailto:jim at zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org