[Python-Dev] Divorcing str and unicode (no more implicit conversions).
Jim Fulton
jim at zope.com
Mon Oct 3 16:49:44 CEST 2005
Martin Blais wrote:
> Hi.
>
> Like a lot of people (or so I hear in the blogosphere...), I've been
> experiencing some friction in my code with unicode conversion
> problems. Even when being super extra careful with the types of str's
> or unicode objects that my variables can contain, there is always some
> case or oversight where something unexpected happens which results in
> a conversion which triggers a decode error. str.join() of a list of
> strs, where one unicode object appears unexpectedly, and voila!
> exception galore. Sometimes the problem shows up late because your
> test code doesn't always contain accented characters. I'm sure many
> of you experienced that or some variant at some point.
>
> I came to realize recently that this problem shares strong similarity
> with the problem of implicit type conversions in C++, or at least it
> feels the same: Stuff just happens implicitly, and it's hard to track
> down where and when it happens by just looking at the code. Part of
> the problem is that the unicode object acts a lot like a str, which is
> convenient, but...
I agree. I think it was a mistake to implicitly convert mixed string
expressions to unicode.
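
To make the failure mode concrete, here is a minimal sketch that emulates the
implicit conversion. It is written in Python 3 syntax, with bytes standing in
for str and str standing in for unicode. The helper name implicit_join is
hypothetical, and the implicit ASCII decode is modeled by hand; this is not
the interpreter's actual code path.

```python
def implicit_join(sep, parts):
    """Emulate mixed-type joining with an implicit ASCII decode.

    Hypothetical helper: if any item is text, every byte string is
    silently decoded as ASCII before joining.
    """
    parts = list(parts)
    if any(isinstance(p, str) for p in [sep] + parts):
        # The implicit step: byte strings are decoded as ASCII.
        def decode(p):
            return p.decode("ascii") if isinstance(p, bytes) else p
        return decode(sep).join(decode(p) for p in parts)
    # All byte strings: no conversion happens at all.
    return sep.join(parts)


# ASCII-only test data passes, so the problem goes unnoticed.
print(implicit_join(b" ", [b"cafe", "menu"]))

# An accented byte string fails, far from where the text object
# first entered the list.
try:
    implicit_join(b" ", [b"caf\xc3\xa9", "menu"])
except UnicodeDecodeError as exc:
    print("late failure:", exc)
```

The point of the sketch is that the same call succeeds or fails depending only
on the data, which is why test suites without accented characters miss it.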
> What if we could completely disable the implicit conversions between
> unicode and str? In other words, if you were ALWAYS forced to
> call either .encode() or .decode() to convert between one and the
> other... wouldn't that help a lot in dealing with that issue?
Perhaps.
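
As a rough sketch of what explicit-only conversion could look like, the
following join refuses mixed types instead of converting. It uses Python 3
syntax, with bytes standing in for str and str for unicode; strict_join is a
hypothetical name, not an existing API.

```python
def strict_join(sep, parts):
    """Join only when the separator and all parts share one type.

    Hypothetical helper: mixed byte strings and text raise TypeError
    immediately instead of being converted implicitly.
    """
    parts = list(parts)
    kinds = {type(sep)} | {type(p) for p in parts}
    if len(kinds) > 1:
        names = ", ".join(sorted(k.__name__ for k in kinds))
        raise TypeError("refusing to mix types: " + names)
    return sep.join(parts)


# The caller must convert explicitly and name the encoding.
raw = b"caf\xc3\xa9"
print(strict_join(" ", [raw.decode("utf-8"), "menu"]))

# Mixing fails at the call site, whatever the data contains.
try:
    strict_join(" ", [b"cafe", "menu"])
except TypeError as exc:
    print("early failure:", exc)
```

Here the error no longer depends on whether the data happens to contain
non-ASCII characters, so ASCII-only tests would catch it too.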
> How hard would that be to implement?
Not hard. We considered doing it for Zope 3, but ...
> Would it break a lot of code?
Yes.
> Would some people want that?
No, I wouldn't want lots of code to break. ;)
> (I know I would, at least for some of my
> code.) It seems to me that this would make the code more explicit and
> force the programmer to become more aware of those conversions. Any
> opinions welcome.
I think it's too late to change this. I wish it had been done
differently. (OTOH, I'm very happy we have Unicode support, so
I'm not really complaining. :)
I'll note that this hasn't been that much of a problem for us in Zope.
We follow the strategy:
Antoine Pitrou wrote:
...
> A good rule of thumb is to convert to unicode everything that is
> semantically textual, and to only use str for what is to be semantically
> treated as a string of bytes (network packets, identifiers...). This is
> also, AFAIU, the semantic model which is favoured for a hypothetical
> future version of Python.
This approach has worked pretty well for us. Still, when there is a problem,
it's a real pain to debug because the error occurs too late, as you point
out.
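
The rule of thumb quoted above can be sketched roughly as follows: decode
bytes where they enter the program, work with text internally, and encode
only on the way out. This uses Python 3 syntax (bytes for byte strings, str
for text). The function names and the UTF-8 default are assumptions made for
illustration, not Zope code.

```python
def read_name(raw, encoding="utf-8"):
    """Input boundary: turn received bytes into text right away."""
    return raw.decode(encoding)


def greet(name):
    """Internal logic: handles text only, never byte strings."""
    return "Bonjour, " + name + "!"


def write_reply(text, encoding="utf-8"):
    """Output boundary: encode text just before it leaves."""
    return text.encode(encoding)


packet = b"Andr\xc3\xa9"  # bytes as they might arrive from the network
reply = write_reply(greet(read_name(packet)))
print(reply)
```

With conversions confined to the two boundary functions, a decode error can
only come from read_name, which narrows down where to look when one occurs.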
Jim
--
Jim Fulton mailto:jim at zope.com Python Powered!
CTO (540) 361-1714 http://www.python.org
Zope Corporation http://www.zope.com http://www.zope.org