[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
James Y Knight
foom at fuhm.net
Tue Feb 14 19:36:26 CET 2006
On Feb 14, 2006, at 11:25 AM, Phillip J. Eby wrote:
> At 11:08 AM 2/14/2006 -0500, James Y Knight wrote:
>> I like it, it makes sense. Unicode strings are simply not allowed as
>> arguments to the byte constructor. Thinking about it, why would it be
>> otherwise? And if you're mixing str-strings and unicode-strings, that
>> means the str-strings you're sometimes giving are actually not byte
>> strings, but character strings anyhow, so you should be encoding
>> those too. bytes(s_or_U.encode('utf-8')) is a perfectly good
>> spelling.
> Actually, I think you mean:
>
>     if isinstance(s_or_U, str):
>         s_or_U = s_or_U.decode('utf-8')
>
>     b = bytes(s_or_U.encode('utf-8'))
>
> Or maybe:
>
>     if isinstance(s_or_U, unicode):
>         s_or_U = s_or_U.encode('utf-8')
>
>     b = bytes(s_or_U)
>
> Which is why I proposed that the boilerplate logic get moved *into*
> the bytes constructor. I think this use case is going to be common
> in today's Python, but in truth I'm not as sure what bytes() will
> get used *for* in today's Python. I'm probably overprojecting
> based on the need to use str objects now, but bytes aren't going to
> be a replacement for str for a good while anyway.
I most certainly *did not* mean that. If you are mixing together str
and unicode instances, the str instances _must be_ in the default
encoding (ascii). Otherwise, you are bound for failure anyhow, e.g.
''.join(['\x95', u'1']) raises UnicodeDecodeError. Str is used for two
things right now: 1) a byte string, and 2) a unicode string restricted
to 7-bit ASCII. These two uses are separate and you cannot mix them
without causing disaster.
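The failure mode can be sketched in present-day Python 3, with bytes standing in for a 2.x str and str standing in for a 2.x unicode. The helper py2_style_join is hypothetical. It only imitates the implicit ASCII decode that 2.x applies when the two types are mixed, and is not part of any proposal in this thread.

```python
def py2_style_join(parts):
    """Join mixed byte/character strings roughly as Python 2's ''.join did.

    Hypothetical helper: byte strings are decoded with the default
    encoding (ascii), as 2.x did implicitly, before joining.
    """
    decoded = []
    for part in parts:
        if isinstance(part, bytes):
            # Stand-in for the implicit str -> unicode coercion in 2.x.
            part = part.decode("ascii")
        decoded.append(part)
    return "".join(decoded)


# Use 2: a str holding only 7-bit ASCII mixes with unicode without trouble.
ok = py2_style_join([b"abc", "1"])

# Use 1: a str holding real binary data does not.
try:
    py2_style_join([b"\x95", "1"])
    failed = False
except UnicodeDecodeError:
    failed = True
```

The second call fails for the same reason the ''.join example does: the byte 0x95 has no meaning in ASCII, so the coercion cannot succeed.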
You've created an interface which can take either a UTF-8 byte string,
or a unicode character string. But that's wrong and can only cause
problems. It should take either an encoded byte string, or a unicode
character string. Not both. If it takes a unicode character string,
there are two ways of spelling that in current Python: a "str" object
with only ASCII in it, or a "unicode" object with arbitrary
characters in it. bytes(s_or_U.encode('utf-8')) works correctly with
both.
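Why that one spelling covers both cases can be sketched the same way in Python 3, again with bytes playing the 2.x str and str playing the 2.x unicode. The helper encode_utf8 is hypothetical and only imitates what s_or_U.encode('utf-8') does in 2.x, where a str is first decoded as ascii and then encoded.

```python
def encode_utf8(s_or_u):
    """Imitate Python 2's s_or_U.encode('utf-8') for either string type.

    Hypothetical helper: a bytes argument plays the role of a 2.x str,
    which 2.x first decodes with the default encoding (ascii).
    """
    if isinstance(s_or_u, bytes):
        s_or_u = s_or_u.decode("ascii")  # raises on non-ASCII bytes
    return s_or_u.encode("utf-8")


# An ASCII-only "str" and an arbitrary "unicode" both yield byte strings.
from_ascii_str = bytes(encode_utf8(b"abc"))
from_unicode = bytes(encode_utf8("caf\u00e9"))

# A "str" carrying already-encoded, non-ASCII data is rejected, not guessed at.
try:
    encode_utf8(b"\x95")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Under this reading, an already-encoded byte string never needs the encode step at all, which is the "not both" point: the caller knows which of the two kinds of value it holds.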
James