[Python-Dev] bytes / unicode
Guido van Rossum
guido at python.org
Thu Jun 24 17:41:14 CEST 2010
On Thu, Jun 24, 2010 at 8:25 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum <guido at python.org> wrote:
>> Also, IMO a polymorphic function should *not* accept *mixed*
>> bytes/text input -- join('x', b'y') should be rejected. But join('x',
>> 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me.
>
> A policy of allowing arguments to be either str or bytes, but not a
> mixture, actually avoids one of the more painful aspects of the 2.x
> "promote mixed operations to unicode" approach. Specifically, you
> either had to scan all the arguments up front to check for unicode, or
> else you had to stop what you were doing and start again with the
> unicode version if you encountered unicode partway through. Neither
> was particularly nice to implement.
Right. Polymorphic functions should *not* allow mixing text and bytes.
It's all text or all bytes.
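
A minimal sketch of that rule, using the join('x', 'y') example from above. The body is an illustrative assumption, not the implementation of any stdlib function:

```python
def join(a, b):
    """Join two path components that are both str or both bytes."""
    if isinstance(a, str) and isinstance(b, str):
        sep = "/"
    elif isinstance(a, bytes) and isinstance(b, bytes):
        sep = b"/"
    else:
        # Mixed text/bytes input is rejected outright; nothing is promoted.
        raise TypeError("join() arguments must be all str or all bytes")
    return a + sep + b
```

Because the type is checked once up front, there is no need to rescan the arguments or restart partway through.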
> As you noted elsewhere, literals and string methods are still likely
> to be a major sticking point with that approach - common operations
> like ''.join(seq) and b''.join(seq) aren't polymorphic, so functions
> that use them won't be polymorphic either. (It's only the str->unicode
> promotion behaviour in 2.x that works around this problem there).
>
> Would it be heretical to suggest that sum() be allowed to work on
> strings to at least eliminate ''.join() as something that breaks bytes
> processing? It already works for bytes, although it then fails with a
> confusing message for bytearray:
>
> >>> sum(b"a b c".split(), b'')
> b'abc'
>
> >>> sum(bytearray(b"a b c").split(), bytearray(b''))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: sum() can't sum bytes [use b''.join(seq) instead]
>
> >>> sum("a b c".split(), '')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: sum() can't sum strings [use ''.join(seq) instead]
I don't think we should abuse sum for this. A simple idiom to get the
*empty* string of a particular type is x[:0], so you could write
something like this to concatenate a list of strings or bytes:
xs[0][:0].join(xs). Note that if xs is empty we wouldn't know what to do
anyway, so this should be disallowed.
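
A rough sketch of that idiom, assuming xs is a list whose items are all str, all bytes, or all bytearray. The helper name concat is hypothetical, and the empty value is taken from the first element (xs[0][:0]), since slicing the list itself would only give an empty list:

```python
def concat(xs):
    """Concatenate a non-empty list of str, bytes or bytearray items."""
    if not xs:
        # With no items there is no way to tell which empty value to return.
        raise ValueError("concat() needs at least one item")
    empty = xs[0][:0]  # '' for str, b'' for bytes, bytearray(b'') for bytearray
    return empty.join(xs)
```

The join() call itself raises TypeError when a str list contains bytes or the reverse, so mixed input is still rejected.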
--
--Guido van Rossum (python.org/~guido)