[Python-Dev] bytes / unicode
P.J. Eby
pje at telecommunity.com
Mon Jun 21 20:17:47 CEST 2010
At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote:
>Perhaps there are more situations where a polymorphic API would be
>helpful. Such APIs are not always so easy to implement, because they
>have to be careful with literals or other constants (and even more so
>mutable state) used internally -- but it can be done, and there are
>plenty of examples in the stdlib.
What if we could use the time machine to make the APIs that *were*
polymorphic regain their previously-polymorphic status, without
needing to actually *change* any of the code of those functions?
That's what Barry's ebytes proposal would do, with appropriate
coercion rules. Passing ebytes into such a function would yield back
ebytes, even if the function used strings internally, as long as
those strings could be encoded back to bytes using the ebytes'
encoding. (Which would normally be the case, since stdlib constants
are almost always ASCII, and the main use cases for ebytes would
involve ASCII-superset encodings.)
>I'm still unclear on exactly what bstr is supposed to be, but it sounds
>a bit like one of the rejected proposals for having a single
>(Unicode-capable) str type that is implemented using different width
>encodings (Latin-1, UCS-2, UCS-4) underneath.
Not quite - as modified by Barry's proposal (which I like better than
mine) it'd be an object that just combines bytes with an attribute
indicating the underlying encoding. When it interacts with strings,
the strings are *encoded* to bytes, rather than upgrading the bytes to text.
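A minimal sketch of how such a type might behave (the class name and
this API are my own illustration of the idea, not the actual proposal):

```python
class ebytes(bytes):
    """Sketch: bytes carrying an encoding attribute.

    When an ebytes interacts with a str, the str is *encoded* down to
    bytes using the ebytes' encoding, rather than the bytes being
    decoded up to text.
    """
    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # str operand is encoded down; raises UnicodeEncodeError
            # right here if it can't be represented in our encoding.
            other = other.encode(self.encoding)
        return ebytes(bytes(self) + other, self.encoding)

greeting = ebytes("héllo".encode("latin-1"), "latin-1")
result = greeting + " world"   # the str operand is encoded to latin-1
```

Note that the result stays an ebytes with the same encoding, which is
what lets such values round-trip through string-using code.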
This is actually a big advantage for error-detection in any
application where you're working with data that *must* be encodable
in a specific encoding for output, as it allows you to catch errors
much *earlier* than you would if you only did the encoding at your
output boundary.
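The early-failure argument can be illustrated with plain str/bytes
(the helper name here is hypothetical): encoding where data *enters*
the program surfaces a bad value immediately, instead of deep in some
output routine much later.

```python
def accept_header_value(text: str, encoding: str = "ascii") -> bytes:
    # Encoding at the ingestion point raises right here if the text
    # can't be represented in the wire encoding.
    return text.encode(encoding)

accept_header_value("plain-ascii")            # fine
try:
    accept_header_value("smart “quotes”")     # fails now, not at send time
except UnicodeEncodeError as exc:
    print("rejected early:", exc.reason)
```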
Anyway, this would not be the normal bytes type or string type; it's
"bytes with an encoding". It's also more general than Unicode, in
the sense that it allows you to work with character sets that don't
really *have* a proper Unicode mapping.
One issue I remember from my "enterprise" days is some of the
Asian-language developers at NTT/Verio explaining to me that unicode
doesn't actually solve certain issues -- that there are use cases
where you really *do* need "bytes plus encoding" in order to properly
express something.  Unfortunately, I never quite wrapped my head
around the idea; I just remember it had something to do with the fact
that Unicode has single character codes that mean different things in
different languages, such that you were actually losing information
by converting to unicode, or something like that.  (Or maybe the
characters were expressed differently in certain encodings according
to what language they came from, so you couldn't roundtrip them
through unicode without losing information.  I think that's probably
what it was; maybe somebody here can chime in more on that point.)
Anyway, a type like this would need to have at least a bit of support
from the core language, because the str type would need to be able to
handle at least the __contains__ and %/.format() coercion cases,
since these functions don't have __r*__ equivalents that a
user-implemented type could provide... and strings don't have
anything like a '__coerce__' either.
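The missing-hook problem can be demonstrated today with a stand-in
class (EBytes here is hypothetical): reflected `__radd__` lets a user
type intercept `str + x`, but there is no reflected equivalent for
`__contains__` or %-formatting, so those always stay on str's terms.

```python
class EBytes:
    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    def __radd__(self, other):
        # str.__add__ returns NotImplemented for a non-str operand,
        # so Python falls back to our __radd__ -- interception works.
        if isinstance(other, str):
            return EBytes(other.encode(self.encoding) + self.data,
                          self.encoding)
        return NotImplemented

e = EBytes(b"world", "ascii")
combined = "hello " + e          # dispatches to EBytes.__radd__

# But membership has no reflected hook: `e in "hello"` goes straight
# to str.__contains__, which rejects non-str operands outright, and
# there is no __rcontains__ for EBytes to supply.
try:
    e in "hello world"
except TypeError:
    pass

# Likewise "%s" % e produces a plain str (via str(e)); EBytes has no
# hook by which the result could come back as an EBytes.
assert isinstance("%s" % e, str)
```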
If sufficient hooks existed, then an ebytes could be implemented
outside the stdlib, and still used within it.