[Python-Dev] bytes / unicode
P.J. Eby
pje at telecommunity.com
Mon Jun 21 20:17:47 CEST 2010
At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote:
>Perhaps there are more situations where a polymorphic API would be
>helpful. Such APIs are not always so easy to implement, because they
>have to be careful with literals or other constants (and even more so
>mutable state) used internally -- but it can be done, and there are
>plenty of examples in the stdlib.
What if we could use the time machine to make the APIs that *were*
polymorphic regain their previously-polymorphic status, without
needing to actually *change* any of the code of those functions?
That's what Barry's ebytes proposal would do, with appropriate
coercion rules. Passing ebytes into such a function would yield back
ebytes, even if the function used strings internally, as long as
those strings could be encoded back to bytes using the ebytes'
encoding. (Which would normally be the case, since stdlib constants
are almost always ASCII, and the main use cases for ebytes would
involve ASCII-superset encodings.)
>I'm still unclear on exactly what bstr is supposed to be, but it sounds
>a bit like one of the rejected proposals for having a single
>(Unicode-capable) str type that is implemented using different width
>encodings (Latin-1, UCS-2, UCS-4) underneath.
Not quite - as modified by Barry's proposal (which I like better than
mine) it'd be an object that just combines bytes with an attribute
indicating the underlying encoding. When it interacts with strings,
the strings are *encoded* to bytes, rather than upgrading the bytes to text.
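A minimal sketch of how such a type might behave (the class name and
this API are my own illustration of the idea, not the actual proposal):

```python
class ebytes(bytes):
    """Sketch: bytes carrying an encoding attribute.

    When an ebytes interacts with a str, the str is *encoded* down to
    bytes using the ebytes' encoding, rather than the bytes being
    decoded up to text.
    """
    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # str operand is encoded down; raises UnicodeEncodeError
            # right here if it can't be represented in our encoding.
            other = other.encode(self.encoding)
        return ebytes(bytes(self) + other, self.encoding)

greeting = ebytes("héllo".encode("latin-1"), "latin-1")
result = greeting + " world"   # the str operand is encoded to latin-1
```

Note that the result stays an ebytes with the same encoding, which is
what lets such values round-trip through string-using code.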
This is actually a big advantage for error-detection in any
application where you're working with data that *must* be encodable
in a specific encoding for output, as it allows you to catch errors
much *earlier* than you would if you only did the encoding at your
output boundary.
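The early-failure argument can be illustrated with plain str/bytes
(the helper name here is hypothetical): encoding where data *enters*
the program surfaces a bad value immediately, instead of deep in some
output routine much later.

```python
def accept_header_value(text: str, encoding: str = "ascii") -> bytes:
    # Encoding at the ingestion point raises right here if the text
    # can't be represented in the wire encoding.
    return text.encode(encoding)

accept_header_value("plain-ascii")            # fine
try:
    accept_header_value("smart “quotes”")     # fails now, not at send time
except UnicodeEncodeError as exc:
    print("rejected early:", exc.reason)
```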
Anyway, this would not be the normal bytes type or string type; it's
"bytes with an encoding". It's also more general than Unicode, in
the sense that it allows you to work with character sets that don't
really *have* a proper Unicode mapping.
One issue I remember from my "enterprise" days is some of the
Asian-language developers at NTT/Verio explaining to me that unicode
doesn't actually solve certain issues -- that there are use cases
where you really *do* need "bytes plus encoding" in order to properly
express something.  Unfortunately, I never quite wrapped my head
around the idea; I just remember it had something to do with the fact
that Unicode has single character codes that mean different things in
different languages, such that you were actually losing information
by converting to unicode, or something like that.  (Or maybe the
characters were expressed differently in certain encodings according
to what language they came from, so you couldn't roundtrip them
through unicode without losing information.  I think that's probably
what it was; maybe somebody here can chime in more on that point.)
Anyway, a type like this would need to have at least a bit of support
from the core language, because the str type would need to be able to
handle at least the __contains__ and %/.format() coercion cases,
since these functions don't have __r*__ equivalents that a
user-implemented type could provide... and strings don't have
anything like a '__coerce__' either.
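The missing-hook problem can be demonstrated today with a stand-in
class (EBytes here is hypothetical): reflected `__radd__` lets a user
type intercept `str + x`, but there is no reflected equivalent for
`__contains__` or %-formatting, so those always stay on str's terms.

```python
class EBytes:
    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    def __radd__(self, other):
        # str.__add__ returns NotImplemented for a non-str operand,
        # so Python falls back to our __radd__ -- interception works.
        if isinstance(other, str):
            return EBytes(other.encode(self.encoding) + self.data,
                          self.encoding)
        return NotImplemented

e = EBytes(b"world", "ascii")
combined = "hello " + e          # dispatches to EBytes.__radd__

# But membership has no reflected hook: `e in "hello"` goes straight
# to str.__contains__, which rejects non-str operands outright, and
# there is no __rcontains__ for EBytes to supply.
try:
    e in "hello world"
except TypeError:
    pass

# Likewise "%s" % e produces a plain str (via str(e)); EBytes has no
# hook by which the result could come back as an EBytes.
assert isinstance("%s" % e, str)
```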
If sufficient hooks existed, then an ebytes could be implemented
outside the stdlib, and still used within it.