Stephen J. Turnbull
stephen at xemacs.org
Sun Feb 19 18:26:39 CET 2006
>>>>> "Bob" == Bob Ippolito <bob at redivi.com> writes:
Bob> On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:
>> But you aren't always getting *unicode* text from the decoding
>> of bytes, and you may be encoding bytes *to* bytes:
Please note that I presumed that you can indeed assume that decoding
of bytes always results in unicode, and encoding of unicode always
results in bytes. I believe Guido made the proposal relying on that
assumption too. The constructor notation makes no sense for making an
object of the same type as the original unless it's a copy constructor.
You could argue that the base64 language is indeed a different
language from the bytes language, and I'd agree. But since there's no
way in Python to determine whether a string that conforms to base64 is
supposed to be base64 or bytes, it would be a very bad idea to
interpret the distinction as one of type.
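The ambiguity described above can be illustrated with the standard `base64` module. This is only a sketch, written for Python 3 (where `bytes` is the byte-string type); the variable names are illustrative and do not come from the thread.

```python
import base64

raw = b"hello"
once = base64.b64encode(raw)    # base64 text for b"hello"
twice = base64.b64encode(once)  # base64 applied to base64 text

# All three values share one type, so the type cannot record how many
# layers of base64 have been applied.
assert type(raw) is type(once) is type(twice) is bytes

# `once` is valid input in both directions: it can be decoded back to
# `raw`, or encoded again to give `twice`.
assert base64.b64decode(once) == raw
assert base64.b64decode(twice) == once

# A value that was never meant as base64 can still decode without error.
plain = b"abcd"
accidental = base64.b64decode(plain)
```

Nothing in the values themselves says which direction is the intended one; only the caller knows.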
>> b2 = bytes(b, "base64")
>> b3 = bytes(b2, "base64")
>> Which direction are we going again?
Bob> This is *exactly* why the current set of codecs are INSANE.
Bob> unicode.encode and str.decode should be used *only* for
Bob> unicode codecs. Byte transforms are entirely different
Bob> semantically and should be some other method pair.
General filters are semantically different, I agree. But "encode" and
"decode" in English are certainly far more general than character
coding conversion. The use of those methods for any stream conversion
that is invertible (eg, compression or encryption) is not insane.
It's just pedagogically inconvenient given the existing confusion
(outside of python-dev, of course<wink>) about character coding.
I'd like to rephrase your statement as "*only* unicode.encode and
str.decode should be used for unicode codecs". Ie, str.encode(codec)
and unicode.decode(codec) should raise errors if codec is a "unicode
codec". The question in my mind is whether we should allow other
kinds of codecs or not.
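One way to picture codecs that reject the wrong direction is a wrapper that declares the type on each side and checks it on every call. The sketch below is hypothetical: `TypedCodec` and its parameters are invented for illustration and are not part of the `codecs` module. It is written for Python 3, where `str` plays the role of the `unicode` type discussed in this thread.

```python
import base64


class TypedCodec:
    """Hypothetical codec wrapper that enforces one concrete signature."""

    def __init__(self, name, decoded_type, encoded_type, encode, decode):
        self.name = name
        self.decoded_type = decoded_type   # type accepted by encode()
        self.encoded_type = encoded_type   # type accepted by decode()
        self._encode = encode
        self._decode = decode

    def encode(self, obj):
        if not isinstance(obj, self.decoded_type):
            raise TypeError("%s.encode expects %s, got %s" % (
                self.name, self.decoded_type.__name__, type(obj).__name__))
        return self._encode(obj)

    def decode(self, obj):
        if not isinstance(obj, self.encoded_type):
            raise TypeError("%s.decode expects %s, got %s" % (
                self.name, self.encoded_type.__name__, type(obj).__name__))
        return self._decode(obj)


# A "unicode codec": text -> bytes on encode, bytes -> text on decode.
utf8 = TypedCodec("utf-8", str, bytes,
                  lambda s: s.encode("utf-8"),
                  lambda b: b.decode("utf-8"))

# A content transfer encoding: bytes -> bytes in both directions.
b64 = TypedCodec("base64", bytes, bytes,
                 base64.b64encode, base64.b64decode)

encoded = utf8.encode("caf\u00e9")
assert utf8.decode(encoded) == "caf\u00e9"
assert b64.decode(b64.encode(b"payload")) == b"payload"
```

With this shape, `utf8.decode("already text")` and `b64.encode("text")` both raise `TypeError` instead of silently guessing a direction.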
I could live with "not"<wink>, but if we're going to have other kinds
of codecs, I think they should have concrete signatures. Ie,
basestring -> basestring shouldn't be allowed. Content transfer
encodings like BASE64 and quoted-printable, compression, encryption,
etc IMO should be bytes -> bytes. Overloading to unicode -> unicode
is sorta plausible for BASE64 or QP, but YAGNI. OTOH, the Unicode
standard does define a number of unicode -> unicode transformations,
and it might make sense to generalize to case conversions etc. (Note
that these conversions are pseudo-invertible, so you can think of them
as generalized .encode/.decode pairs. The inverse is usually the
identity, which seems weird, but from the pedagogical standpoint you
could handle that weirdness by raising an error if the .encode method
were invoked.)
To be concrete, I could imagine writing
    s2 = s1.decode('upcase')
    if s2 == s1:
        print "Why are you shouting at me?"
    else:
        print "I like calm, well-spoken snakes."
    s3 = s2.encode('upcase')
    if s3 == s2:
        print "Never fails!"
    else:
        print "See a vet; your Python is *very* sick."
I chose the decode method to do the non-trivial transformation because
.decode()'s value is supposed to be "original" text in MAL's terms.
And that's true of uppercase-only text; you're still supposed to be
able to read it, so I guess it's not "encoded". That's pretty
pedantic; I think it's better to raise on .encode('upcase').
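The raising variant could look roughly like the following. This is a sketch under assumptions: no 'upcase' codec exists in Python, the `UpcaseCodec` class is invented here, and the choice of `TypeError` is arbitrary. It is written for Python 3, so it uses a plain object rather than the string methods.

```python
class UpcaseCodec:
    """Hypothetical one-way 'codec': decode upcases, encode refuses."""

    def decode(self, text):
        # The non-trivial direction: produce the upper-case text.
        return text.upper()

    def encode(self, text):
        # The inverse would be the identity, so refuse instead of
        # pretending that a real transformation happened.
        raise TypeError("'upcase' cannot be used with encode()")


upcase = UpcaseCodec()

s1 = "Hello, snake"
s2 = upcase.decode(s1)

# Decoding is idempotent: text that is already upper-case is unchanged.
assert upcase.decode(s2) == s2

try:
    upcase.encode(s2)
except TypeError as exc:
    message = str(exc)
```

Raising from `encode` keeps the pair from looking like a true round trip when only one direction actually changes the text.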
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.