Stephen J. Turnbull
stephen at xemacs.org
Sun Feb 19 15:30:02 CET 2006
>>>>> "M" == "M.-A. Lemburg" <mal at egenix.com> writes:
M> The main reason is symmetry and the fact that strings and
M> Unicode should be as similar as possible in order to simplify
M> the task of moving from one to the other.
Those are perfectly compatible with Martin's suggestion.
M> Still, I believe that this is an educational problem. There are
M> a couple of gotchas users will have to be aware of (and this is
M> unrelated to the methods in question):
But IMO that's wrong, both in attitude and in fact. As for attitude,
users should not have to be aware of these gotchas. Codec writers, on
the other hand, should be required to avoid presenting users with
those gotchas. Martin's draconian restriction is in the right
direction, but you can argue it goes way too far.
As for fact, of course it's related to the methods in question.
"Original" vs "derived" data can only be defined in terms of some
notion of the "usual semantics" of the streams, and that is going to
be strongly reflected in the semantics of the methods.
M> * "encoding" always refers to transforming original data into a
M> derived form
M> * "decoding" always refers to transforming a derived form of
M> data back into its original form
Users *already* know that; it's a very strong connotation of the
English words. The problem is that users typically have their own
concept of what's original and what's derived. For example:
M> * for Unicode codecs the original form is Unicode, the derived
M> form is, in most cases, a string
First of all, that's Martin's point!
Second, almost all Americans, a large majority of Japanese, and I
would bet most Western Europeans would say you have that backwards.
That's the problem, and it's the Unicode advocates' problem (i.e.,
ours), not the users'. Even if we're right, education will require
lots of effort. Rather, we should just make it as easy as possible to
do it right, and hard to do it wrong.
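To make the direction in the quoted bullets concrete, here is a minimal sketch. It assumes present-day Python 3 (which postdates this thread), where text is `str` and the derived form is `bytes`, and where each type happens to offer only one direction:

```python
# Assumption: Python 3 semantics, not the Python 2 str/unicode pair under discussion.
text = "日本語"                       # original form: Unicode text
data = text.encode("shift_jis")      # derived form: a byte string
restored = data.decode("shift_jis")  # back to the original form

# Only one direction is available on each type, so the wrong
# direction cannot be chosen by accident.
str_can_decode = hasattr(str, "decode")
bytes_can_encode = hasattr(bytes, "encode")
```

Whether that layout matches anyone's intuition about "original" and "derived" is exactly the point in dispute; the sketch only shows what the convention looks like once it is fixed.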
BTW, what use cases do you have in mind for Unicode -> Unicode
decoding? Producing maximally decomposed forms, eliminating
compatibility characters, and so on? Those are very specialized.
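If "maximally decomposed forms" and "eliminating compatibility characters" are read as Unicode normalization (an assumption about what is meant here), the standard library's `unicodedata.normalize` gives a rough picture of such Unicode -> Unicode transformations. The sample characters are chosen only for illustration:

```python
import unicodedata

# Canonical decomposition (NFD): one precomposed code point becomes
# a base letter plus a combining mark.
composed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = unicodedata.normalize("NFD", composed)

# Compatibility composition (NFKC): a compatibility character is
# replaced by its ordinary equivalent.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
folded = unicodedata.normalize("NFKC", ligature)
```

Note that this is a plain function rather than a codec, which fits the observation that the use case is specialized.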
M> Codecs also unify the various interfaces to common encodings
M> such as base64, uu or zip which are not Unicode related.
Now this is useful and has use cases I've run into, for example in
email, where you would like to use the same interface for base64 as
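for shift_jis and you'd like to be able to write:

    def encode_mime_body(string, codec_list):
        # charset_codec_list and transfer_codec_list are assumed to be
        # defined elsewhere.
        if codec_list[0] not in charset_codec_list:
            raise ValueError("first codec must be a charset codec")
        if len(codec_list) > 1 and codec_list[-1] not in transfer_codec_list:
            raise ValueError("last codec must be a transfer codec")
        for codec in codec_list:
            string = string.encode(codec)
        return string

    mime_body = encode_mime_body("This is a pen.",
                                 ['shift_jis', 'zip', 'base64'])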
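For comparison, here is a hedged sketch of how the same chain might look in present-day Python 3, where `str.encode` only produces bytes and the byte-to-byte codecs are reached through `codecs.encode`. The two whitelists and the `decode_mime_body` helper are hypothetical names introduced for illustration, and `'zlib'` stands in for the `'zip'` used above:

```python
import codecs

# Hypothetical whitelists; the message does not say what they contain.
CHARSET_CODECS = {"shift_jis", "utf-8"}
TRANSFER_CODECS = {"base64", "quopri", "uu"}

def encode_mime_body(text, codec_list):
    """Apply a charset codec, then any byte-to-byte codecs, in order."""
    if codec_list[0] not in CHARSET_CODECS:
        raise ValueError("first codec must be a charset codec")
    if len(codec_list) > 1 and codec_list[-1] not in TRANSFER_CODECS:
        raise ValueError("last codec must be a transfer codec")
    data = text.encode(codec_list[0])      # str -> bytes
    for codec in codec_list[1:]:
        data = codecs.encode(data, codec)  # bytes -> bytes
    return data

def decode_mime_body(data, codec_list):
    """Undo encode_mime_body by walking the codec list backwards."""
    for codec in reversed(codec_list[1:]):
        data = codecs.decode(data, codec)
    return data.decode(codec_list[0])      # bytes -> str

codec_list = ["shift_jis", "zlib", "base64"]
mime_body = encode_mime_body("This is a pen.", codec_list)
```

The loop is no longer perfectly uniform: the first step goes through the string method and the rest through the `codecs` module, which is the kind of asymmetry this thread is arguing about.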
I guess I have to admit I'm backtracking from my earlier hardline
support for Martin's position, but I'm still sympathetic: (a) that's
the direct way to "make it easy to do it right", and (b) I still think
the use cases for non-Unicode codecs are very often YAGNI.
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.