[Python-Dev] bytes.from_hex()
Stephen J. Turnbull
stephen at xemacs.org
Mon Feb 20 18:31:21 CET 2006
>>>>> "Josiah" == Josiah Carlson <jcarlson at uci.edu> writes:
Josiah> I try to internalize it by not thinking of strings as
Josiah> encoded data, but as binary data, and unicode as text. I
Josiah> then remind myself that unicode isn't native on-disk or
Josiah> cross-network (which stores and transports bytes, not
Josiah> characters), so one needs to encode it as binary data.
Josiah> It's a subtle difference, but it has worked so far for me.
Seems like a lot of work for something that, for monolingual usage,
should "Just Work" almost all of the time.
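Josiah's mental model above can be sketched in a few lines. This is only an illustration, and it assumes Python 3 naming, where `str` is text and `bytes` is binary data; in the Python 2 under discussion in this thread the corresponding types were `unicode` and `str`. The sample string and the choice of encodings are my own assumptions.

```python
# Abstract text inside the program (Python 3 str; Python 2 called this unicode).
text = "日本語"

# To store or transmit it, pick an encoding and turn the text into bytes.
utf8_data = text.encode("utf-8")
sjis_data = text.encode("shift_jis")

# The same three characters become different byte sequences per encoding.
utf8_len = len(utf8_data)
sjis_len = len(sjis_data)

# Decoding with the matching encoding recovers the original text.
round_trip = utf8_data.decode("utf-8")
```

The point of the sketch is that the number of characters is a property of the text, while the number of bytes is a property of the chosen encoding.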
Josiah> I notice that you seem to be in Japan, so teaching unicode
Josiah> is a must.
Yes. Japan is more complicated than that, but in Python unicode is a
must.
Josiah> If you are using the "unicode is text" and "strings are
Josiah> data" model, and they aren't getting it, then I don't know.
Well, I can tell you that they don't get it. One problem is PEP 263.
It makes it very easy to write programs that do line-oriented I/O with
input() and print, and the students come to think it should always be
that easy. Since Japan has at least 6 common encodings that students
encounter on a daily basis while browsing the web, plus a couple more
that live inside of MSFT Word and Java, they're used to huge amounts
of magic. The normal response of novice programmers is to mandate
that users of their programs use the encoding of choice and put it in
ordinary strings so that it just works.
I.e., the average student just "eats" the F on the codecs assignment,
and writes the rest of her programs without them.
>> simple, and the exceptions for using a "nonexistent" method
>> mean I don't have to reinforce---the students will be able to
>> teach each other. The exceptions also directly help reinforce
>> the notion that text == Unicode.
Josiah> Are you sure that they would help? If .encode() and
Josiah> .decode() drop from strings and unicode (respectively),
Josiah> they get an AttributeError. That's almost useless.
Well, I'm not _sure_, but this is the kind of thing that you can learn
by rote. And it will happen on a sufficiently regular basis that a
large fraction of students will experience it. They'll ask each
other, and usually they'll find a classmate who knows what happened.
I haven't tried this with codecs, but that's been my experience with
statistical packages where some routines understand non-linear
equations but others insist on linear equations.[1] The error messages
("Equation is non-linear! Aaugh!") are not much more specific than
AttributeError.
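For readers who want to see the kind of error being discussed, here is a small sketch. It assumes Python 3, where `str` has no `.decode()` and `bytes` has no `.encode()`; the helper name `try_call` is hypothetical and exists only to capture the exception instead of letting it propagate.

```python
def try_call(obj, method_name, *args):
    """Call obj.method_name(*args); return the AttributeError if the method is missing."""
    try:
        return getattr(obj, method_name)(*args)
    except AttributeError as exc:
        return exc

# Text has no decode() in Python 3: it is already decoded.
err_text = try_call("text", "decode", "utf-8")

# Bytes have no encode() in Python 3: they are already encoded.
err_bytes = try_call(b"data", "encode", "utf-8")

# The supported directions still work.
ok_bytes = try_call("text", "encode", "utf-8")
ok_text = try_call(b"data", "decode", "utf-8")
```

The exception names the missing method and the type, which is terse but is the sort of message a classmate can learn to recognize.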
Josiah> Raising a better exception (with more information) would
Josiah> be better in that case, but losing the functionality that
Josiah> either would offer seems unnecessary;
Well, the point is that for the "usual suspects" (i.e., Unicode codecs)
there is no functionality that would be lost. As MAL pointed out, for
these codecs the "original" text is always Unicode; that's the role
Unicode is designed for, and by and large it fits the bill very well.
With few exceptions (such as rot13) the "derived" text will be bytes
that peripherals such as keyboards and terminals can generate and
display.
Josiah> "You are trying to encode/decode to/from incompatible
Josiah> types. expected: a->b got: x->y" is better. Some of those
Josiah> can be done *very soon*, given the capabilities of the
Josiah> encodings module,
That's probably the way to go.
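A rough sketch of what such a message could look like follows. The names `CodecTypeError`, `checked_encode`, and `checked_decode` are hypothetical, not part of the codecs module, and the sketch assumes Python 3 types (`str` for text, `bytes` for binary data).

```python
class CodecTypeError(TypeError):
    """Hypothetical exception that names the expected and the actual conversion."""


def checked_encode(obj, encoding="utf-8"):
    # For a text codec, encoding goes from str to bytes.
    if not isinstance(obj, str):
        raise CodecTypeError(
            "You are trying to encode/decode to/from incompatible types. "
            f"expected: str->bytes got: {type(obj).__name__}->bytes"
        )
    return obj.encode(encoding)


def checked_decode(obj, encoding="utf-8"):
    # For a text codec, decoding goes from bytes to str.
    if not isinstance(obj, (bytes, bytearray)):
        raise CodecTypeError(
            "You are trying to encode/decode to/from incompatible types. "
            f"expected: bytes->str got: {type(obj).__name__}->str"
        )
    return bytes(obj).decode(encoding)
```

Subclassing `TypeError` is a design choice made here so that existing handlers for type errors would still catch it.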
If we can have a derived "Unicode codec" class that does this, that
would pretty much entirely serve the need I perceive. Beginning
students could learn to write iconv.py; more advanced students could
learn to create codec stacks to generate MIME bodies, which could
include base64 or quoted-printable bytes -> bytes codecs.
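To make the two exercises concrete, here is a minimal sketch, assuming Python 3 and the standard `base64` module. The function names `transcode` and `mime_body` are hypothetical; a real iconv.py would also need file handling and error policies, and a real MIME body needs headers and line wrapping, none of which is shown.

```python
import base64


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """iconv-style conversion: bytes -> text (Unicode) -> bytes."""
    text = data.decode(src)   # the "original" text is always Unicode
    return text.encode(dst)   # the "derived" form is bytes in the target encoding


def mime_body(text: str, charset: str) -> bytes:
    """A two-layer codec stack: a Unicode codec, then a bytes -> bytes codec."""
    raw = text.encode(charset)      # text -> bytes
    return base64.b64encode(raw)    # bytes -> bytes


sjis = "日本語".encode("shift_jis")
eucjp = transcode(sjis, "shift_jis", "euc_jp")
body = mime_body("日本語", "euc_jp")
```

Unstacking runs the layers in reverse order: base64-decode first, then decode with the charset.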
Footnotes:
[1] If you're not familiar with regression analysis, the problem is
that the equation "z = a*log(x) + b*log(y)" where a and b are to be
estimated is _linear_ in the sense that x, y, and z are data series,
and X = log(x) and Y = log(y) can be precomputed so that the equation
actually computed is "z = a*X + b*Y". On the other hand "z = a*(x +
b*y)" is _nonlinear_ because the coefficient on y is the product a*b.
Students find this hard to grasp in the classroom, but they learn
quickly in the lab.
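The first case can be tried out in a few lines. The sketch below is an illustration only: the data values are made up, the helper `fit_two_params` is hypothetical, and it solves the 2x2 normal equations directly instead of calling a statistics package.

```python
import math


def fit_two_params(X, Y, z):
    """Least-squares estimates of a and b in z = a*X + b*Y (no intercept)."""
    sxx = sum(xi * xi for xi in X)
    sxy = sum(xi * yi for xi, yi in zip(X, Y))
    syy = sum(yi * yi for yi in Y)
    sxz = sum(xi * zi for xi, zi in zip(X, z))
    syz = sum(yi * zi for yi, zi in zip(Y, z))
    det = sxx * syy - sxy * sxy
    # Cramer's rule on the normal equations.
    a = (sxz * syy - sxy * syz) / det
    b = (sxx * syz - sxy * sxz) / det
    return a, b


# Made-up data generated from z = 2*log(x) + 3*log(y), without noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 7.0]
z = [2.0 * math.log(xi) + 3.0 * math.log(yi) for xi, yi in zip(x, y)]

# Precompute the transformed series; the model is linear in a and b.
X = [math.log(xi) for xi in x]
Y = [math.log(yi) for yi in y]
a_hat, b_hat = fit_two_params(X, Y, z)
```

The logarithms are applied to the data before fitting, so the estimation step never sees anything but a linear combination of two known series.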
I believe the parameter/variable inversion that my students have
trouble with in statistics is similar to the "original"/"derived"
inversion that happens with "text you can see" (derived, string) and
"abstract text inside the program" (original, Unicode).
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.