[Python-Dev] bytes.from_hex()

Mon Feb 20 05:28:41 CET 2006

"Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> 
> >>>>> "Josiah" == Josiah Carlson <jcarlson at uci.edu> writes:
> 
>     Josiah> The question remains: is str.decode() returning a string
>     Josiah> or unicode depending on the argument passed, when the
>     Josiah> argument quite literally names the codec involved,
>     Josiah> difficult to understand?  I don't believe so; am I the
>     Josiah> only one?
> 
> Do you do any of the user education *about codec use* that you
> recommend?  The people I try to teach about coding invariably find it
> difficult to understand.  The problem is that the near-universal
> intuition is that for "human-usable text" pretty much anything *but
> Unicode* will do.  This is a really hard block to get them past.
> There is very good reason why Unicode is plain text ("original" in
> MAL's terms) and everything else is encoded ("derived"), but students
> new to the concept often take a while to "get" it.

I've not been teaching Python; when I was still a TA, it was strictly
algorithms and data structures.  Of those people who I have had the
opportunity to entice into Python, I've not followed up on their
progress to know if they had any issues.

I try to internalize it by not thinking of strings as encoded data, but
as binary data, and unicode as text.  I then remind myself that unicode
isn't native on-disk or cross-network (which stores and transports bytes,
not characters), so one needs to encode it as binary data.  It's a
subtle difference, but it has worked so far for me.
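
A minimal sketch of that mental model.  It assumes Python 3 spelling,
where str plays the role of this thread's unicode (text) and bytes plays
the role of the 8-bit string (data); the sample text is made up for
illustration.

```python
# Text is a sequence of characters; it has no byte representation
# until a codec is chosen.
text = "na\u00efve caf\u00e9"

# Going to disk or across the network means encoding text to binary data.
data = text.encode("utf-8")

# Coming back means decoding the binary data with the same codec.
round_trip = data.decode("utf-8")

# The character count and the byte count differ, because each accented
# letter takes two bytes in UTF-8.
print(len(text), len(data), round_trip == text)
```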

In my experience, at least among English-only users, most people
don't even get to unicode.  I didn't touch it myself until I was well
versed in encoding and decoding all different kinds of binary
data, when a half-dozen international users (China, Japan, Russia, ...)
requested its support in my source editor; so I added it.  Supporting it
properly hasn't been very difficult, and the only real nit I have
experienced is supporting the encoding line just after the #! line for
arbitrary codecs (sometimes saving a file in a particular encoding dies).
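
To illustrate the kind of failure I mean, here is a rough sketch, not
the code from my editor.  It assumes Python 3 spelling, a pattern along
the lines of the one given in PEP 263, and a default of utf-8 when no
encoding line is present; declared_encoding and encode_for_save are
hypothetical names.

```python
import re

# Pattern along the lines of the one in PEP 263 for the encoding line.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source, default="utf-8"):
    """Return the codec named on line 1 or 2 of source, else default."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return default


def encode_for_save(source):
    """Encode source with its declared codec.

    Returns (codec, data, error).  On failure data is None and error
    holds the exception, so the caller can report it instead of dying.
    """
    codec = declared_encoding(source)
    try:
        return codec, source.encode(codec), None
    except (LookupError, UnicodeEncodeError) as exc:
        # LookupError: unknown codec name.
        # UnicodeEncodeError: the text has characters the codec lacks.
        return codec, None, exc


ok = "#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint('caf\u00e9')\n"
bad = "#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint('\u65e5\u672c')\n"

print(encode_for_save(ok)[0], encode_for_save(ok)[2])
print(encode_for_save(bad)[0], type(encode_for_save(bad)[2]).__name__)
```

The second file declares latin-1 but contains characters that latin-1
cannot represent, which is the case where a naive save dies.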

I notice that you seem to be in Japan, so teaching unicode is a must.
If you are using the "unicode is text" and "strings are data" framing
and they still aren't getting it, then I don't know.

> Maybe it's just me, but whether it's the teacher or the students, I am
> *not* excited about the education route.  Martin's simple rule *is*
> simple, and the exceptions for using a "nonexistent" method mean I
> don't have to reinforce---the students will be able to teach each
> other.  The exceptions also directly help reinforce the notion that
> text == Unicode.

Are you sure that they would help?  If .encode() is dropped from strings
and .decode() from unicode, anyone who calls them gets an AttributeError.
That's almost useless.  Raising a better exception (with more
information) would be better in that case, but losing the functionality
that either would offer seems unnecessary; which is why I had suggested
some of the other method names.  Perhaps a message like "This method was
removed because it confused users.  Use help(str.encode) (or
unicode.decode) to find out how you can do the equivalent, or do what
you *really* wanted to do."
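
For a feel of what that AttributeError looks like, here is a small
sketch.  It assumes Python 3, where the split under discussion is in
effect (bytes has no .encode() and str has no .decode()); try_call is a
hypothetical helper written only for this illustration.

```python
def try_call(obj, method_name, *args):
    """Call obj.method_name(*args); return the result or the error text."""
    try:
        return getattr(obj, method_name)(*args)
    except AttributeError as exc:
        return "AttributeError: %s" % exc


# Binary data has no .encode(), and text has no .decode().
print(try_call(b"caf\xc3\xa9", "encode", "utf-8"))
print(try_call("caf\u00e9", "decode", "utf-8"))

# The supported direction still works: text -> data.
print(try_call("caf\u00e9", "encode", "utf-8"))
```

The error text names the missing attribute but says nothing about which
method or type the caller should have used instead.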

> I grant the point that .decode('base64') is useful, but I also believe
> that "education" is a lot more easily said than done in this case.

What I meant by "education" is 'better documentation' and 'better
exception messages'.  I didn't learn Python by sitting in a class; I
learned it by going through the tutorial over a weekend as a 2nd year
undergrad and writing software which could do what I wanted/needed.
Compared to the compiler messages I'd been seeing from CodeWarrior and
MSVC 6, Python exceptions were like an oracle.  I can understand how
first-time programmers can have issues with *some* Python exception
messages, which is why I think that we could use better ones.  There is
also the other issue that sometimes people fail to actually read the
messages.

Again, I don't believe that an AttributeError is any better than an
"ordinal not in range(128)", but "You are trying to encode/decode
to/from incompatible types. expected: a->b got: x->y" is better.  Some
of those messages can be added *very soon*, given the capabilities of the
encodings module, and they could likely be migrated easily, regardless
of what is decided about .encode()/.decode().
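
A sketch of what such a message could look like, under some loud
assumptions: Python 3 spelling (str is text, bytes is data), a
hand-written CODEC_TYPES table that is purely hypothetical and covers
only a few codecs, and hypothetical wrapper names checked_encode and
checked_decode.  This is not how the encodings module works today; only
codecs.encode() and codecs.decode() are real calls.

```python
import codecs

# Hypothetical table: codec name -> (input type, output type) for encode().
# decode() is assumed to go the other way.
CODEC_TYPES = {
    "utf-8": (str, bytes),
    "latin-1": (str, bytes),
    "base64": (bytes, bytes),
    "hex": (bytes, bytes),
}


def _check(action, value, codec, expected_in, expected_out):
    """Raise a descriptive TypeError if value has the wrong input type."""
    if not isinstance(value, expected_in):
        raise TypeError(
            "You are trying to %s to/from incompatible types with codec %r. "
            "expected: %s->%s got: %s"
            % (action, codec, expected_in.__name__, expected_out.__name__,
               type(value).__name__))


def checked_encode(value, codec):
    src, dst = CODEC_TYPES[codec]
    _check("encode", value, codec, src, dst)
    return codecs.encode(value, codec)


def checked_decode(value, codec):
    src, dst = CODEC_TYPES[codec]
    # Decoding reverses the direction: input is the encoded type.
    _check("decode", value, codec, dst, src)
    return codecs.decode(value, codec)


print(checked_decode(b"aGVsbG8=", "base64"))
try:
    checked_encode("hello", "base64")
except TypeError as exc:
    print(exc)
```

The point is only that the expected and actual types are known at the
time of the call, so the message can say so.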

 - Josiah