2.3 encoding parsing bug: calmer thoughts

Thu Feb 19 09:00:05 EST 2004

First, my apologies to python-dev et al. for my irritable remarks re pep
263 (http://www.python.org/peps/pep-0263.html), and thanks to Michael Hudson
and Jeff Epler for their warm-hearted and generous responses to my
outbursts.  It's so much easier to think now that there is no "vendetta"
going on :-)

This morning in the shower I realized that far from being "abused" by pep
263, Leo is, or will be, the beneficiary of pep 263.  Indeed, having Python
recognize an encoding field in an #@+leo line is exactly what Leo's users
would want: it saves them from writing their own # -*- coding: <encoding
name> -*- line.

The reason Leo ran afoul of Python 2.3 is that Leo presently terminates the
encoding field with a period.  Alas, periods may appear in encoding names.
Leo's convention is just wrong, so regardless of pep 263 Leo's file formats
will have to change in order to properly handle names such as
'japanese.sjis'.
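
To make the failure concrete, here is a minimal sketch in present-day
Python.  The sentinel line below is only an assumed illustration of a Leo
first line whose encoding field ends in a period, not a quotation from a
real file.  The pattern is the one from pep 263, with the hyphen moved to
the front of the character class so that it is plainly a literal hyphen.

```python
import re

# The pep 263 pattern; the hyphen is placed first in the character class
# so that it is unambiguously a literal hyphen.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

# Hypothetical sentinel line, made up for illustration: the encoding field
# is terminated by a period, as described above.
sentinel = "#@+leo-ver=4-encoding=iso-8859-1."

match = CODING_RE.search(sentinel)
captured = match.group(1)

# The period is a legal character in an encoding name, so it is captured too.
print(repr(captured))
```

The captured name carries the trailing period, so it is no longer the name
the user intended.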

The only remaining question in my mind is this:  how likely is it for a user
to "innocently" match the regular expression "coding[:=]\s*([\w-_.]+)" by
mistake?  I see now that Leo's experience doesn't refute the assertion that
such a match is unlikely.  Indeed, Leo's syntax _should_ have matched this
re: the problem arose not from any defect in pep 263 but from a very real
bug in Leo.
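
As a rough way to judge what an "innocent" match looks like, the sketch
below runs the same pattern (again with the hyphen moved to the front of
the character class, and written in verbose form so each piece can be
commented) over a few made-up comment lines.  The helper name
declared_encoding and the sample lines are my own inventions for
illustration.

```python
import re

CODING_RE = re.compile(r"""
    coding      # the literal text; "encoding" and "fileencoding" end in it too
    [:=]        # immediately followed by a colon or an equals sign
    \s*         # optional whitespace
    ([-\w.]+)   # the name: letters, digits, underscores, hyphens, periods
""", re.VERBOSE)

def declared_encoding(line):
    """Return the name the pattern captures from line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

# Made-up sample lines, not taken from any real file.
samples = [
    "# -*- coding: latin-1 -*-",          # intended declaration
    "# vim: set fileencoding=utf-8 :",    # intended declaration
    "# Notes on coding: keep it simple",  # accidental match
    "# decoding = fast",                  # no match: space before '='
]
for line in samples:
    print(repr(line), "->", declared_encoding(line))
```

Only the third sample matches by accident, and it does so only because a
colon follows the word "coding" directly.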

In short, my opinion of pep 263 has undergone an almost 180-degree
turnaround.  I like it, Leo's users will benefit from it, and it seems
unlikely that other people's existing code will suffer.  Indeed, one would
typically expect an initial line containing "coding:" or "coding=" to be
followed by a valid Unicode encoding.

Two other thoughts:

1. The summaries of pep 263, such as
http://www.python.org/doc/2.3.3/whatsnew/section-encodings.html, are not
accurate; that is, they do not reflect what really happens.  IMO, it would
make more sense to describe the re in English (as well as give the actual
re) and to give the rationale for making the re fairly general.

2. I wonder if it makes sense to do something besides throwing a syntax
error if the encoding isn't recognized.  I suspect this topic has already
been discussed.  Can anyone summarize it for me?
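
On the second point, a tool that writes an encoding name into a file can at
least ask Python in advance whether the name is known.  This is a minimal
sketch, assuming that the standard codecs.lookup function is an adequate
stand-in for the check made when a source file is read;
is_known_encoding is a hypothetical helper name.

```python
import codecs

def is_known_encoding(name):
    """Return True if Python's codec registry recognizes name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup raises LookupError for names it cannot resolve.
        return False
    return True

print(is_known_encoding("utf-8"))
print(is_known_encoding("no-such-encoding-xyz"))
```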

Many thanks to all who have responded, publicly or in private, to me on this
subject.

Edward
--------------------------------------------------------------------
Edward K. Ream   email:  edreamleo at charter.net
Leo: Literate Editor with Outlines
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------