[Python-3000] Support for PEP 3131

Fri May 25 16:56:52 CEST 2007

On 5/24/07, Guillaume Proux <gproux+py3000 at gmail.com> wrote:
> Hi Jim,
> On 5/25/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > It isn't strictly security; when I've been burned by cut-and-paste
> > that turned out to be an unexpected character, it didn't cause damage,
> > but it did take me a long time to debug.

> Can you give a longer explanation because I don't understand what is
> the issue. Is it like the issue with confusing 0 and O ? You seemingly
> already have an experience with using something that is now not legal
> in Python. Was it in Java or .NET world?

The really hard-to-debug ones were usually in C.  It happened more
when I was less experienced, or the available tools were limited.

They usually involved something that looked like a quote mark, but
wasn't.  (I worry about the characters that look like a less-than
sign, but I've never had trouble with them in practice.  Problems with
other punctuation were rare enough that I can't say they were worse
than "." vs "," or ":" vs ";".)

This would be less of a problem in Python, because it takes
triple-quotes to continue a string across multiple lines -- but
it would still be an occasional problem.
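
A minimal sketch of how pasted source could be scanned for quote
look-alikes before it is run.  The table and the function name
find_lookalike_quotes are assumptions made up for illustration; the
table is nowhere near a complete list of confusable characters.

```python
import unicodedata

# Hypothetical, deliberately incomplete table: characters that resemble
# ASCII quote marks, mapped to the ASCII character they imitate.
LOOKALIKE_QUOTES = {
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2032": "'",   # PRIME
    "\u2033": '"',   # DOUBLE PRIME
}


def find_lookalike_quotes(source):
    """Return (line, column, Unicode name, ASCII equivalent) per hit.

    Line and column numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in LOOKALIKE_QUOTES:
                hits.append(
                    (lineno, col, unicodedata.name(ch), LOOKALIKE_QUOTES[ch])
                )
    return hits


if __name__ == "__main__":
    pasted = 'x = \u201chello\u201d\ny = "ok"'
    for hit in find_lookalike_quotes(pasted):
        print(hit)
```

Something like this would typically sit in an editor hook or a
pre-commit check, so that the report shows up before the debugging
session rather than after it.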

This would be less of a problem if I had started out smarter, or if I
had never worked with people who used presentation-focused editors
(like MS Word) when discussing code, but those are only theoretical
possibilities.

> > For most people, the appearance of a Greek or Japanese (let alone
> > both) character would be more likely to indicate a typo.  If you know
> > that your project is using both languages, then just allow both; the
> > point is that you have made an explicit decision to do so.

> * Python is dynamic (you can have a e.g. pygtk user interface which
> enables you to load at runtime a new .py file even to use a text view
> to type in a mini-script that will do something specific in your
> application domain): you never know what will get loaded next

I am not missing that -- that is the situation I worry about *most*.
If I'm running something that new, and I've only inspected it
visually, I want a great big warning about unexpected characters that
merely look like what I thought they were.

No, this won't happen often -- but like threading race conditions,
that almost makes it worse.  Because it is rare, people won't remember
to check for it unless the check is an automated default.

If I were in a Japanese environment, regularly getting code written in
Japanese, then Japanese code would be fine, so I would set my
environment to accept Japanese -- but I would still get that warning
for something that appears Latin but actually contains Cyrillic.
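
A rough sketch of such a warning, under stated assumptions: Python's
unicodedata module has no script property, so the first word of each
character's Unicode name is used as a crude stand-in for its script.
The names rough_script and unexpected_scripts, and the "allowed"
parameter, are hypothetical; a real check would use proper script data.

```python
import unicodedata


def rough_script(ch):
    """Approximate a script by the first word of the character's name.

    This is a heuristic, not real Unicode script data: for example,
    CJK ideographs come out as "CJK" and unnamed characters as "UNKNOWN".
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def unexpected_scripts(identifier, allowed=frozenset({"LATIN"})):
    """Return the rough scripts used in identifier that are not allowed.

    ASCII characters are always accepted and never inspected.
    """
    found = set()
    for ch in identifier:
        if ord(ch) < 128:
            continue
        script = rough_script(ch)
        if script not in allowed:
            found.add(script)
    return found


if __name__ == "__main__":
    # Looks like "password", but the second letter is U+0430 (Cyrillic).
    print(unexpected_scripts("p\u0430ssword"))
    # A site that has explicitly opted in to Japanese identifiers.
    japanese_site = {"LATIN", "HIRAGANA", "KATAKANA", "CJK"}
    print(unexpected_scripts("\u65e5\u672c", allowed=japanese_site))
```

The "allowed" set is where the explicit decision lives: it could be a
per-project or site-wide setting, and anything outside it is reported
rather than silently accepted.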

> * Python is embeddable: and often it is to bring the power of python
> to less sophisticated users. You can imagine having a global system
> deployed all around the world by a global company enabling each user
> in each subsidiary to create their own extension scripts.

If they can supply their own scripts, they can supply their own data
files -- including an acceptable characters table.  But they wouldn't
really need to -- realistically, the acceptable characters would be a
corporate (or at least site-wide) policy decision that could be set at
install time.

> * There is a runtime cost for checking: the speed vs. security
> tradeoff

True, but if speed is that important, then ASCII-only is better; the
initial file reading will happen faster, as will the parsing to
characters, and the deciding whether characters can be part of an
identifier.  Even a blind "Any code point greater than 127 is
always allowed" is still slower than not having to consider those code
points.

Once you start saying "letters and digits only", you need a
per-character lookup, and the difference between "in this set of 4000
out of several million" vs "in this set of several million out of
several more million" doesn't need to slow things down.
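
To illustrate the point about lookup cost: with a hash-based set, the
membership test is the same single operation whether the allowed table
is small or large.  This is only a sketch; the two tables and the name
identifier_chars_ok are assumptions for illustration, and the large
table is limited to the Basic Multilingual Plane to keep it quick to
build.

```python
import string

# Hypothetical small table: ASCII letters, digits and underscore.
SMALL_ALLOWED = frozenset(
    ord(c) for c in string.ascii_letters + string.digits + "_"
)

# Hypothetical large table: every alphanumeric code point in the BMP,
# plus underscore.  Far more entries, same kind of lookup.
LARGE_ALLOWED = frozenset(
    cp for cp in range(0x10000) if chr(cp).isalnum()
) | {ord("_")}


def identifier_chars_ok(name, allowed):
    """True if every character of name has its code point in allowed.

    Each test is one hash lookup; its cost does not grow with the
    size of the allowed set.
    """
    return all(ord(ch) in allowed for ch in name)


if __name__ == "__main__":
    print(len(SMALL_ALLOWED), len(LARGE_ALLOWED))
    print(identifier_chars_ok("na\u00efve", SMALL_ALLOWED))
    print(identifier_chars_ok("na\u00efve", LARGE_ALLOWED))
```

The cost that does differ is building and storing the table, which
happens once, not per character.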

> (for a security benefit that is still very much hypothetical
> in the face of the experience of Java and .NET people)

(a)  Aren't those compiled languages, rather than interpreted?  So a
misleadingly-named identifier doesn't matter as much, because people
aren't looking at the source anyhow.
(b)  How do you know there haven't been problems that just weren't
caught?  (Perhaps more of the "wonder why that errored out" variety
than security breaches.)

> * In real life, you won't see much python programs that are not
> written in your script.

Exactly.  So when you do, they should be flagged.

-jJ