[Python-bugs-list] [ python-Bugs-681960 ] Source encoding rules are extreme.

SourceForge.net noreply@sourceforge.net
Wed, 12 Feb 2003 08:24:54 -0800


Bugs item #681960, was opened at 2003-02-07 01:17
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=681960&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Closed
Resolution: None
Priority: 3
Submitted By: Kirill Simonov (kirill_simonov)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Source encoding rules are extreme.

Initial Comment:
According to PEP 0263, source code that contains non-ASCII
characters (ord(ch) > 127) and does not define an encoding causes
a DeprecationWarning. In the future, such code will cause a
SyntaxError.

While I believe that the idea of defining a source code encoding
is very useful, I think that the current solution is unnecessarily
extreme.

It is very unfriendly for beginners. Imagine a student that
types her first script:

name = raw_input("What's your name? ")   # Russian here, of course
print "Hi %s!" % name

Do not even try to convince me that she must define an encoding
here. That requirement would rule out any possibility of using
Python in schools.

Actually, the source code encoding only affects Unicode literals.
The above script works the same way with any defined encoding,
so the warning for this code is unnecessary.
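
To illustrate the point above, here is a minimal sketch. It assumes
Python 3 semantics for clarity (the report itself concerns Python 2.3),
and the sample bytes and the two encodings are chosen only for
illustration. The same bytes stay identical as a byte string whatever
encoding is assumed; only the decoded (Unicode) text differs.

```python
# The bytes of a Russian greeting as they would be stored in a
# windows-1251 encoded file.
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"

# Interpreting the bytes as text depends on the assumed encoding...
as_cp1251 = raw.decode("cp1251")
as_koi8r = raw.decode("koi8_r")
texts_differ = as_cp1251 != as_koi8r

# ...but encoding each text back with its own codec returns the
# original bytes unchanged.
bytes_preserved = (as_cp1251.encode("cp1251") == raw
                   and as_koi8r.encode("koi8_r") == raw)

print(texts_differ, bytes_preserved)
```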

As a solution, I propose to issue DeprecationWarning (or
SyntaxError) only when a non-ASCII character is contained in a
Unicode literal.


----------------------------------------------------------------------

Comment By: Denis S. Otkidach (ods)
Date: 2003-02-12 19:24

Message:
Logged In: YES 
user_id=63454

encode/decode is slow compared to translate. Octal/hexadecimal escapes 
are OK. I've noticed that defining an arbitrary source encoding allows 
arbitrary binary data in strings (a bit ugly, but OK when this setting is 
hidden in site.py), so there is no problem even for old code.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-12 18:28

Message:
Logged In: YES 
user_id=38388

You shouldn't put binary data into Python source files 
to begin with. If you absolutely must, then base64 provides
a good start for an ASCII-encoding. The other alternative
is using Python octal escapes. Both are fast.

I don't know where you get the idea that 
encode/decode are slow. They are certainly faster than
first building a list of ints in memory and then applying
map() to the list.
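
A minimal sketch of the two ASCII-only options mentioned above, written
for Python 3 as an assumption (the thread predates it). The payload
bytes are hypothetical and stand in for something like inline image
data.

```python
import base64

# Hypothetical binary payload, e.g. the first bytes of an inline image.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF, 0x80])

# Option 1: keep the data in the source as an ASCII-only base64 string
# and decode it when the module is loaded.
encoded = base64.b64encode(payload).decode("ascii")
restored = base64.b64decode(encoded)

# Option 2: write the same data as a literal with octal escapes,
# which is also pure ASCII in the source file.
escaped = b"\211PNG\000\377\200"

print(restored == payload, escaped == payload)
```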

----------------------------------------------------------------------

Comment By: Denis S. Otkidach (ods)
Date: 2003-02-12 18:05

Message:
Logged In: YES 
user_id=63454

Hmm... Is there no type for byte streams in Python anymore? Too much to 
change in existing code.  Base64 is not the best solution - too many 
unwanted and slow operations. There are too many areas where we need 
literals for binary data. One more example: translation tables for different 
encodings. Yea, I know about unicode/encode/decode etc, but they are 
_very_ slow for many applications. Use map(chr, [...list of ints...])?
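
As a sketch of the translation-table use case mentioned here, assuming
Python 3 and using KOI8-R to windows-1251 purely as an example pair:
the table is built once with the codecs, and the conversion itself is
then a per-byte lookup. No timing claim is made; the snippet only shows
that both routes give the same bytes.

```python
# Build a 256-byte table mapping each KOI8-R byte to its windows-1251
# byte. Characters without a windows-1251 equivalent become "?".
all_bytes = bytes(range(256))
koi8r_to_cp1251 = (all_bytes.decode("koi8_r", "replace")
                   .encode("cp1251", "replace"))

text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian greeting
koi8r_data = text.encode("koi8_r")

# Route 1: per-byte table lookup, no intermediate Unicode object.
via_translate = koi8r_data.translate(koi8r_to_cp1251)

# Route 2: decode to Unicode, then encode to the target encoding.
via_codecs = koi8r_data.decode("koi8_r").encode("cp1251")

print(via_translate == via_codecs)
```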

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-12 17:49

Message:
Logged In: YES 
user_id=38388

Encode the 8-bit data as base64 value and put that into the
source code.

----------------------------------------------------------------------

Comment By: Denis S. Otkidach (ods)
Date: 2003-02-12 17:36

Message:
Logged In: YES 
user_id=63454

An 8-bit string in Python is just a stream of bytes now. Why should I 
specify an encoding for inline image data, for instance? And what 
encoding should I use?

----------------------------------------------------------------------

Comment By: Kirill Simonov (kirill_simonov)
Date: 2003-02-10 18:39

Message:
Logged In: YES 
user_id=36553

I like this. Thanks.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-10 12:43

Message:
Logged In: YES 
user_id=38388

I've had a private discussion with Guido and Roman Suzi:

We'll add a way to set the source code default encoding via the
site.py/sitecustomize.py files. This should then allow anyone
wishing to customize the default behaviour to do so.

----------------------------------------------------------------------

Comment By: Kirill Simonov (kirill_simonov)
Date: 2003-02-07 02:28

Message:
Logged In: YES 
user_id=36553

Hello,

Yes, I understand that the encoding is for the whole source
file.

But

1. The current implementation already assumes that one uses an
ASCII-compatible encoding. So we can go a step further and not use
any encoding while reading a source file. And then we'll translate
u"..." using the 'ascii' encoding.

2. How do you want to support the UTF-16 encoding? This will
completely break ordinary string literals! "aa" in source code
would become "a\x00a\x00" after compilation. Or am I missing
something?

3. Do not forget that your change breaks billions of scripts that
use non-ASCII characters even in comments!

4. I can write a patch. I would be forced to do this anyway.
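
The byte layout referred to in point 2 can be checked with a short
sketch. It assumes Python 3, where the encoding step is explicit; the
variable names are illustrative only.

```python
literal = "aa"

# One byte per character in an ASCII-compatible encoding.
ascii_bytes = literal.encode("ascii")

# Two bytes per character in little-endian UTF-16, without a BOM.
utf16_le_bytes = literal.encode("utf-16-le")

# The generic "utf-16" codec also prepends a two-byte BOM.
utf16_bytes = literal.encode("utf-16")

print(ascii_bytes, utf16_le_bytes, len(utf16_bytes))
```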



----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-07 01:45

Message:
Logged In: YES 
user_id=38388

Sorry, but the implementation we chose decodes the complete file,
not only the Unicode literals, so if you want to use a specific 
encoding in the source code, you have to be explicit about it.

Python's source code was originally never meant to contain
non-ASCII characters. The PEP implementation now officially
allows this provided that you use an encoding marker, e.g.

"""
# -*- coding: windows-1251 -*-
name = raw_input("   ? ")
print " %s" % name
"""

Note that this is also needed in order to support UTF-16
file formats which use two bytes per character. Python
will automatically detect these files, so if you really don't
like the coding marker, simply write the file using a UTF-16
aware editor which prepends a UTF-16 BOM to the file.
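
For reference, a rough sketch of how an encoding marker like the one
above can be recognized. The regular expression is the one given in
PEP 263; the helper name declared_encoding is hypothetical, the code is
written for Python 3, and it is not the interpreter's actual
implementation.

```python
import re

# Pattern given in PEP 263 for the encoding declaration.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def declared_encoding(source):
    """Return the encoding named in the first two lines of `source`
    (a bytes object), or None if no declaration is found."""
    for line in source.splitlines()[:2]:
        # Only comment lines can carry the declaration.
        if line.lstrip().startswith(b"#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
    return None

sample = b"# -*- coding: windows-1251 -*-\nname = 1\n"
print(declared_encoding(sample))
```

Only the first two lines are examined, as the PEP requires, so a marker
further down the file is ignored.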


----------------------------------------------------------------------
