PEP 263 -- Python Source Code Encoding

I consider the above PEP ready for review by the developers. Please comment. http://python.sourceforge.net/peps/pep-0263.html After approval, the next step would be to implement phase 1 for 2.3. Step two would then be on the plate for 2.4. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

That looks OK to me. Is the Emacs-style comment in fact compatible with Emacs? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
According to Martin, it is compatible. If it's not we'll make it so :-) Barry ? -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> According to Martin, it is compatible. If it's not we'll make it so :-) Barry ?
I believe so, although I haven't ever used this trick to specify the file's encoding. In some quick tests, at least XEmacs doesn't bomb out on it (if I stick a real encoding in for <encoding name>). -Barry

Guido van Rossum <guido@python.org> writes:
That looks OK to me. Is the Emacs-style comment in fact compatible with Emacs?
It is. I expect many people want to put "utf-8" as the encoding name, and you need Emacs 21 for that (or Emacs with Mule-UCS, or some such). In GNU Emacs, you see the effect of the coding: directive in the Emacs status line. Just try the attached file, it will indicate "R" for KOI8-R. Not sure about XEmacs. Regards, Martin

Cool! It worked for me in Emacs, but not in XEmacs. Oh well. --Guido van Rossum (home page: http://www.python.org/~guido/)

>> That looks OK to me. Is the Emacs-style comment in fact compatible with Emacs?
Martin> It is. I expect many people want to put "utf-8" as the encoding name, and you need Emacs 21 for that (or Emacs with Mule-UCS, or some such). In GNU Emacs, you see the effect of the coding: directive in the Emacs status line. Just try the attached file, it will indicate "R" for KOI8-R. Not sure about XEmacs.
I use XEmacs 21.4.5 (non-MULE). I see nothing particularly interesting when visiting that file. Apropos doesn't indicate there is a variable named "coding" either. I see ":encoding", an undocumented variable. Everything else containing "coding" is more complex and seems package-specific (tramp, vm, ediff, etc). Perhaps using MULE would make a difference. Skip

"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
MvL> Guido van Rossum <guido@python.org> writes:
>> That looks OK to me. Is the Emacs-style comment in fact compatible with Emacs?
MvL> It is. I expect many people want to put "utf-8" as the encoding name, and you need Emacs 21 for that (or Emacs with Mule-UCS, or some such). In GNU Emacs, you see the effect of the coding: directive in the Emacs status line. Just try the attached file, it will indicate "R" for KOI8-R. Not sure about XEmacs.
I don't think it works for XEmacs. I've got a MULE-aware XEmacs 21.4.6 and while it asks if I want to set the local variables in the -*- line, I still see "Raw" in the modeline, and I see the following letters in print string (with funny little lines above the characters): iAOOEI. See attached capture. That doesn't seem right, does it? -Barry

"Martin v. Loewis" wrote:
After reading some of the Emacs docs, I think we should allow a more flexible coding line: -*- ... coding: (\w+) ... -*- because you will sometimes want to add more variables to that Emacs init line than just the encoding declaration. Does anybody know where XEmacs is moving w/r to this ? (and for that matter what about vi, vim, etc. ?) -- Marc-Andre Lemburg

On Tue, Feb 26, 2002 at 08:50:35PM +0100, M.-A. Lemburg wrote:
Does anybody know where XEmacs is moving w/r to this ? (and for that matter what about vi, vim, etc. ?)
I'm working with Vim 6.0, 2001 Sep 14. VIM lets you set variables with text similar to vim:KEY=VALUE:KEY=VALUE:....: Apparently you would use vim:fileencoding=sjis: to select shift-jis encoding. In the vim style, it seems most common to place this at the bottom of a file, but it can be placed at the top too. The variable "modelines" controls how many lines at each end of the file are inspected, with the default being 5. It's documented that the form vi:set KEY=VALUE: may be compatible with "some versions of Vi" but does not say which. (I can't get this to work) You can set a list of encodings to attempt when a file is loaded, which defaults to "ucs-bom,utf-8,latin1". A user who wanted to treat non-unicode files as shift-jis by default would :set fileencodings=ucs-bom,utf-8,sjis You can also load a particular file with the ++enc parameter: :edit ++enc=koi8-r russian.txt (I can get this to work, but I have to do it manually to load anything in an odd character set) The emacs line is harmless in vim, but doesn't do anything. It's possible that using :autocmd someone could make vim use the emacs line to set encoding, but I'm not sure -- setting fileencoding after a file is loaded seems to perform a translation from the old character set to the new. Jeff

jepler@unpythonic.dhs.org wrote:
So if we use the RE "coding[=:]\s*([\w-]+)" on the first line, we should be able to reach out for the encoding, right ? This RE would then cover both vim and emacs. -- Marc-Andre Lemburg
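
A minimal sketch of how that RE behaves on Emacs-style and vim-style lines, written in present-day Python 3 rather than the interpreter under discussion. The pattern is the one quoted in the message; the names `CODING_RE` and `find_coding` are made up for this illustration and are not part of the PEP or of any implementation.

```python
import re

# The pattern proposed in the message above, unchanged.
CODING_RE = re.compile(r"coding[=:]\s*([\w-]+)")


def find_coding(line):
    """Return the encoding name declared on `line`, or None (hypothetical helper)."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


# Emacs-style line with only the coding variable.
print(find_coding("# -*- coding: koi8-r -*-"))
# Emacs-style line carrying an additional variable.
print(find_coding("# -*- mode: python; coding: utf-8 -*-"))
# vim-style modeline: "coding=" is matched inside "fileencoding=".
print(find_coding("# vim:fileencoding=sjis:"))
# An ordinary line of code yields None.
print(find_coding("import os"))
```

Because the character class is `[\w-]`, hyphenated names such as "koi8-r" and "utf-8" are captured whole, which a bare `\w+` would not do.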

On Wed, Feb 27, 2002 at 10:20:57AM +0100, M.-A. Lemburg wrote:
I've been informed on a #vim irc channel that "vim:fileencoding=blah:" does not work. Unfortunate. I overlooked the part of the documentation which states To read a file in a certain encoding it won't work by setting 'fileencoding', use the |++enc| argument. However, there's a "charset plugin" for vim: http://vim.sourceforge.net/scripts/script.php?script_id=199 which could be adapted to follow whatever convention is chosen for Python. However, this plugin is not standard in any version of vim. It's not clear what license it's under, but referencing it from the PEP and documenting that something like au BufReadPost *.py ReloadWhenCharset(1, "coding[:=]\s([\w-]+)") au BufReadPost *.py ReloadWhenCharset(2, "coding[:=]\s([\w-]+)") (search the first two lines for the emacs coding special marker) would cause it to detect the charset of a Python file would certainly be possible. The plugin functions by executing a reload of the file with ++enc when ReloadWhenCharset matches its pattern. Jeff

This actually works in vim with "charset plugin": let s:pep263='coding[:=]\s*\([-A-Za-z0-9_]\+\)' au BufReadPost *.py call ReloadWhenCharsetSet(1, s:pep263) au BufReadPost *.py call ReloadWhenCharsetSet(2, s:pep263) It searches for a RE compatible with PEP263 in the first and second lines. You could change the pattern from *.py to * if you want to recognize the emacs-style coding in all files. Jeff

Jeff Epler wrote:
Great ! So we can say that the RE fits vim and emacs, right ? -- Marc-Andre Lemburg

[MAL]
I consider the above PEP ready for review by the developers. Please comment.
The PEP seems to dictate that the source by default must be read as latin-1: """ Python will default to Latin-1 as standard encoding if no other encoding hints are given. """ Jython already reads the Python source with the default Java encoding, which usually depends on the PC's locale. If a small loophole could be added to that requirement, then the PEP has my full support. regards, finn

I missed this. Why not default to ASCII like any decent programming language does in the absence of an explicit encoding? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Jack had the same question. The simple answer is: we need this in order to maintain backward compatibility when we move to phase two of the implementation. Here's the longer one: ASCII is the standard encoding for Python keywords and identifiers. There is no standard source code encoding for string literals. Unicode literals are interpreted using 'unicode-escape' which is an enhanced Latin-1 with escape semantics. This makes Latin-1 the right choice:
* Unicode literals already use it today
* As soon as we get to phase two of the implementation, 8-bit string literals will have to make the round trip raw binary -> Unicode -> raw binary and this only works if you make Latin-1 the default.
-- Marc-Andre Lemburg
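
The round trip "raw binary -> Unicode -> raw binary" mentioned above can be sketched as follows. This is an illustration in present-day Python 3, not the proposed tokenizer change; it only demonstrates the property of Latin-1 that the argument relies on, namely that byte N maps to code point N.

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding never fails ...
text = raw.decode("latin-1")

# ... and encoding gives back exactly the original bytes.
round_tripped = text.encode("latin-1")

print(round_tripped == raw)
print(all(ord(ch) == i for i, ch in enumerate(text)))
```

Both checks hold for any byte string, which is why arbitrary 8-bit literals survive the trip under a Latin-1 default.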

But they shouldn't, IMO. We should require an explicit encoding when more than ASCII is used, and I'd like to enforce this.
Sorry, I don't understand what you're trying to say here. Can you explain this with an example? Why can't we require any program encoded in more than pure ASCII to have an encoding magic comment? I guess I don't understand what you mean by "raw binary". Once you've explained it to me, the PEP should address this issue. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org> writes:
I agree. I recommend to deprecate this feature, and raise a DeprecationWarning if a Unicode literal contains non-ASCII characters but no encoding has been declared.
With the proposed implementation, the encoding declaration is only used for Unicode literals. In all other places where non-ASCII characters can occur (comments, string literals), those characters are treated as "bytes", i.e. it is not verified that these bytes are meaningful under the declared encoding. Marc's original proposal was to apply the declared encoding to the complete source code, but I objected claiming that it would make the tokenizer changes more complex, and the resulting tokenizer likely significantly slower (at least if you use the codecs API to perform the decoding). In phase 2, the encoding will apply to all strings. So it will not be possible to put arbitrary byte sequences in a string literal, at least if the encoding disallows certain byte sequences (like UTF-8, or ASCII). Since this is currently possible, we have a backwards compatibility problem. Regards, Martin
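
To make the phase 2 compatibility problem concrete, here is a small sketch in present-day Python 3 showing that some encodings reject byte sequences that others accept. The helper name `decodes_under` and the sample bytes are invented for this illustration.

```python
def decodes_under(data, encoding):
    """Return True if `data` is a valid byte sequence in `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# e-acute, "t", e-acute, as typed by someone whose editor saves Latin-1.
literal_bytes = b"\xe9t\xe9"

for name in ("latin-1", "utf-8", "ascii"):
    print(name, decodes_under(literal_bytes, name))
```

The bytes are acceptable as Latin-1 but are neither valid UTF-8 nor valid ASCII, so a file containing them could not be declared with either of those encodings once the declaration applies to all strings.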

I would say that any program that currently uses non-ASCII in string literals (whether Unicode or 8-bit literals) is strictly speaking undefined. For cases where a specific encoding is used, the solution is easy: add an explicit encoding. Other cases are simply garbage and should use \xDD escapes instead. Maybe an implementation phase 1a should be introduced that warns about the occurrence of non-ASCII characters anywhere in the source code when no encoding is specified. --Guido van Rossum (home page: http://www.python.org/~guido/)

"Martin v. Loewis" wrote:
I don't think that the codecs will significantly slow down overall compilation -- the compiler is not fast to begin with. However, changing the base type in the tokenizer and compiler from char* to Py_UNICODE* will be a significant effort and that's why I added two phases to the implementation. The first phase will only touch Unicode literals as proposed by Martin.
Right and I believe that a lot of people in European countries write string literals with a Latin-1 encoding in mind. We cannot simply break all that code. The other problem is with comments found in Python source code. In phase 2 these will break as well. So how about this: In phase 1, the tokenizer checks the *complete file* for non-ASCII characters and outputs a single warning per file if it doesn't find a coding declaration at the top. Unicode literals continue to use [raw-]unicode-escape as codec. In phase 2, we enforce ASCII as default encoding, i.e. the warning will turn into an error. The [raw-]unicode-escape codec will be extended to also support converting Unicode to Unicode, that is, only handle escape sequences in this case. -- Marc-Andre Lemburg
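
The phase 1 check proposed above might look roughly like this. It is a sketch in present-day Python 3 under stated assumptions, not the tokenizer code: the function name `phase1_warning`, the warning text, and the choice to look at the first two lines (taken from the vim discussion earlier in the thread) are all assumptions made for the illustration.

```python
import re

# Same pattern as discussed earlier in the thread, applied to raw bytes.
CODING_RE = re.compile(rb"coding[=:]\s*([\w-]+)")


def phase1_warning(source, filename):
    """Return one warning string for the whole file, or None (hypothetical helper)."""
    # Assumption: "at the top" means the first two lines of the file.
    first_two = source.splitlines()[:2]
    declared = any(CODING_RE.search(line) for line in first_two)
    # Any byte above 0x7F means the file is not pure ASCII.
    has_non_ascii = any(byte > 0x7F for byte in source)
    if has_non_ascii and not declared:
        return f"{filename}: non-ASCII bytes found but no coding declaration"
    return None


print(phase1_warning(b"print('hi')\n", "ascii_only.py"))
print(phase1_warning(b"s = '\xe9'\n", "undeclared.py"))
print(phase1_warning(b"# -*- coding: latin-1 -*-\ns = '\xe9'\n", "declared.py"))
```

The function returns at most one message per file, matching the "single warning per file" wording; in the proposed phase 2 the same condition would become an error.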

"M.-A. Lemburg" <mal@lemburg.com> writes:
Do you suggest that in this phase, the declared encoding is not used for anything except to complain? -1. I think people need to gain something from declaring the encoding; what they gain is that Unicode literals work right (i.e. that they really denote the strings that people see on their screen - given the appropriate editor). Regards, Martin

"Martin v. Loewis" wrote:
No. This is just an extra step on top of what is proposed in the PEP to make people aware of the problem in phase 1. -- Marc-Andre Lemburg

I just got a private response about the proposal from Atsuo Ishimoto, Japan. They use two different encodings in day-to-day life (one for windows, one for unix) and have their complete tool chain setup to auto-convert all files between the two environments. Recognizing the magic comment would pose a problem for them, since their tools assume conversion to the PC's locale setting. He proposed to make the interpreter's default encoding the default for source files which don't specify an encoding. That is ASCII on all standard Python installations and different encodings on tweaked installations. He also told me that they put raw Shift-JIS and EUC-JP into Python literal strings -- just like Europeans do with Latin-1. Wouldn't his suggestion be a good compromise for phase 2 ? -- Marc-Andre Lemburg

I'm OK with a way to change the default to something locale-specific, as long as there's also a way to make the default strict ASCII (for export). Maybe python -A could force the default encoding to be ASCII even if the locale specifies something different. (I'd still *prefer* it the other way around, where you have to specify an explicit option to make the default equal to the locale rather than ASCII, but I can see the other side. Sigh.) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Let's put it this way: the interpreter's default encoding has to be changed explicitly by the sys admin (in sitecustomize.py), so the decision to take e.g. a locale specific default encoding is one which the admin maintaining the installation has to make (with all the consequences that go with it). Per default, the default encoding is ASCII, so I don't think we really need an extra option. Hmm, could be that python -S already implies this, BTW... checking this reveals that even sys.setdefaultencoding() remains available if -S is used. Perhaps we should remove the API with -S too ?! -- Marc-Andre Lemburg

OK. I missed that part -- I thought that it would look in the locale by default.
Per default, the default encoding is ASCII, so I don't think we really need an extra option.
Agreed.
Hmm, could be that python -S already implies this, BTW...
:-)
I don't think so. It should be left in, caveat emptor. --Guido van Rossum (home page: http://www.python.org/~guido/)

The problem I have with PEP 263 right now is that the "-*- coding: -*-" magic is really sort of being abused. I gather that "coding:" is supposed to specify the encoding (what MIME calls "charset") of the file. But under PEP 263, it only refers to the Unicode string literals within the program. Everything else must still be treated as 8-bit text. For example, I'm not sure what effect "coding: utf-16" would have. (?) For another example, if you have UTF-8 Unicode string literals in your program but you also have 8-bit Latin-1 plain str string literals in the same program, how should you mark it? How will Emacs then treat it? Is a Python program an 8-bit string or a Unicode string? Right now, although perhaps someone who knows more about the parser than I can expand on this, it seems that Python programs are 8-bit strings. Therefore I argue that it makes no sense to use "coding:" to label a Python file, because the file doesn't consist of Unicode text. ## Jason Orendorff http://www.jorendorff.com/

Jason wrote:
The problem I have with PEP 263 right now is that the "-*- coding: -*-" magic is really sort of being abused.
really?
from the current version (revision 1.9) of the PEP: "The complete Python source file should use a single encoding."
For example, I'm not sure what effect "coding: utf-16" would have. (?)
"Only ASCII compatible encodings are allowed."
"Embedding of differently encoded data is not allowed"
"the proposed solution should be implemented in two phases: 1. Implement the magic comment detection and default encoding handling, but only apply the detected encoding to Unicode literals in the source file. 2. Change the tokenizer/compiler base string type from char* to Py_UNICODE* and apply the encoding to the complete file." </F>

"Jason Orendorff" <jason@jorendorff.com> writes:
Not really. If you are willing to separate the language and its implementation, then I'd phrase the intent that way:
- if an encoding is declared, all of the file must follow that encoding (all of them, always (*))
- in phase 1, the implementation will not verify that property, except for Unicode literals
- in phase 2, Python will implement Python completely in this respect.
For example, I'm not sure what effect "coding: utf-16" would have. (?)
Invalid; source encodings must be an ASCII superset (not sure how the implementation will react to that; if the file really is UTF-16, you'll get a syntax error, if you say it is UTF-16 but it isn't, Python will reject it in phase 2).
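
One rough way to see why UTF-16 fails the "ASCII superset" requirement, sketched in present-day Python 3. The helper `is_ascii_superset` is hypothetical; it only checks that printable ASCII text encodes to the same bytes under the given codec, which is a necessary condition rather than a full proof of ASCII compatibility.

```python
import string


def is_ascii_superset(encoding):
    """Return True if printable ASCII encodes to identical bytes under `encoding`.

    Hypothetical helper: a sample-based check in the encoding direction only.
    """
    sample = string.printable  # ASCII letters, digits, punctuation, whitespace
    try:
        return sample.encode(encoding) == sample.encode("ascii")
    except (LookupError, UnicodeEncodeError):
        # Unknown codec, or a codec that cannot represent ASCII at all.
        return False


for name in ("utf-8", "latin-1", "koi8-r", "utf-16"):
    print(name, is_ascii_superset(name))
```

UTF-16 fails because every ASCII character becomes two bytes (plus a byte order mark), so a tokenizer reading bytes could not even find the coding comment.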
You should mark the file as UTF-8. In phase 2, Python will reject it. At that point, you should convert your latin-1 string literal into hex escapes - it is binary data then, not Latin-1.
How will Emacs then treat it?
Don't know - just try. You cannot create such a file with Emacs.
Is a Python program an 8-bit string or a Unicode string?

"M.-A. Lemburg" <mal@lemburg.com> writes:
I expected that much; choosing Latin-1 as the default encoding is certainly Euro-centric. At the moment, declaring either eucJP or Shift-JIS wouldn't work with the proposed implementation, anyway, since those encodings are not supported in the standard Python installation.
Wouldn't his suggestion be a good compromise for phase 2 ?
This raises the question what exactly should be deprecated. AFAIK, both eucJP and Shift-JIS use non-ASCII bytes to denote Japanese characters, so they'd get a DeprecationWarning on every file. However, they could not put an encoding declaration into the file, as Python would not recognize the encoding. I don't see the convention to convert as too much of a stumbling block; to my knowledge, many editors can display text in both encodings correctly these days (but I may be wrong with that assumption). Regards, Martin

"Martin v. Loewis" wrote:
But they will be using Tamito's Japanese codecs... and, of course, they do work now in string literals, since there is no enforcement of any encoding in the compiler.
With Tamito's codecs installed, this wouldn't be a problem. Putting the encoding comment in the files will turn the compiler quiet in phase 1 and in phase 2 assure that their editors do in fact use the defined encoding. FYI, I've updated the PEP to use the interpreter's default encoding as basis for the source file encoding too. -- Marc-Andre Lemburg

"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
MvL> At the moment, declaring either eucJP or Shift-JIS wouldn't work with the proposed implementation, anyway, since those encodings are not supported in the standard Python installation.
Which actually touches on something I wanted to bring up. Why don't we include the Japanese codecs with Python? Is it just a size issue? The gzip'd tarball of the JapaneseCodecs-1.4.3 is 258k, unpacked it's 3.2M. Okay, so that's nontrivial, but I can think of 2 approaches:
- Have a second, sumo (no pun intended) release that includes the codecs
- Include the gzip'd tarball and do a distutils install at Python install time
I bet we'd win some Ruby converts if we did this <wink>. For reference, I'm thinking about including the Japanese and Chinese codecs with MM2.1 because it makes little sense to claim support for those languages without them. -Barry

"Barry A. Warsaw" wrote:
Why not simply make the installation a configure option ? We could easily extend setup.py to grab the tarball from the web in case it is needed.
Agreed. -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> Why not simply make the installation a configure option ? We could easily extend setup.py to grab the tarball from the web in case it is needed.
That's another option. Certainly stuff like that is becoming fairly common for installers these days.
>> I bet we'd win some Ruby converts if we did this <wink>. For reference, I'm thinking about including the Japanese and Chinese codecs with MM2.1 because it makes little sense to claim support for those languages without them.
MAL> Agreed.
-Barry

"Barry A. Warsaw" wrote:
Hmm, make that ZIP-ball (we have no .tar support in the standard lib, only ZIP-file support). Also, the setup.py will have to check whether it has to grab a level 0 compression ZIP file or a level 9 one. Nothing which cannot be done, of course... net installers are quite common these days (see e.g. Mozilla, IE and others), so people are probably quite used to them already. And we can always provide a full install download as well. -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> Hmm, make that ZIP-ball (we have no .tar support in the standard lib, only ZIP-file support). Also, the setup.py will have to check whether it has to grab a level 0 compression ZIP file or a level 9 one. Nothing which cannot be done, of course... net installers are quite common these days (see e.g. Mozilla, IE and others), so people are probably quite used to them already. And we can always provide a full install download as well.
Isn't there some PEP about all this? <wink> -Barry

barry@zope.com (Barry A. Warsaw) writes:
Which actually touches on something I wanted to bring up. Why don't we include the Japanese codecs with Python? Is it just a size issue?
I think Guido's original concern was about the size (apart from the fact that they were not available before). My concern is also correctness and efficiency. Most current systems provide high-performance well-tested codecs, since they need those frequently. It is a waste of resources not to make use of these codecs. The counter-argument, of course, is that you cannot always rely on these codecs being available (apart from the fact that you need wrappers around the platform API).
That is certainly the right thing to do. If correctness could be verified independently, I'd be in favour of including them with Python - even though they will likely never get the efficiency that wrappers around the platform's codecs would have. Regards, Martin

"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
>> I bet we'd win some Ruby converts if we did this <wink>. For reference, I'm thinking about including the Japanese and Chinese codecs with MM2.1 because it makes little sense to claim support for those languages without them.
MvL> That is certainly the right thing to do. If correctness could be verified independently, I'd be in favour of including them with Python - even though they will likely never get the efficiency that wrappers around the platform's codecs would have.
I'm obviously not qualified to verify them independently, but I have had some initial positive feedback from a few Japanese users of the MM2.1 alphas. My second hand information indicates that the Japanese codecs are pretty good, the Chinese are okay, and the Korean ones need a lot of work. Also, it's a bit of a catch 22, in that the more official exposure these codecs get, the better they will eventually become, hopefully. I'd be +1 on including them in Python 2.3. -Barry

"Barry A. Warsaw" wrote:
You could (and probably should) add Tamito's codecs in Python, but the others have licensing problems :-/ It shouldn't be hard though for native speakers and programmers to build upon the work of Tamito and get those codecs done as well. Alternatively, the PSF or some company interested in having these codecs available could fund the development. -- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
You could (and probably should) add Tamito's codecs in Python, but the others have licensing problems :-/
I would not recommend to incorporate any of this into Python without asking the author(s). When doing so, it would be appropriate, IMO, to ask them whether they would fill out the contributor agreement. Then, the presumed licensing problems would be gone. Regards, Martin

[This thread probably ought to be moved to i18n-sig, so I'm CC'ing them and will remove all future cc's to python-dev. -BAW]
"MAL" == M <mal@lemburg.com> writes:
MAL> You could (and probably should) add Tamito's codecs in Python, but the others have licensing problems :-/
I believe I am using Tamito KAJIYAMA's codecs, from: http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ Or were you thinking about some different Japanese codecs? The ones at this url are BSD-ish and so should be compatible with the PSF license, GPL, etc.
MAL> It shouldn't be hard though for native speakers and programmers to build upon the work of Tamito and get those codecs done as well. Alternatively, the PSF or some company interested in having these codecs available could fund the development.
All good points. I still think that giving more visibility to the codecs (i.e. adding them to the Python distro) would help bring muscle to the effort.
"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
MvL> I would not recommend to incorporate any of this into Python without asking the author(s). When doing so, it would be appropriate, IMO, to ask them whether they would fill out the contributor agreement. Then, the presumed licensing problems would be gone.
Agreed on both points! -Barry

I've been working on a unified architecture for the Asian codecs. I presented a paper about it at the last Unicode Conference in Washington D.C. You can find it at http://www.basistech.com/articles/python-zh-transcoding_iuc20_TE2.pdf The presentation concentrates on Chinese, but the architecture will work for JK as well. -tree -- Tom Emerson Basis Technology Corp. Sr. Computational Linguist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"

"Barry A. Warsaw" wrote:
+1. The PSF will have to agree on the contribution docs first, though. Since there's no discussion on the PSF docs discussion list, I suppose everybody is happy with them :-) BTW, I was referring to the other codecs in the python-codecs project on SF. Most of those are encumbered by the GPL and thus unusable in non-GPL projects. Tamito has switched to a BSD-license after some private discussions about this, which is goodness :-) -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> The PSF will have to agree on the contribution docs first, though. Since there's no discussion on the PSF docs discussion list, I suppose everybody is happy with them :-)
I am. What do we need to do next?
MAL> BTW, I was referring to the other codecs in the python-codecs project on SF. Most of those are encumbered by the GPL and thus unusable in non-GPL projects. Tamito has switched to a BSD-license after some private discussions about this, which is goodness :-)

"Barry A. Warsaw" wrote:
Wait. The deadline is mid-March. After that the docs will have to go to the lawyer and only then we can use them...
Me neither, but Tamito has put a lot of work into them and with his move to C for the codec engine, speed is not an issue anymore either. Also, I've asked him about his thoughts about having them included in the core before. He would be happy with that move. -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> Wait. The deadline is mid-March. After that the docs will have to go to the lawyer and only then we can use them...
Right, I forgot. ;)
MAL> Me neither, but Tamito has put a lot of work into them and with his move to C for the codec engine, speed is not an issue anymore either. Also, I've asked him about his thoughts about having them included in the core before. He would be happy with that move.
Cool! -Barry

"Martin v. Loewis" wrote:
Which wrapper APIs do we currently have which could actually be made part of the Python core ? Aside: while it's true that we could use those, the Unicode implementation has shown that rolling our own has worked out quite well too. -- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
Which wrapper APIs do we currently have which could actually be made part of the Python core ?
On Unix, we have iconv(3). On Windows, we have MultiByteToWideChar, which would need to be wrapped with a map translating codec names to codepage numbers. There is also a codec API through a COM interface provided by Internet Exploder; I don't have the name of that interface right now. On all platforms, we could easily wrap the Tcl encodings, which are available everywhere where Python is available. Not sure what the performance implications would be. There also could be a wrapper around ICU. On OS X, CFStringCreateFromExternalRepresentation could be used.
There have been a few correctness glitches in those, but overall, I'd agree that they have worked quite well. Performance is a different issue, though; people just haven't complained, yet, IMO. Regards, Martin

On Wednesday, February 27, 2002, at 10:16 , M.-A. Lemburg wrote:
+1 -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -

[M.-A. Lemburg]
But there is: Python uses the 7-bit ASCII character set for program text and string literals. 8-bit characters may be used in string literals and comments but their interpretation is platform dependent; the proper way to insert 8-bit characters in string literals is by using octal or hexadecimal escape sequences. The Ref Man has said "7-bit ASCII" for both "program text and string literals" for a long time. The formal grammar in the Ref Man agrees with this (including the formal grammar for Unicode literals). It's an historical accident that the tokenizer happened to use C isalpha() to "enforce" this for identifiers, and that C isalpha() happened to grow locale-dependence while Guido was too drunk with power to notice <wink>.
Unicode literals are interpreted using 'unicode-escape' which is an enhanced Latin-1 with escape semantics.
I'm sure they *do* "act like" Latin-1 on your box, and that identifiers also act like Latin-1 was in effect on your box. But the Ref Man explicitly says all that is platform dependent; there's no "backward compatibility" to preserve here beyond 7-bit ASCII unless you want to preserve that Python always rely on what C isalpha() says.

Tim Peters wrote:
It's a fact of life that users don't read reference manuals, but simply write programs and feel good if they happen to work :-) As a result, programs have used string literals in many different encodings for a long time. Changing this situation will take time. The proposal aims at clarifying the situation and to make the transition less painful.
You tell that to the Russians, Japanese or the Europeans writing Python programs -- it just happens that comments and literals are bound to end up using local encodings. Anyway, with the PEP implemented we'll no longer have to restrict ourselves to 7-bit US-ASCII, so all these problems will go away. -- Marc-Andre Lemburg

I've updated the PEP with the new requirements. http://python.sourceforge.net/peps/pep-0263.html The new scheme for the default encoding now maps the standard procedure for all other conversions in Python which go from strings to Unicode: use the sys.getdefaultencoding(). This happens to be ASCII in all standard installations, but sys admins may change it at their own risk and liking. -- Marc-Andre Lemburg

>> Python uses the 7-bit ASCII character set for program text and string >> literals. 8-bit characters may be used in string literals and >> comments but their interpretation is platform dependent; the proper >> way to insert 8-bit characters in string literals is by using octal >> or hexadecimal escape sequences. mal> It's a fact of life that users don't read reference manuals, but mal> simply write programs and feel good if they happen to work :-) Perhaps a warning should be emitted by the compiler if a plain string literal is found that contains 8-bit characters. Better yet, perhaps Neal can add this to PyChecker if he hasn't already... Skip

Skip Montanaro wrote:
See the PEP: this is what phase 1 will do; phase 2 won't accept such a file without an explicit encoding declaration. -- Marc-Andre Lemburg

>> Perhaps a warning should be emitted by the compiler if a plain string >> literal is found that contains 8-bit characters. Better yet, perhaps >> Neal can add this to PyChecker if he hasn't already... mal> See the PEP: this is what phase 1 will do; phase 2 won't accept mal> such a file without an explicit encoding declaration. That wasn't what I was getting at. The quoted part of the reference manual seemed to suggest that programmers should be using hex escapes in string literals instead of 8-bit characters. This doesn't seem to me to be related to what encoding the file is in. Skip

Skip Montanaro <skip@pobox.com> writes:
PEP 263 says "the tokenizer must check the complete source file for compliance with the default encoding". That part of the reference manual will become incorrect: the meaning of 8-bit characters (rather: bytes) will be well-defined if you have an encoding declaration. If the default encoding is ASCII, and you have an 8-bit character, the compiler will emit a warning if it is enhanced to follow PEP 263. So what were you getting at? Regards, Martin

Martin> If the default encoding is ASCII, and you have a 8-bit Martin> character, the compiler will emit a warning if it is enhanced to Martin> follow PEP 263. So what were you getting at? I was thinking about strings used as byte containers for non-character data. Skip

Skip Montanaro wrote:
In string literals ? I think it is common to encode this sort of data as hex or using octal escapes. Since these encodings are plain 7-bit ASCII I don't see a problem. Your hint about the manual is correct though: we'll have to adapt that to the new reading as well. -- Marc-Andre Lemburg
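
A minimal sketch of the point about escapes, written in modern Python 3 rather than the 2.x interpreter under discussion (so a bytes literal stands in for the 8-bit string literal of the time). The variable names are illustrative only. It shows that a literal written with hex escapes is pure ASCII in the source file even though the value it denotes contains bytes above 127.

```python
import ast

# The source text of a literal that carries binary data via hex escapes.
# Every character in this source text is 7-bit ASCII.
literal_source = r'b"\xe9\x00\xff"'

# str.isascii() is available from Python 3.7 onwards.
source_is_ascii = literal_source.isascii()

# Evaluating the literal yields the raw byte values, including 8-bit ones.
value = ast.literal_eval(literal_source)
```

Because the source stays ASCII, such a literal would not be affected by whichever source encoding is declared or assumed.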

>> I was thinking about strings used as byte containers for >> non-character data. mal> In string literals ? I think it is common to encode this sort of mal> data as hex or using octal escapes. Since these encodings are plain mal> 7-bit ASCII I don't see a problem. Precisely. I was thinking about situations where they aren't encoded, but sitting there naked, so to speak. Skip

Skip Montanaro <skip@pobox.com> writes:
I was thinking about strings used as byte containers for non-character data.
Ok, but then you also said that you would want to produce a warning for those? How can you tell them apart from "proper" character strings if the encoding allows arbitrary byte sequences (like Latin-1)? Regards, Martin

>> I was thinking about strings used as byte containers for >> non-character data. Martin> Ok, but then you also said that you would want to produce a Martin> warning for those? Never mind. I'm probably just confused. Skip

Finn Bock wrote:
Hmm, in phase two we will need to decode the source code file using some encoding into Unicode and then reencode the 8-bit string parts using that same encoding. The only requirement we have for that is round-trip safety, so that string literals turn out as the same value you see in the source file. Now, Unicode literals are explicit about this: unicode-escape is a latin-1 codec with some escaping knowledge. I'm not sure how to get this in line with the "any round-trip safe encoding" strategy... OTOH, if Jython users write source code which depends on the PC's locale then they are bound to write non-portable code, so fixing one encoding would certainly help here. What I don't understand is why you read the file using the PC's locale. Wouldn't it be possible to set the file encoding prior to reading from it ? -- Marc-Andre Lemburg

M.-A. Lemburg wrote:
After looking at several PEPs over the last couple of days, I suggest that PEP 1 be updated to require inclusion of the Last-Modified: field. At the very least, I suggest that Post-History: be checked more rigorously. (PEP 263 contains a Post-History: field, but it is blank.) I don't think it's necessary to retrofit every PEP, but I think that every PEP up for consideration should be required to comply. -- --- Aahz (@pobox.com) Hugs and backrubs -- I break Rule 6 <*> http://www.rahul.net/aahz/ Androgynous poly kinky vanilla queer het Pythonista We must not let the evil of a few trample the freedoms of the many.

Another possibly valuable thing: another useful heuristic to divide the open PEPs, beyond proof-of-concept/no-proof-of-concept, is new-syntax/new-keywords/new-"funny"-semantics/non-backward-compatible vs. infrastructure/library/etc./BDFL-championed. Those PEPs especially make people wonder: will that happen to my favorite language, oh god, when? It seems real soon now - gulp, gasp. regards, Samuele Pedroni.

"Barry A. Warsaw" wrote:
FYI, pep2html.py now makes the date in the Last-Modified header a link to the ViewCVS page on SF. It also auto-generates the date from the mtime of the PEP file, if the header is given, but doesn't have a value. I'm sure, pep2html.py could provide more help like this in other areas too... it's a great tool ! -- Marc-Andre Lemburg

"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
MvL> Guido van Rossum <guido@python.org> writes: >> That looks OK to me. Is the Emacs-style comment in fact >> compatible with Emacs? MvL> It is. I expect many people want to put "utf-8" as the MvL> encoding name, and you need Emacs 21 for that (or Emacs with MvL> Mule-UCS, or some such). MvL> In GNU Emacs, you see the effect of the coding: directive in MvL> the Emacs status line. Just try the attached file, it will MvL> indicate "R" for KOI8-R. Not sure about XEmacs. I don't think it works for XEmacs. I've got a MULE-aware XEmacs 21.4.6 and while it asks if I want to set the local variables in the -*- line, I still see "Raw" in the modeline, and I see the following letters in the print string (with funny little lines above the characters): iAOOEI. See attached capture. That doesn't seem right, does it? -Barry

"Martin v. Loewis" wrote:
After reading some of the Emacs docs, I think we should allow a more flexible coding line: -*- ... coding: (\w+) ... -*- because you will sometimes want to add more variables to that Emacs init line than just the encoding declaration. Does anybody know where XEmacs is moving w/r to this ? (and for that matter what about vi, vim, etc. ?) -- Marc-Andre Lemburg

On Tue, Feb 26, 2002 at 08:50:35PM +0100, M.-A. Lemburg wrote:
Does anybody know where XEmacs is moving w/r to this ? (and for that matter what about vi, vim, etc. ?)
I'm working with Vim 6.0, 2001 Sep 14. VIM lets you set variables with text similar to vim:KEY=VALUE:KEY=VALUE:....: Apparently you would use vim:fileencoding=sjis: to select shift-jis encoding. In the vim style, it seems most common to place this at the bottom of a file, but it can be placed at the top too. The variable "modelines" controls how many lines at each end of the file are inspected, with the default being 5. It's documented that the form vi:set KEY=VALUE: may be compatible with "some versions of Vi" but does not say which. (I can't get this to work) You can set a list of encodings to attempt when a file is loaded, which defaults to "ucs-bom,utf-8,latin1". A user who wanted to treat non-unicode files as shift-jis by default would :set fileencodings=ucs-bom,utf-8,sjis You can also load a particular file with the ++enc parameter: :edit ++enc=koi8-r russian.txt (I can get this to work, but I have to do it manually to load anything in an odd character set) The emacs line is harmless in vim, but doesn't do anything. It's possible that using :autocmd someone could make vim use the emacs line to set encoding, but I'm not sure -- setting fileencoding after a file is loaded seems to perform a translation from the old character set to the new. Jeff

jepler@unpythonic.dhs.org wrote:
So if we use the RE "coding[=:]\s*([\w-]+)" on the first line, we should be able to reach out for the encoding, right ? This RE would then cover both vim and emacs. -- Marc-Andre Lemburg
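
A minimal sketch of how that RE could be applied, assuming modern Python 3. The function name `find_declared_encoding` and the `max_lines` parameter are hypothetical, introduced only for illustration; the pattern itself is the one quoted in the message above.

```python
import re

# Pattern from the message above: matches both the Emacs form
# ("-*- coding: name -*-") and the vim form ("fileencoding=name").
CODING_RE = re.compile(r"coding[=:]\s*([\w-]+)")


def find_declared_encoding(source_bytes, max_lines=1):
    """Return the declared encoding name, or None if none is found.

    max_lines is an assumption of this sketch; the message only
    mentions looking at the first line.
    """
    for line in source_bytes.split(b"\n")[:max_lines]:
        # Latin-1 accepts every byte value, so scanning cannot fail.
        match = CODING_RE.search(line.decode("latin-1"))
        if match:
            return match.group(1)
    return None
```

For example, `find_declared_encoding(b"# -*- coding: koi8-r -*-\n")` picks out `koi8-r`, and a line such as `# vim:fileencoding=sjis:` yields `sjis` because the pattern only requires the text to end in `coding=` or `coding:`.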

On Wed, Feb 27, 2002 at 10:20:57AM +0100, M.-A. Lemburg wrote:
I've been informed on a #vim irc channel that "vim:fileencoding=blah:" does not work. Unfortunate. I overlooked the part of the documentation which states To read a file in a certain encoding it won't work by setting 'fileencoding', use the |++enc| argument. However, there's a "charset plugin" for vim: http://vim.sourceforge.net/scripts/script.php?script_id=199 which could be adapted to follow whatever convention is chosen for Python. However, this plugin is not standard in any version of vim. It's not clear what license it's under, but referencing it from the PEP and documenting that something like au BufReadPost *.py ReloadWhenCharset(1, "coding[:=]\s([\w-]+)") au BufReadPost *.py ReloadWhenCharset(2, "coding[:=]\s([\w-]+)") (search the first two lines for the emacs coding special marker) would cause it to detect the charset of a Python file would certainly be possible. The plugin functions by executing a reload of the file with ++enc when ReloadWhenCharset matches its pattern. Jeff

This actually works in vim with "charset plugin": let s:pep263='coding[:=]\s*\([-A-Za-z0-9_]\+\)' au BufReadPost *.py call ReloadWhenCharsetSet(1, s:pep263) au BufReadPost *.py call ReloadWhenCharsetSet(2, s:pep263) It searches for a RE compatible with PEP263 in the first and second lines. You could change the pattern from *.py to * if you want to recognize the emacs-style coding in all files. Jeff

Jeff Epler wrote:
Great ! So we can say that the RE fits vim and emacs, right ? -- Marc-Andre Lemburg

[MAL]
I consider the above PEP ready for review by the developers. Please comment.
The PEP seems to dictate that the source by default must be read as latin-1: """ Python will default to Latin-1 as standard encoding if no other encoding hints are given. """ Jython already reads the Python source with the default Java encoding, which usually depends on the PC's locale. If a small loophole could be added to that requirement, then the PEP has my full support. regards, finn

I missed this. Why not default to ASCII like any decent programming language does in the absence of an explicit encoding? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Jack had the same question. The simple answer is: we need this in order to maintain backward compatibility when we move to phase two of the implementation. Here's the longer one: ASCII is the standard encoding for Python keywords and identifiers. There is no standard source code encoding for string literals. Unicode literals are interpreted using 'unicode-escape' which is an enhanced Latin-1 with escape semantics. This makes Latin-1 the right choice: * Unicode literals already use it today * As soon as we get to phase two of the implementation, 8-bit string literals will have to make the round trip raw binary -> Unicode -> raw binary and this only works if you make Latin-1 the default. -- Marc-Andre Lemburg
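
The round-trip argument can be checked with a short sketch, assuming modern Python 3 (where `bytes` plays the role of the 8-bit string). The helper name `round_trips` is hypothetical. It decodes every possible byte value to Unicode and encodes it back, and compares the result with the input.

```python
def round_trips(data, encoding):
    """True if bytes -> Unicode -> bytes gives back the original bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # Decoding (or re-encoding) rejected some byte sequence.
        return False


# Every byte value from 0 to 255, i.e. "raw binary" in the worst case.
all_bytes = bytes(range(256))

latin1_ok = round_trips(all_bytes, "latin-1")  # Latin-1 maps each byte to one code point
utf8_ok = round_trips(all_bytes, "utf-8")      # lone high bytes are invalid UTF-8
ascii_ok = round_trips(all_bytes, "ascii")     # bytes above 127 are invalid ASCII
```

Latin-1 survives the trip for arbitrary bytes, while UTF-8 and ASCII reject some inputs, which is the property the message relies on.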

But they shouldn't, IMO. We should require an explicit encoding when more than ASCII is used, and I'd like to enforce this.
Sorry, I don't understand what you're trying to say here. Can you explain this with an example? Why can't we require any program encoded in more than pure ASCII to have an encoding magic comment? I guess I don't understand what you mean by "raw binary". Once you've explained it to me, the PEP should address this issue. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org> writes:
I agree. I recommend to deprecate this feature, and raise a DeprecationWarning if a Unicode literal contains non-ASCII characters but no encoding has been declared.
With the proposed implementation, the encoding declaration is only used for Unicode literals. In all other places where non-ASCII characters can occur (comments, string literals), those characters are treated as "bytes", i.e. it is not verified that these bytes are meaningful under the declared encoding. Marc's original proposal was to apply the declared encoding to the complete source code, but I objected claiming that it would make the tokenizer changes more complex, and the resulting tokenizer likely significantly slower (at least if you use the codecs API to perform the decoding). In phase 2, the encoding will apply to all strings. So it will not be possible to put arbitrary byte sequences in a string literal, at least if the encoding disallows certain byte sequences (like UTF-8, or ASCII). Since this is currently possible, we have a backwards compatibility problem. Regards, Martin

I would say that any program that currently uses non-ASCII in string literals (whether Unicode or 8-bit literals) is, strictly speaking, undefined. For cases where a specific encoding is used, the solution is easy: add an explicit encoding. Other cases are simply garbage and should use \xDD escapes instead. Maybe an implementation phase 1a should be introduced that warns about the occurrence of non-ASCII characters anywhere in the source code when no encoding is specified. --Guido van Rossum (home page: http://www.python.org/~guido/)

"Martin v. Loewis" wrote:
I don't think that the codecs will significantly slow down overall compilation -- the compiler is not fast to begin with. However, changing the base type in the tokenizer and compiler from char* to Py_UNICODE* will be a significant effort and that's why I added two phases to the implementation. The first phase will only touch Unicode literals as proposed by Martin.
Right and I believe that a lot of people in European countries write string literals with a Latin-1 encoding in mind. We cannot simply break all that code. The other problem is with comments found in Python source code. In phase 2 these will break as well. So how about this: In phase 1, the tokenizer checks the *complete file* for non-ASCII characters and outputs a single warning per file if it doesn't find a coding declaration at the top. Unicode literals continue to use [raw-]unicode-escape as codec. In phase 2, we enforce ASCII as default encoding, i.e. the warning will turn into an error. The [raw-]unicode-escape codec will be extended to also support converting Unicode to Unicode, that is, only handle escape sequences in this case. -- Marc-Andre Lemburg
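
The phase 1 check described above might look roughly like the following sketch, assuming modern Python 3. The function name `check_source`, the warning text, and the choice to look at the first two lines for the declaration are all assumptions of the sketch, not taken from the PEP or any implementation.

```python
import re
import warnings

# Bytes version of the coding pattern discussed earlier in the thread.
CODING_RE = re.compile(rb"coding[=:]\s*([\w-]+)")


def check_source(source_bytes, filename="<source>"):
    """Return True if the file is acceptable; warn once per file otherwise."""
    # Assumption: "at the top" is taken to mean the first two lines.
    first_lines = source_bytes.split(b"\n")[:2]
    declared = any(CODING_RE.search(line) for line in first_lines)

    # The whole file is scanned, including comments and plain strings.
    has_non_ascii = any(byte > 127 for byte in source_bytes)

    if has_non_ascii and not declared:
        warnings.warn(
            f"{filename}: non-ASCII bytes found but no encoding declared",
            stacklevel=2,
        )
        return False
    return True
```

Under the phase 2 proposal, the `warnings.warn` call would become an error instead, so the same check could serve both phases.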

"M.-A. Lemburg" <mal@lemburg.com> writes:
Do you suggest that in this phase, the declared encoding is not used for anything except to complain? -1. I think people need to gain something from declaring the encoding; what they gain is that Unicode literals work right (i.e. that they really denote the strings that people see on their screen - given the appropriate editor). Regards, Martin

"Martin v. Loewis" wrote:
No. This is just an extra step on top of what is proposed in the PEP to make people aware of the problem in phase 1. -- Marc-Andre Lemburg

I just got a private response about the proposal from Atsuo Ishimoto, Japan. They use two different encodings in day-to-day life (one for Windows, one for Unix) and have their complete tool chain set up to auto-convert all files between the two environments. Recognizing the magic comment would pose a problem for them, since their tools assume conversion to the PC's locale setting. He proposed to make the interpreter's default encoding the default for source files which don't specify an encoding. That is ASCII on all standard Python installations and different encodings on tweaked installations. He also told me that they put raw Shift-JIS and EUC-JP into Python literal strings -- just like Europeans do with Latin-1. Wouldn't his suggestion be a good compromise for phase 2 ? -- Marc-Andre Lemburg

I'm OK with a way to change the default to something locale-specific, as long as there's also a way to make the default strict ASCII (for export). Maybe python -A could force the default encoding to be ASCII even if the locale specifies something different. (I'd still *prefer* it the other way around, where you have to specify an explicit option to make the default equal to the locale rather than ASCII, but I can see the other side. Sigh.) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Let's put it this way: the interpreter's default encoding has to be changed explicitly by the sys admin (in sitecustomize.py), so the decision to take e.g. a locale specific default encoding is one which the admin maintaining the installation has to make (with all the consequences that go with it). Per default, the default encoding is ASCII, so I don't think we really need an extra option. Hmm, could be that python -S already implies this, BTW... checking this reveals that even sys.setdefaultencoding() remains available if -S is used. Perhaps we should remove the API with -S too ?! -- Marc-Andre Lemburg

OK. I missed that part -- I thought that it would look in the locale by default.
Per default, the default encoding is ASCII, so I don't think we really need an extra option.
Agreed.
Hmm, could be that python -S already implies this, BTW...
:-)
I don't think so. It should be left in, caveat emptor. --Guido van Rossum (home page: http://www.python.org/~guido/)

The problem I have with PEP 263 right now is that the "-*- coding: -*-" magic is really sort of being abused. I gather that "coding:" is supposed to specify the encoding (what MIME calls "charset") of the file. But under PEP 263, it only refers to the Unicode string literals within the program. Everything else must still be treated as 8-bit text. For example, I'm not sure what effect "coding: utf-16" would have. (?) For another example, if you have UTF-8 Unicode string literals in your program but you also have 8-bit Latin-1 plain str string literals in the same program, how should you mark it? How will Emacs then treat it? Is a Python program an 8-bit string or a Unicode string? Right now, although perhaps someone who knows more about the parser than I can expand on this, it seems that Python programs are 8-bit strings. Therefore I argue that it makes no sense to use "coding:" to label a Python file, because the file doesn't consist of Unicode text. ## Jason Orendorff http://www.jorendorff.com/

Jason wrote:
The problem I have with PEP 263 right now is that the "-*- coding: -*-" magic is really sort of being abused.
really?
from the current version (revision 1.9) of the PEP: "The complete Python source file should use a single encoding."
For example, I'm not sure what effect "coding: utf-16" would have. (?)
"Only ASCII compatible encodings are allowed."
"Embedding of differently encoded data is not allowed"
"the proposed solution should be implemented in two phases: 1. Implement the magic comment detection and default encoding handling, but only apply the detected encoding to Unicode literals in the source file. 2. Change the tokenizer/compiler base string type from char* to Py_UNICODE* and apply the encoding to the complete file." </F>

"Jason Orendorff" <jason@jorendorff.com> writes:
Not really. If you are willing to separate the language and its implementation, then I'd phrase the intent that way: - if an encoding is declared, all of the file must follow that encoding (all of them, always (*)) - in phase 1, the implementation will not verify that property, except for Unicode literals - in phase 2, Python will implement Python completely in this respect.
For example, I'm not sure what effect "coding: utf-16" would have. (?)
Invalid; source encodings must be an ASCII superset (not sure how the implementation will react to that; if the file really is UTF-16, you'll get a syntax error, if you say it is UTF-16 but it isn't, Python will reject it in phase 2).
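
A minimal sketch, assuming modern Python 3, of why an ASCII superset is needed for the declaration to be found at all. The variable names are illustrative. The same source text is encoded as UTF-8 and as UTF-16, and the raw bytes are searched for the ASCII marker `coding:`.

```python
# A two-line source file that declares its own encoding.
source = "# -*- coding: utf-16 -*-\nname = 'x'\n"

ascii_bytes = source.encode("ascii")
utf8_bytes = source.encode("utf-8")
utf16_bytes = source.encode("utf-16")

marker = b"coding:"

# UTF-8 is an ASCII superset: pure ASCII text encodes to the same bytes,
# so a byte-level scan for the marker succeeds.
found_in_utf8 = marker in utf8_bytes

# UTF-16 uses two bytes per character here, so the marker's bytes are
# interleaved with NUL bytes and the byte-level scan does not match.
found_in_utf16 = marker in utf16_bytes
```

A tokenizer that reads the declaration before it knows the encoding can therefore only locate it when the ASCII range is encoded as ASCII bytes.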
You should mark the file as UTF-8. In phase 2, Python will reject it. At that point, you should convert your latin-1 string literal into hex escapes - it is binary data then, not Latin-1.
How will Emacs then treat it?
Don't know - just try. You cannot create such a file with Emacs.
Is a Python program an 8-bit string or a Unicode string?

"M.-A. Lemburg" <mal@lemburg.com> writes:
I expected that much; choosing Latin-1 as the default encoding is certainly Euro-centric. At the moment, declaring either eucJP or Shift-JIS wouldn't work with the proposed implementation, anyway, since those encodings are not supported in the standard Python installation.
Wouldn't his suggestion be a good compromise for phase 2 ?
This raises the question what exactly should be deprecated. AFAIK, both eucJP and Shift-JIS use non-ASCII bytes to denote Japanese characters, so they'd get a DeprecationWarning on every file. However, they could not put an encoding declaration into the file, as Python would not recognize the encoding. I don't see the convention to convert as too much of a stumbling block; to my knowledge, many editors can display text in both encodings correctly these days (but I may be wrong with that assumption). Regards, Martin

"Martin v. Loewis" wrote:
But they will be using Tamito's Japanese codecs... and, of course, they do work now in string literals, since there is no enforcement of any encoding in the compiler.
With Tamito's codecs installed, this wouldn't be a problem. Putting the encoding comment in the files will turn the compiler quiet in phase 1 and in phase 2 assure that their editors do in fact use the defined encoding. FYI, I've updated the PEP to use the interpreter's default encoding as basis for the source file encoding too. -- Marc-Andre Lemburg

"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
MvL> At the moment, declaring either eucJP or Shift-JIS MvL> wouldn't work with the proposed implementation, anyway, since MvL> those encodings are not supported in the standard Python MvL> installation. Which actually touches on something I wanted to bring up. Why don't we include the Japanese codecs with Python? Is it just a size issue? The gzip'd tarball of the JapaneseCodecs-1.4.3 is 258k, unpacked it's 3.2M. Okay, so that's nontrivial, but I can think of 2 approaches: - Have a second, sumo (no pun intended) release that includes the codecs - Include the gzip'd tarball and do a distutils install at Python install time I bet we'd win some Ruby converts if we did this <wink>. For reference, I'm thinking about including the Japanese and Chinese codecs with MM2.1 because it makes little sense to claim support for those languages without them. -Barry

"Barry A. Warsaw" wrote:
Why not simply make the installation a configure option ? We could easily extend setup.py to grab the tarball from the web in case it is needed.
Agreed. -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> Why not simply make the installation a configure option ? MAL> We could easily extend setup.py to grab the tarball from MAL> the web in case it is needed. That's another option. Certainly stuff like that is becoming fairly common for installers these days. >> I bet we'd win some Ruby converts if we did this <wink>. For >> reference, I'm thinking about including the Japanese and >> Chinese codecs with MM2.1 because it makes little sense to >> claim support for those languages without them. MAL> Agreed. -Barry

"Barry A. Warsaw" wrote:
Hmm, make that ZIP-ball (we have no .tar support in the standard lib, only ZIP-file support). Also, the setup.py will have to check whether it has to grab a level 0 compression ZIP file or a level 9 one. Nothing which cannot be done, of course... net installers are quite common these days (see e.g. Mozilla, IE and others), so people are probably quite used to them already. And we can always provide a full install download as well. -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> Hmm, make that ZIP-ball (we have no .tar support in the standard lib, only ZIP-file support). Also, the setup.py will have to check whether it has to grab a level 0 compression ZIP file or a level 9 one.
MAL> Nothing which cannot be done, of course... net installers are quite common these days (see e.g. Mozilla, IE and others), so people are probably quite used to them already. And we can always provide a full install download as well.
Isn't there some PEP about all this? <wink>
-Barry

barry@zope.com (Barry A. Warsaw) writes:
Which actually touches on something I wanted to bring up. Why don't we include the Japanese codecs with Python? Is it just a size issue?
I think Guido's original concern was about the size (apart from the fact that they were not available before). My concern is also correctness and efficiency. Most current systems provide high-performance well-tested codecs, since they need those frequently. It is a waste of resources not to make use of these codecs. The counter-argument, of course, is that you cannot always rely on these codecs being available (apart from the fact that you need wrappers around the platform API).
That is certainly the right thing to do. If correctness could be verified independently, I'd be in favour of including them with Python - even though they will likely never get the efficiency that wrappers around the platform's codecs would have. Regards, Martin
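Whether a given codec can be relied on is something a program can test at run time. A minimal sketch, using the standard `codecs.lookup` call; the helper name `probe_codecs` and the deliberately bogus codec name in the usage line are assumptions for illustration only.

```python
import codecs


def probe_codecs(names):
    """Map each requested encoding name to its canonical codec name, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = codecs.lookup(name).name
        except LookupError:
            # codecs.lookup raises LookupError for encodings this Python does not know
            found[name] = None
    return found


# Usage: check the encodings discussed here plus one that cannot exist.
result = probe_codecs(["euc_jp", "shift_jis", "utf-8", "no_such_codec_xyz"])
for requested, canonical in result.items():
    print(requested, "->", canonical or "not available")
```

A caller can fall back to another encoding, or report a clear error, for any name that maps to `None`.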

"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
>> I bet we'd win some Ruby converts if we did this <wink>. For reference, I'm thinking about including the Japanese and Chinese codecs with MM2.1 because it makes little sense to claim support for those languages without them.
MvL> That is certainly the right thing to do. If correctness could be verified independently, I'd be in favour of including them with Python - even though they will likely never get the efficiency that wrappers around the platform's codecs would have.
I'm obviously not qualified to verify them independently, but I have had some initial positive feedback from a few Japanese users of the MM2.1 alphas. My second-hand information indicates that the Japanese codecs are pretty good, the Chinese are okay, and the Korean ones need a lot of work. Also, it's a bit of a catch-22, in that the more official exposure these codecs get, the better they will eventually become, hopefully. I'd be +1 on including them in Python 2.3.
-Barry

"Barry A. Warsaw" wrote:
You could (and probably should) add Tamito's codecs in Python, but the others have licensing problems :-/ It shouldn't be hard though for native speakers and programmers to build upon the work of Tamito and get those codecs done as well. Alternatively, the PSF or some company interested in having these codecs available could fund the development. -- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
You could (and probably should) add Tamito's codecs in Python, but the others have licensing problems :-/
I would not recommend to incorporate any of this into Python without asking the author(s). When doing so, it would be appropriate, IMO, to ask them whether they would fill out the contributor agreement. Then, the presumed licensing problems would be gone. Regards, Martin

[This thread probably ought to be moved to i18n-sig, so I'm CC'ing them and will remove all future cc's to python-dev. -BAW]
"MAL" == M <mal@lemburg.com> writes:
MAL> You could (and probably should) add Tamito's codecs in Python, but the others have licensing problems :-/
I believe I am using Tamito KAJIYAMA's codecs, from: http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ Or were you thinking about some different Japanese codecs? The ones at this URL are BSD-ish and so should be compatible with the PSF license, GPL, etc.
MAL> It shouldn't be hard though for native speakers and programmers to build upon the work of Tamito and get those codecs done as well. Alternatively, the PSF or some company interested in having these codecs available could fund the development.
All good points. I still think that giving more visibility to the codecs (i.e. adding them to the Python distro) would help bring muscle to the effort.
"MvL" == Martin v Loewis <martin@v.loewis.de> writes:
MvL> I would not recommend to incorporate any of this into Python without asking the author(s). When doing so, it would be appropriate, IMO, to ask them whether they would fill out the contributor agreement. Then, the presumed licensing problems would be gone.
Agreed on both points!
-Barry

I've been working on a unified architecture for the Asian codecs. I presented a paper about it at the last Unicode Conference in Washington D.C. You can find it at http://www.basistech.com/articles/python-zh-transcoding_iuc20_TE2.pdf The presentation concentrates on Chinese, but the architecture will work for JK as well. -tree -- Tom Emerson Basis Technology Corp. Sr. Computational Linguist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"

"Barry A. Warsaw" wrote:
+1. The PSF will have to agree on the contribution docs first, though. Since there's no discussion on the PSF docs discussion list, I suppose everybody is happy with them :-) BTW, I was referring to the other codecs in the python-codecs project on SF. Most of those are encumbered by the GPL and thus unusable in non-GPL projects. Tamito has switched to a BSD-license after some private discussions about this, which is goodness :-) -- Marc-Andre Lemburg

"MAL" == M <mal@lemburg.com> writes:
MAL> The PSF will have to agree on the contribution docs first, though. Since there's no discussion on the PSF docs discussion list, I suppose everybody is happy with them :-)
I am. What do we need to do next?
MAL> BTW, I was referring to the other codecs in the python-codecs project on SF. Most of those are encumbered by the GPL and thus unusable in non-GPL projects.
MAL> Tamito has switched to a BSD-license after some private discussions about this, which is goodness :-)

"Barry A. Warsaw" wrote:
Wait. The deadline is mid-March. After that the docs will have to go to the lawyer and only then we can use them...
Me neither, but Tamito has put a lot of work into them and with his move to C for the codec engine, speed is not an issue anymore either. Also, I've asked him about his thoughts about having them included in the core before. He would be happy with that move. -- Marc-Andre Lemburg

participants (16)
- aahz@rahul.net
- barry@zope.com
- bckfnn@worldonline.dk
- Fredrik Lundh
- Guido van Rossum
- Jack Jansen
- Jason Orendorff
- Jeff Epler
- jepler@unpythonic.dhs.org
- M.-A. Lemburg
- martin@v.loewis.de
- Neal Norwitz
- Samuele Pedroni
- Skip Montanaro
- Tim Peters
- Tom Emerson