After the discussion about #pragmas two weeks ago and some interesting ideas in the direction of source code encodings and ways to implement them, I would like to restart the discussion about encodings in source code and runtime auto-conversions.

Fredrik recently posted patches to the patches list which loosen the currently hard-coded default encoding used throughout the Unicode design and add a layer of abstraction that would make it easy to change the default encoding at some later point. While making things more abstract is certainly a wise thing to do, I am not sure whether this particular case fits into the design decisions made a few months ago.

Here's a short summary of what was discussed recently:

1. Fredrik posted the idea of changing the default encoding from UTF-8
   to Latin-1 (he calls this 8-bit Unicode, which points to the
   motivation behind it: 8-bit strings should behave like 8-bit
   Unicode). His recent patches work in this direction.

2. Fredrik also posted an interesting idea which enables writing Python
   source code in any supported encoding by having the Python tokenizer
   read Py_UNICODE data instead of char data. A preprocessor would take
   care of converting the input to Py_UNICODE; the parser would ensure
   that 8-bit string data gets converted back to char data (using
   e.g. UTF-8 or Latin-1 as the encoding).

3. Regarding the addition of pragmas to allow specifying the source
   code encoding, several possibilities were mentioned:

   - addition of a keyword "pragma" to define pragma dictionaries
   - usage of a "global" as the basis for this
   - adding a new keyword "decl" which also allows defining other
     things such as type information
   - XML-like syntax embedded in Python comments

Some comments:

Ad 1. UTF-8 is used as the basis in many other languages such as Tcl or Perl. It is not an intuitive way of writing strings and causes problems because a single character can span 1-6 bytes. Still, the world seems to be moving in this direction, so going the same way can't be all wrong...

Note that stream IO can be recoded in a way that allows Python to print and read e.g. Latin-1 (see below). The general idea behind the fixed default encoding design was to give all the power to the user, since she ultimately knows best which encoding to use or expect.

Ad 2. I like this idea because it enables writing Unicode-aware programs *in* Unicode... the only problem which remains is again the encoding to use for the classic 8-bit strings.

Ad 3. For 2. to work, the encoding would have to appear close to the top of the file. The preprocessor would have to be BOM-mark aware to tell whether UTF-16 or some ASCII extension is used by the file (a small sketch of such a check is appended at the end of this mail).

Guido asked me for some code which demonstrates Latin-1 recoding using the existing mechanisms. I've attached a simple script to this mail. It is not much tested yet, so please give it a try. You can also change it to use any other encoding you like (a rough sketch of the recoding idea is also appended at the end of this mail).

Together with the Japanese codecs provided by Tamito Kajiyama (http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/tmp/japanese-codecs.tar.gz) you should be able to type Shift-JIS at the raw_input() or interactive prompt, have it stored as UTF-8 and then printed back as Shift-JIS, provided you add a recoder similar to the attached Latin-1 one to your PYTHONSTARTUP or site.py script.

--
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
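
A minimal sketch of the kind of BOM check mentioned under Ad 3. above; the function name and the fallback default are only illustrative, not part of any proposal:

def guess_source_encoding(data, default='ascii'):
    # Look at the first bytes of the file: a UTF-16 byte order mark
    # identifies UTF-16, a UTF-8 signature identifies UTF-8; anything
    # else is assumed to be some ASCII extension (the default argument
    # is just a placeholder for that case).
    if data[:2] in ('\xff\xfe', '\xfe\xff'):
        return 'utf-16'
    if data[:3] == '\xef\xbb\xbf':
        return 'utf-8'
    return default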
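
And a rough sketch of the kind of stdin/stdout recoding described in the last paragraphs, based on codecs.EncodedFile (this is not the attached script itself; the function name and the Latin-1/UTF-8 defaults are only illustrative, and Python 2.x-era byte streams are assumed):

import sys, codecs

def recode_std_streams(term_encoding='latin-1', internal_encoding='utf-8'):
    # Wrap stdin so that bytes typed at the terminal in term_encoding
    # are handed to the program recoded to internal_encoding, and wrap
    # stdout so that data written in internal_encoding is recoded to
    # term_encoding before it reaches the terminal.
    sys.stdin = codecs.EncodedFile(sys.stdin, internal_encoding, term_encoding)
    sys.stdout = codecs.EncodedFile(sys.stdout, internal_encoding, term_encoding)

recode_std_streams()

With something like this in PYTHONSTARTUP or site.py, raw_input() returns UTF-8 data even though the terminal delivers Latin-1, and printing that data sends Latin-1 back to the terminal.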