RE: [Python-Dev] #pragmas in Python source code

Java uses ResourceBundles, which are identified by basename + 2 character locale id (e.g. "en", "fr", etc.). The content of the resource bundle is essentially a dictionary of name/value pairs. MS Visual C++ uses pragma code_page(windows_code_page_id) in resource files to indicate what code page was used to generate the subsequent text. In both cases, an application would rely on a fixed (7-bit ASCII) subset to give the well-known key to find the localized text for the current locale. Any "hardcoded" string literals would be mangled when attempting to display them using an alternate locale.

So essentially, one could take the view that correct support for localization is a runtime issue affecting the user of an application, not the developer. Hence, myfile.py may contain 8-bit string literals encoded in my current Windows encoding (1252) but my user may be using Japanese Windows in code page 932. All I can guarantee is that the first 128 characters (notwithstanding BACKSLASH) will be rendered correctly - other characters will be interpreted as half-width Katakana or worse.

Any literal strings one embeds in code should be purely for the benefit of the code, not for the end user, who should be seeing properly localized text, pulled back from a localized text resource file _NOT_ Python code, and automatically pumped through the appropriate native <--> unicode translations as required by the code.

So to sum up:

1. Hardcoded strings are evil in source code unless they use the invariant ASCII (and by extension UTF-8) character set.
2. A proper localized resource loading mechanism is required to fetch genuine localized text from a static resource file (i.e. not myfile.py).
3. All transformations of 8-bit strings to and from Unicode should explicitly specify the 8-bit encoding for the source/target of the conversion, as appropriate.
4. Assume that a Japanese / Chinese programmer will find it easier to code using the invariant ASCII subset than a Western European / American will be able to read hanzi in source code.

Regards,
Mike da Silva

-----Original Message-----
From: Ka-Ping Yee [mailto:ping@lfw.org]
Sent: Wednesday, April 12, 2000 6:45 PM
To: Fred L. Drake, Jr.
Cc: Python Developers @ python.org
Subject: Re: [Python-Dev] #pragmas in Python source code

On Wed, 12 Apr 2000, Fred L. Drake, Jr. wrote:
Or do we need to separate out two categories of pragmas -- pre-parse and post-parse pragmas?
Eeeks! We don't need too many special forms! That's ugly!
Eek indeed. I'm tempted to suggest we drop the multiple-encoding issue (i can hear the screams now). But you're right, i've never heard of another language that can handle configurable encodings right in the source code. Is it really necessary to tackle that here?

Gak, what do Japanese programmers do? Has anyone seen any of that kind of source code?

-- ?!ng

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://www.python.org/mailman/listinfo/python-dev

Mike wrote:
Any literal strings one embeds in code should be purely for the benefit of the code, not for the end user, who should be seeing properly localized text, pulled back from a localized text resource file _NOT_ python code, and automatically pumped through the appropriate native <--> unicode translations as required by the code.
that's hardly a CP4E compatible solution, is it? Ping wrote:
But you're right, i've never heard of another language that can handle configurable encodings right in the source code.
XML? </F>

On Wed, 12 Apr 2000, Fredrik Lundh wrote:
Ping wrote:
But you're right, i've never heard of another language that can handle configurable encodings right in the source code.
XML?
Don't get me started. XML is not a language. It's a serialization format for trees (isomorphic to s-expressions, but five times more verbose). It has no semantics. Anyone who tries to tell you otherwise is probably a marketing drone or has been brainwashed by the buzzword brigade. -- ?!ng

Ka-Ping Yee writes:
Don't get me started. XML is not a language. It's a serialization
And XML was exactly why I asked about *programming* languages. XML just doesn't qualify in any way I can think of as a language. Unless it's also called "Marketing-speak." ;) XML, as you point out, is a syntactic aspect of tree encoding. Harrumph. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives

Fred L. Drake, Jr. wrote:
Don't get me started. XML is not a language. It's a serialization
And XML was exactly why I asked about *programming* languages. XML just doesn't qualify in any way I can think of as a language.
oh, come on. in what way is "Python source code" more expressive than XML, if you don't have anything that interprets it? does the Python parser create "better" trees than an XML parser?
XML, as you point out, is a syntactic aspect of tree encoding.
just like a Python source file is a syntactic aspect of a Python (parse) tree encoding, right? ;-)

... but back to the real issue -- the point is that XML provides a mechanism for going from an external representation to an internal (unicode) token stream, and that mechanism is good enough for python source code. why invent yet another python-specific wheel?

</F>

Fred L. Drake, Jr. wrote:
And XML was exactly why I asked about *programming* languages. XML just doesn't qualify in any way I can think of as a language.
I'm harumphing right along with you, Fred. :) On Wed, 12 Apr 2000, Fredrik Lundh wrote:
oh, come on. in what way is "Python source code" more expressive than XML, if you don't have anything that interprets it? does the Python parser create "better" trees than an XML parser?
Python isn't just a parse tree. It has semantics. XML has no semantics. It's content-free content. :)
but back to the real issue -- the point is that XML provides a mechanism for going from an external representation to an in- ternal (unicode) token stream, and that mechanism is good enough for python source code.
You have a point. I'll go look at what they do. -- ?!ng

Ka-Ping Yee wrote: Python isn't just a parse tree. It has semantics. XML has no semantics. It's content-free content. :)
Python doesn't even have a parse tree (never mind semantics) unless you have a Python parser handy. XML gives my application a way to parse your information, even if I can't understand it, which is a big step over (for example) comments or strings embedded in Python/Perl/Java source files, colon (or is it semi-colon?) separated lists in .ini and .rc files, etc.

(I say this having wrestled with large Fortran programs in which a sizeable fraction of the semantics was hidden in comment-style pragmas. Having seen the demands this style of coding places on compilers, and compiler writers, I'm willing to walk barefoot through the tundra to get something more structured. Hanging one of Barry's doc dicts off a module ensures that key information is part of the parse tree, and that anyone who wants to extend the mechanism can do so in a structured way. I'd still rather have direct embedding of XML, but I think doc dicts are still a big step forward.)

Greg

p.s. This has come up as a major issue in the Software Carpentry competition. On the one hand, you want (the equivalent of) makefiles to be language neutral, so that (for example) you can write processors in Perl and Java as well as Python. On the other hand, you want to have functions, lists, and all the other goodies associated with a language.
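(Editorial illustration, not part of the original thread: a minimal sketch of what a module-level "doc dict" of the kind Greg mentions might look like. The name __info__ and the keys are invented here purely for illustration; the thread never fixes a convention.)

    # mymodule.py -- hypothetical module carrying its metadata in a
    # plain dictionary literal, so the information is visible to the
    # parser (and to any tool that walks the parse tree or imports it)
    __info__ = {
        "version": "1.6",
        "encoding": "iso-8859-1",   # the kind of per-file fact being debated
        "author": "somebody",
    }

    def greet():
        return "hello"

A tool could then read the metadata without any comment-parsing tricks, e.g. import mymodule; mymodule.__info__["encoding"].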

Ka-Ping Yee wrote:
XML?
Don't get me started. XML is not a language. It's a serialization format for trees (isomorphic to s-expressions, but five times more verbose).
call it whatever you want -- my point was that their way of handling configurable encodings in the source code is good enough for python.

(briefly, it's all unicode on the inside, and either ASCII/UTF-8 or something compatible enough to allow the parser to find the "encoding" attribute without too much trouble... except for the default encoding, the same approach should work for python)

</F>

Ka-Ping Yee <ping@lfw.org>:
XML?
Don't get me started. XML is not a language. It's a serialization format for trees (isomorphic to s-expressions, but five times more verbose). It has no semantics. Anyone who tries to tell you otherwise is probably a marketing drone or has been brainwashed by the buzzword brigade.
Heh. What he said. Squared. Describing XML as a "language" around an old-time LISPer like me (or a new-time one like Ping) is a damn good way to get your eyebrows singed. -- <a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a> "...quemadmodum gladius neminem occidit, occidentis telum est." [...a sword never kills anybody; it's a tool in the killer's hand.] -- (Lucius Annaeus) Seneca "the Younger" (ca. 4 BC-65 AD),

Well, as long as everyone else is going to be off-topic:

What definition of "language" are you using? And while you're at it, what definition of "semantics" are you using? As I recall, a string is an ordered list of symbols and a language is an unordered set of strings. I know that Ka-Ping, despite going to a great university, was in Engineering, not computer science, so I'll excuse him for not knowing the Chomskian definition of language, :), but what's your excuse, Eric?

Most XML people will happily admit that XML has no "semantics" but I think that's bullshit too. The mapping from the string to the abstract tree data model *is the semantic content* of the XML specification. Yes, it is a brain-dead simple mapping and so the semantic structure provided by the XML specification is minimal...but that's the whole point. It's supposed to be simple. It's supposed to not get in the way of higher level semantics.

It makes as little sense to reject XML out of hand because it is a buzzword but is not innovative as it does for people to embrace it mystically because it is Microsoft's flavor of the week. XML takes simple ideas from the Lisp and document processing communities and popularizes them so that they can achieve economies of scale. It sounds exactly like the relationship between Lisp and Python to me...

By the way, what data model or text encoding is NOT isomorphic to Lisp S-expressions? Isn't Python code isomorphic to Lisp s-expressions?

Paul Prescod

I'll begin with my conclusion, so you can get the high-order bit and skip the rest of the message if you like: XML is useful, but it's not a language. On Thu, 13 Apr 2000, Paul Prescod wrote:
What definition of "language" are you using? And while you're at it, what definition of "semantics" are you using?
As I recall, a string is an ordered list of symbols and a language is an unordered set of strings.
I use the word "language" to mean an expression medium that carries semantics at a usefully high level. The computer-science definition you gave would include, say, the set of all pairs of integers (not a language to most people), but not include classical music [1] (indeed a language to many people). I admit that the boundary of "enough" semantics to qualify as a language is fuzzy, but some things do seem quite clearly to fall on either side of the line for me. For example, saying that XML has semantics is roughly equivalent to saying that ASCII has semantics. Well, sure, i suppose 65 has the "semantics" of the uppercase letter A, but that's not semantics at any level high enough to be useful. That is why most people would probably not normally call ASCII a "language". It has to express something to be a language. Granted, you can pick nits and say that XML has semantics as you did, but to me that essentially amounts to calling the syntax the semantics.
I know that Ka-Ping, despite going to a great university was in Engineering, not computer science
Cute. (I'm glad i was in engineering; at least we got a little design and software engineering background there, and i didn't see much of that in CS, unfortunately.)
Most XML people will happily admit that XML has no "semantics" but I think that's bullshit too. The mapping from the string to the abstract tree data model *is the semantic content* of the XML specification.
Okay, fine. Technically, it has semantics; they're just very minimal semantics (so minimal that i felt quite comfortable in saying that it has none). But that doesn't change my point -- for "it has no semantics and therefore doesn't qualify as a language" just read "it has far too minimal semantics to qualify as a language".
It makes as little sense to reject XML out of hand because it is a buzzword but is not innovative as it does for people to embrace it mystically because it is Microsoft's flavor of the week.
Before you get the wrong impression, i don't intend to reject XML out of hand, or to suggest that people do. It has its uses, just as ASCII has its uses. As a way of serializing trees, it's quite acceptable. I am, however, reacting to the sudden onslaught of hype that gives people the impression that XML can do anything. It's this sort of attitude that "oh, all of our representation problems will go away if we throw XML at it" that makes me cringe; that's only avoiding the issue. (I'm not saying that you are this clueless, Paul! -- just that some people seem to be.) As long as we recognize XML as exactly what it is, no more and no less -- a generic mechanism for serializing trees, with associated technologies for manipulating those trees -- there's no problem.
By the way, what data model or text encoding is NOT isomorphic to Lisp S-expressions? Isn't Python code isomorphic to Lisp s-expessions?
No! You can run Python code. The code itself, of course, can be interpreted as a stream of bytes, or arranged into a tree of LISP s-expressions. But if s-expressions were *all* that constituted Python, Python would be pretty useless indeed! The entity we call Python includes real content: the rules for deriving the expected behaviour of a Python program from its parse tree, as variously specified in the reference manual, the library manual, and in our heads.

LISP itself is a great deal more than just s-expressions. The language system specifies the behaviour you expect from a given piece of LISP code, and *that* is the part i call semantics.

    "real" semantics:          Python   LISP    English    MIDI
    minimal or no semantics:   ASCII    lists   alphabet   bytes

The things in the top row are generally referred to as "languages"; the things in the bottom row are not. Although each thing in the top row is constructed from its corresponding thing in the bottom row, the difference between the two is what i am calling "semantics". If the top row says A and the bottom row says B, you can look at the B-type things that constitute the A and say, "if you see this particular B, it means foo". XML belongs in the bottom row, not the top row.

    Python: "If you see 'a = 3' in a function, it means you take the
    integer object 3 and bind it to the name 'a' in the local namespace."

    XML: "If you see the tag <spam eggs=boiled>, it means... well...
    uh, nothing. Sorry. But you do get to decide that 'spam' and 'eggs'
    and 'boiled' mean whatever you want."

That is why i am unhappy with XML being referred to as a "language": it is a misleading label that encourages people to make the mistake of imagining that XML has more semantic power than it really does. Why is this a fatal mistake? Because using XML will no more solve your information interchange problems than writing Japanese using the Roman alphabet will suddenly cause English speakers to be able to read Japanese novels. It may *help*, but there's a lot more to it than serialization.

Thus: XML is useful, but it's not a language.

And, since that reasonably summarizes my views on the issue, i'll say no more on this topic on the python-dev list -- any further blabbing i'll do in private e-mail.

-- ?!ng

"In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand."
    -- Gerald Holton

[1] I anticipate an objection such as "but you can encode a piece of classical music as accurately as you like as a sequence of symbols." But the music itself doesn't fit the Chomskian definition of "language" until you add that symbolic mapping and the rules to arrange those symbols in sequence. At that point the thing you've just added *is* the language: it's the mapping from symbols to the semantics of e.g. "and at time 5.36 seconds the first violinist will play an A-flat at medium volume".

Let's presume that we agreed that XML is not a language because it doesn't have semantics. What does that have to do with the applicability of its Unicode-handling model? Here is a list of a hundred specifications which we can probably agree have "useful semantics" that are all based on XML and thus have the same Unicode model:

http://www.xml.org/xmlorg_registry/index.shtml

XML's unicode model seems mostly appropriate to me. I can only see one reason it might not apply: which comes first, the #! line or the #encoding line? We could say that the #! line can only be used in encodings that are direct supersets of ASCII (e.g. UTF-8 but not UTF-16). That shouldn't cause any problems with Unix because as far as I know, Unix can only read the first line if it is in an ASCII superset anyhow! Then the second line could describe the precise ASCII superset in use (8859-1, 8859-2, UTF-8, raw ASCII, etc.).

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
When George Bush entered office, a Washington Post-ABC News poll found
that 62 percent of Americans "would be willing to give up a few of the
freedoms we have" for the war effort. They have gotten their wish.
    - "This is your bill of rights...on drugs", Harpers, Dec. 1999

Paul Prescod wrote:
Let's presume that we agreed that XML is not a language because it doesn't have semantics. What does that have to do with the applicability of its Unicode-handling model?
Here is a list of a hundred specifications which we can probably agree have "useful semantics" that are all based on XML and thus have the same Unicode model:
http://www.xml.org/xmlorg_registry/index.shtml
XML's unicode model seems mostly appropriate to me. I can only see one reason it might not apply: which comes first, the #! line or the #encoding line? We could say that the #! line can only be used in encodings that are direct supersets of ASCII (e.g. UTF-8 but not UTF-16). That shouldn't cause any problems with Unix because as far as I know, Unix can only read the first line if it is in an ASCII superset anyhow!
Then the second line could describe the precise ASCII superset in use (8859-1, 8859-2, UTF-8, raw ASCII, etc.).
Sounds like a good idea... what would such a line look like?

#!/usr/bin/env python
# version: 1.6, encoding: iso-8859-1
...

Meaning: the module script needs Python version >= 1.6 and uses iso-8859-1 as its source file encoding.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
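(Editorial sketch, not from the thread: one way a tool might pick such a declaration out of the first two lines of a file, written in present-day Python. The comment format and the function name are assumptions for illustration only; the actual syntax is exactly what is being discussed here.)

    import re

    # hypothetical pattern for a declaration like
    #   # version: 1.6, encoding: iso-8859-1
    _ENCODING_DECL = re.compile(r"encoding[:=]\s*([-\w.]+)", re.IGNORECASE)

    def guess_source_encoding(path, default="ascii"):
        """Return the encoding declared in the first two lines, if any."""
        with open(path, "rb") as f:
            for _ in range(2):
                line = f.readline().decode("ascii", "replace")
                if line.startswith("#"):
                    match = _ENCODING_DECL.search(line)
                    if match:
                        return match.group(1)
        return default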

[Ping]
But you're right, i've never heard of another language that can handle configurable encodings right in the source code.
[The eff-bot]
XML?
[Ping]
Don't get me started. XML is not a language. It's a serialization format for trees (isomorphic to s-expressions, but five times more verbose). It has no semantics. Anyone who tries to tell you otherwise is probably a marketing drone or has been brainwashed by the buzzword brigade.
Of course, but "everything is a tree". If you put Python in XML by having the parse-tree serialized, then you can handle any encoding in the source file, by snarfing it from XML.

not-in-favour-of-Python-in-XML-but-this-is-sure-to-encourage-Greg-Wilson-ly y'rs, Z.
-- 
Moshe Zadka <mzadka@geocities.com>.
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com

I think we should put the discussion back on track again...

We were originally talking about proposals to integrate #pragmas into Python source. These pragmas are (for now) intended to provide information to the Python byte code compiler, so that it can make certain assumptions on a per file basis. So far, there have been numerous proposals for all kinds of declarations and decorations of files, functions, methods, etc. As usual in Python Space, things got generalized to a point where people forgot about the original intent ;-)

The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings..."). The idea behind this is that programmers should be able to use other encodings here than the default "unicode-escape" one.

Perhaps someone has a better idea on how to signify this to the compiler? Could be that we don't need this pragma discussion at all if there is a different, more elegant solution to this...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/

M.-A. Lemburg wrote:
The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings...").
why not? why keep on pretending that strings and strings are two different things? it's an artificial distinction, and it only causes problems all over the place.
Could be that we don't need this pragma discussion at all if there is a different, more elegant solution to this...
here's one way:

1. standardize on *unicode* as the internal character set. use an encoding marker to specify what *external* encoding you're using for the *entire* source file. output from the tokenizer is a stream of *unicode* strings.

2. if the user tries to store a unicode character larger than 255 in an 8-bit string, raise an OverflowError.

3. the default encoding is "none" (instead of XML's "utf-8"). in this case, treat the script as an ascii superset, and store each string literal as is (character-wise, not byte-wise).

additional notes:

-- item (3) is for backwards compatibility only. might be okay to change this in Py3K, but not before that.

-- leave the implementation of (1) to 1.7. for now, assume that scripts have the default encoding, which means that (2) cannot happen.

-- we still need an encoding marker for ascii supersets (how about <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to the tokenizer to detect that one, not the parser. the parser only sees unicode strings.

</F>
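(Editorial sketch, not from the thread: a rough illustration of item (2) in present-day Python, treating the "narrowing" of a unicode string into an 8-bit string like an integer overflow check. The function name is made up for illustration.)

    def narrow_to_8bit(text):
        """Store unicode text in an 8-bit string, or fail loudly."""
        for ch in text:
            if ord(ch) > 255:
                raise OverflowError(
                    "character %r does not fit in an 8-bit string" % ch)
        return text.encode("latin-1")   # every code point is <= 255 here

    narrow_to_8bit(u"abc\xe9")      # fine: all characters fit
    # narrow_to_8bit(u"\u1234")     # would raise OverflowError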

Fredrik Lundh writes:
-- item (3) is for backwards compatibility only. might be okay to change this in Py3K, but not before that.
-- leave the implementation of (1) to 1.7. for now, assume that scripts have the default encoding, which means that (2) cannot happen.
We shouldn't need to change it then; Unicode editing capabilities will be pervasive by then, right? Oh, heck, it might even be legacy support by then! ;) Seriously, I'd hesitate to change any interpretation of default encoding until Unicode support is pervasive and fully automatic in tools like Notepad, vi/vim, XEmacs, and BBedit/Alpha (or whatever people use on MacOS these days). If I can't use teco on it, we're being too pro-active! ;)
-- we still need an encoding marker for ascii supersets (how about <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to the tokenizer to detect that one, not the parser. the parser only sees unicode strings.
Agreed here. But shouldn't that be: <?python version="1.6" encoding="utf-8"?> This is war, I tell you, war! ;) Now, just need to hack the exec(2) call on all the Unices so that <?python version="..." ...?> is properly recognized and used to run the scripts properly, obviating the need for those nasty shbang lines! ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives

Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings...").
why not?
Because plain old 8-bit strings should work just as before, that is, existing scripts only using 8-bit strings should not break.
why keep on pretending that strings and strings are two different things? it's an artificial distinction, and it only causes problems all over the place.
Sure. The point is that we can't just drop the old 8-bit strings... not until Py3K at least (and as Fred already said, all standard editors will have native Unicode support by then). So for now we're stuck with Unicode *and* 8-bit strings and have to make the two meet somehow -- which isn't all that easy, since 8-bit strings carry no encoding information.
Could be that we don't need this pragma discussion at all if there is a different, more elegant solution to this...
here's one way:
1. standardize on *unicode* as the internal character set. use an encoding marker to specify what *external* encoding you're using for the *entire* source file. output from the tokenizer is a stream of *unicode* strings.
Yep, that would work in Py3K...
2. if the user tries to store a unicode character larger than 255 in an 8-bit string, raise an OverflowError.
There are no 8-bit strings in Py3K -- only 8-bit data buffers which don't have string methods ;-)
3. the default encoding is "none" (instead of XML's "utf-8"). in this case, treat the script as an ascii superset, and store each string literal as is (character-wise, not byte-wise).
Uhm. I think UTF-8 will be the standard for text file formats by then... so why not make it UTF-8 ?
additional notes:
-- item (3) is for backwards compatibility only. might be okay to change this in Py3K, but not before that.
-- leave the implementation of (1) to 1.7. for now, assume that scripts have the default encoding, which means that (2) cannot happen.
I'd say, leave all this to Py3K.
-- we still need an encoding marker for ascii supersets (how about <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to the tokenizer to detect that one, not the parser. the parser only sees unicode strings.
Hmm, the tokenizer doesn't do any string -> object conversion. That's a task done by the parser. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings...").
why not?
Because plain old 8-bit strings should work just as before, that is, existing scripts only using 8-bit strings should not break.
but they won't -- if you don't use an encoding directive, and don't use 8-bit characters in your string literals, everything works as before. (that's why the default is "none" and not "utf-8") if you use 8-bit characters in your source code and wish to add an encoding directive, you need to add the right encoding directive...
why keep on pretending that strings and strings are two different things? it's an artificial distinction, and it only causes problems all over the place.
Sure. The point is that we can't just drop the old 8-bit strings... not until Py3K at least (and as Fred already said, all standard editors will have native Unicode support by then).
I discussed that in my original "all characters are unicode characters" proposal. in my proposal, the standard string type will have two roles: a string either contains unicode characters, or binary bytes.

-- if it contains unicode characters, python guarantees that methods like strip, lower (etc), and regular expressions work as expected.

-- if it contains binary data, you can still use indexing, slicing, find, split, etc. but they then work on bytes, not on chars.

it's still up to the programmer to keep track of what a certain string object is (a real string, a chunk of binary data, an encoded string, a jpeg image, etc). if the programmer wants to convert between a unicode string and an external encoding to use a certain unicode encoding, she needs to spell it out. the codecs are never called "under the hood".

(note that if you encode a unicode string into some other encoding, the result is a binary buffer. operations like strip, lower et al do *not* work on encoded strings).
So for now we're stuck with Unicode *and* 8-bit strings and have to make the two meet somehow -- which isn't all that easy, since 8-bit strings carry no encoding information.
in my proposal, both string types hold unicode strings. they don't need to carry any encoding information, because they're not encoded.
Could be that we don't need this pragma discussion at all if there is a different, more elegant solution to this...
here's one way:
1. standardize on *unicode* as the internal character set. use an encoding marker to specify what *external* encoding you're using for the *entire* source file. output from the tokenizer is a stream of *unicode* strings.
Yep, that would work in Py3K...
or 1.7 -- see below.
2. if the user tries to store a unicode character larger than 255 in an 8-bit string, raise an OverflowError.
There are no 8-bit strings in Py3K -- only 8-bit data buffers which don't have string methods ;-)
oh, you've seen the Py3K specification?
3. the default encoding is "none" (instead of XML's "utf-8"). in this case, treat the script as an ascii superset, and store each string literal as is (character-wise, not byte-wise).
Uhm. I think UTF-8 will be the standard for text file formats by then... so why not make it UTF-8 ?
in time for 1.6? or you mean Py3K? sure! I said that in my first "additional note", didn't I:
additional notes:
-- item (3) is for backwards compatibility only. might be okay to change this in Py3K, but not before that.
-- leave the implementation of (1) to 1.7. for now, assume that scripts have the default encoding, which means that (2) cannot happen.
I'd say, leave all this to Py3K.
do you mean it's okay to settle for a broken design in 1.6, since we can fix it in Py3K? that's scary.

fixing the design is not that hard, and can be done now. implementing all parts of it is harder, and requires extensive changes to the compiler/interpreter architecture. but iirc, such changes are already planned for 1.7...
-- we still need an encoding marker for ascii supersets (how about <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to the tokenizer to detect that one, not the parser. the parser only sees unicode strings.
Hmm, the tokenizer doesn't do any string -> object conversion. That's a task done by the parser.
"unicode string" meant Py_UNICODE*, not PyUnicodeObject. if the tokenizer does the actual conversion doesn't really matter; the point is that once the code has passed through the tokenizer, it's unicode. </F>

Fredrik Lundh wrote:
M.-A. Lemburg wrote:
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings...").
why not?
Because plain old 8-bit strings should work just as before, that is, existing scripts only using 8-bit strings should not break.
but they won't -- if you don't use an encoding directive, and don't use 8-bit characters in your string literals, everything works as before.
(that's why the default is "none" and not "utf-8")
if you use 8-bit characters in your source code and wish to add an encoding directive, you need to add the right encoding directive...
Fair enough, but this would render all the auto-coercion code currently in 1.6 useless -- all string to Unicode conversions would have to raise an exception.
why keep on pretending that strings and strings are two different things? it's an artificial distinction, and it only causes problems all over the place.
Sure. The point is that we can't just drop the old 8-bit strings... not until Py3K at least (and as Fred already said, all standard editors will have native Unicode support by then).
I discussed that in my original "all characters are unicode characters" proposal. in my proposal, the standard string type will have to roles: a string either contains unicode characters, or binary bytes.
-- if it contains unicode characters, python guarantees that methods like strip, lower (etc), and regular expressions work as expected.
-- if it contains binary data, you can still use indexing, slicing, find, split, etc. but they then work on bytes, not on chars.
it's still up to the programmer to keep track of what a certain string object is (a real string, a chunk of binary data, an encoded string, a jpeg image, etc). if the programmer wants to convert between a unicode string and an external encoding to use a certain unicode encoding, she needs to spell it out. the codecs are never called "under the hood".
(note that if you encode a unicode string into some other encoding, the result is binary buffer. operations like strip, lower et al does *not* work on encoded strings).
Huh ? If the programmer already knows that a certain string uses a certain encoding, then he can just as well convert it to Unicode by hand using the right encoding name.

The whole point we are talking about here is that when having the implementation convert a string to Unicode all by itself it needs to know which encoding to use. This is where we have decided long ago that UTF-8 should be used.

The pragma discussion is about a totally different issue: pragmas could make it possible for the programmer to tell the *compiler* which encoding to use for literal u"unicode" strings -- nothing more. Since "8-bit" strings currently don't have an encoding attached to them we store them as-is.

I don't want to get into designing a completely new character container type here... this can all be done for Py3K, but not now -- it breaks things at too many ends (even though it would solve the issues with strings being used in different contexts).
-- we still need an encoding marker for ascii supersets (how about <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to the tokenizer to detect that one, not the parser. the parser only sees unicode strings.
Hmm, the tokenizer doesn't do any string -> object conversion. That's a task done by the parser.
"unicode string" meant Py_UNICODE*, not PyUnicodeObject.
if the tokenizer does the actual conversion doesn't really matter; the point is that once the code has passed through the tokenizer, it's unicode.
The tokenizer would have to know which parts of the input string to convert to Unicode and which not... plus there are different encodings to be applied, e.g. UTF-8, Unicode-Escape, Raw-Unicode-Escape, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg <mal@lemburg.com> wrote:
but they won't -- if you don't use an encoding directive, and don't use 8-bit characters in your string literals, everything works as before.
(that's why the default is "none" and not "utf-8")
if you use 8-bit characters in your source code and wish to add an encoding directive, you need to add the right encoding directive...
Fair enough, but this would render all the auto-coercion code currently in 1.6 useless -- all string to Unicode conversions would have to raise an exception.
I thought it was rather clear by now that I think the auto-conversion stuff *is* useless... but no, that doesn't mean that all string to unicode conversions need to raise exceptions -- any 8-bit unicode character obviously fits into a 16-bit unicode character, just like any integer fits in a long integer.

if you convert the other way, you might get an OverflowError, just like converting from a long integer to an integer may give you an exception if the long integer is too large to be represented as an ordinary integer. after all,

    i = int(long(v))

doesn't always raise an exception...
why keep on pretending that strings and strings are two different things? it's an artificial distinction, and it only causes problems all over the place.
Sure. The point is that we can't just drop the old 8-bit strings... not until Py3K at least (and as Fred already said, all standard editors will have native Unicode support by then).
I discussed that in my original "all characters are unicode characters" proposal. in my proposal, the standard string type will have to roles: a string either contains unicode characters, or binary bytes.
-- if it contains unicode characters, python guarantees that methods like strip, lower (etc), and regular expressions work as expected.
-- if it contains binary data, you can still use indexing, slicing, find, split, etc. but they then work on bytes, not on chars.
it's still up to the programmer to keep track of what a certain string object is (a real string, a chunk of binary data, an encoded string, a jpeg image, etc). if the programmer wants to convert between a unicode string and an external encoding to use a certain unicode encoding, she needs to spell it out. the codecs are never called "under the hood".
(note that if you encode a unicode string into some other encoding, the result is binary buffer. operations like strip, lower et al does *not* work on encoded strings).
Huh ? If the programmer already knows that a certain string uses a certain encoding, then he can just as well convert it to Unicode by hand using the right encoding name.
I thought that was what I said, but the text was garbled. let's try again: if the programmer wants to convert between a unicode string and a buffer containing encoded text, she needs to spell it out. the codecs are never called "under the hood"
The whole point we are talking about here is that when having the implementation convert a string to Unicode all by itself it needs to know which encoding to use. This is where we have decided long ago that UTF-8 should be used.
does "long ago" mean that the decision cannot be questioned? what's going on here? face it, I don't want to guess when and how the interpreter will convert strings for me. after all, this is Python, not Perl. if I want to convert from a "string of characters" to a byte buffer using a certain character encoding, let's make that explicit. Python doesn't convert between other data types for me, so why should strings be a special case?
The pragma discussion is about a totally different issue: pragmas could make it possible for the programmer to tell the *compiler* which encoding to use for literal u"unicode" strings -- nothing more. Since "8-bit" strings currently don't have an encoding attached to them we store them as-is.
what do I have to do to make you read my proposal? shout? okay, I'll try: THERE SHOULD BE JUST ONE INTERNAL CHARACTER SET IN PYTHON 1.6: UNICODE. for consistency, let this be true for both 8-bit and 16-bit strings (as well as Py3K's 31-bit strings ;-). there are many possible external string encodings, just like there are many possible external integer encodings. but for integers, that's not something that the core implementation cares much about. why are strings different?
I don't want to get into designing a completely new character container type here... this can all be done for Py3K, but not now -- it breaks things at too many ends (even though it would solve the issues with strings being used in different contexts).
you don't need to -- you only need to define how the *existing* string type should be used. in my proposal, it can be used in two ways:

-- as a string of unicode characters (restricted to the 0-255 subset, for obvious reasons). given a string 's', len(s) is always the number of characters, s[i] is the i'th character, etc.

or

-- as a buffer containing binary bytes. given a buffer 'b', len(b) is always the number of bytes, b[i] is the i'th byte, etc.

this is one flavour less than in the 1.6 alphas -- where strings sometimes contain UTF-8 (and methods like upper etc don't work), sometimes an 8-bit character set (and upper works), and sometimes binary buffers (for which upper doesn't work).

(hmm. I've said all this before, haven't I?)
-- we still need an encoding marker for ascii supersets (how about <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to the tokenizer to detect that one, not the parser. the parser only sees unicode strings.
Hmm, the tokenizer doesn't do any string -> object conversion. That's a task done by the parser.
"unicode string" meant Py_UNICODE*, not PyUnicodeObject.
if the tokenizer does the actual conversion doesn't really matter; the point is that once the code has passed through the tokenizer, it's unicode.
The tokenizer would have to know which parts of the input string to convert to Unicode and which not... plus there are different encodings to be applied, e.g. UTF-8, Unicode-Escape, Raw-Unicode-Escape, etc.
sigh. why do you insist on taking a very simple thing and making it very very complicated? will anyone out there ever use an editor that supports different encodings for different parts of the file?

why not just assume that the *ENTIRE SOURCE FILE* uses a single encoding, and let the tokenizer (or more likely, a conversion stage before the tokenizer) convert the whole thing to unicode. let the rest of the compiler work on Py_UNICODE* strings only, and all your design headaches will just disappear.

...

frankly, I'm beginning to feel like John Skaller. do I have to write my own interpreter to get this done right? :-(

</F>
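(Editorial sketch, not from the thread: what a "convert the whole thing before the tokenizer" stage might amount to, in present-day Python. The function name and the way the encoding is obtained are assumptions; the only point illustrated is that decoding happens once, for the entire file.)

    def decode_source(path, encoding):
        """Decode an entire source file in one go; everything downstream
        (tokenizer, parser, compiler) then sees only unicode text."""
        with open(path, "rb") as f:
            raw = f.read()
        return raw.decode(encoding)

    # e.g. text = decode_source("myfile.py", "iso-8859-1")   # hypothetical file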

Fredrik Lundh writes:
if the programmer wants to convert between a unicode string and a buffer containing encoded text, she needs to spell it out. the codecs are never called "under the hood"
Watching the successive weekly Unicode patchsets, each one fixing some obscure corner case that turned out to be buggy -- '%s' % ustr, concatenating literals, int()/float()/long(), comparisons -- I'm beginning to agree with Fredrik. Automatically making Unicode strings and regular strings interoperate looks like it requires many changes all over the place, and I worry if it's possible to catch them all in time.

Maybe we should consider being more conservative, and just having the Unicode built-in type, the unicode() built-in function, and the u"..." notation, and then leaving all responsibility for conversions up to the user. On the other hand, *some* default conversion seems needed, because it seems draconian to make open(u"abcfile") fail with a TypeError.

(While I want to see Python 1.6 expedited, I'd also not like to see it saddled with a system that proves to have been a mistake, or one that's a maintenance burden. If forced to choose between delaying and getting it right, the latter wins.)
why not just assume that the *ENTIRE SOURCE FILE* uses a single encoding, and let the tokenizer (or more likely, a conversion stage before the tokenizer) convert the whole thing to unicode.
To reinforce Fredrik's point here, note that XML only supports encodings at the level of an entire file (or external entity). You can't tell an XML parser that a file is in UTF-8, except for this one element whose contents are in Latin1. -- A.M. Kuchling http://starship.python.net/crew/amk/ Dream casts a human shadow, when it occurs to him to do so. -- From SANDMAN: "Season of Mists", episode 0

"Andrew M. Kuchling" wrote:
why not just assume that the *ENTIRE SOURCE FILE* uses a single encoding, and let the tokenizer (or more likely, a conversion stage before the tokenizer) convert the whole thing to unicode.
To reinforce Fredrik's point here, note that XML only supports encodings at the level of an entire file (or external entity). You can't tell an XML parser that a file is in UTF-8, except for this one element whose contents are in Latin1.
Hmm, this would mean that someone who writes:

"""
#pragma script-encoding utf-8
u = u"\u1234"
print u
"""

would suddenly see "\u1234" as output.

If that's ok, fine with me... it would make things easier on the compiler side (even though I'm pretty sure that people won't like this).

BTW: I will be offline for the next week... I'm looking forward to where this discussion will be heading.

Have fun,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/

M.-A. Lemburg wrote:
To reinforce Fredrik's point here, note that XML only supports encodings at the level of an entire file (or external entity). You can't tell an XML parser that a file is in UTF-8, except for this one element whose contents are in Latin1.
Hmm, this would mean that someone who writes:
""" #pragma script-encoding utf-8
u = u"\u1234" print u """
would suddenly see "\u1234" as output.
not necessarily. consider this XML snippet:

<?xml version='1.0' encoding='utf-8'?>
<body>&#4660;</body>

if I run this through an XML parser and write it out as UTF-8, I get:

<body>á^´</body>

in other words, the parser processes "&#4660;" after decoding to unicode, not before. I see no reason why Python cannot do the same.

</F>
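(Editorial sketch, not from the thread: the same ordering applied to MAL's example, in present-day Python. Decode the file's bytes first, then interpret the escape sequence inside the already-unicode literal text, so u"\u1234" still denotes one character rather than six.)

    raw = b'u = u"\\u1234"\n'               # pretend these are the file's bytes
    declared_encoding = "utf-8"             # taken from the pragma / marker
    source = raw.decode(declared_encoding)  # step 1: whole file -> unicode

    literal_body = source.split('"')[1]     # the six characters \ u 1 2 3 4
    value = literal_body.encode("ascii").decode("unicode_escape")  # step 2
    print(len(value), hex(ord(value)))      # 1 0x1234 -- one character, U+1234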

I can see the dilemma, but...
Maybe we should consider being more conservative, and just having the Unicode built-in type, the unicode() built-in function, and the u"..." notation, and then leaving all responsibility for conversions up to the user.
Win32 and COM has been doing exactly this for the last couple of years. And it sucked.
On the other hand, *some* default conversion seems needed, because it seems draconian to make open(u"abcfile") fail with a TypeError.
For exactly this reason. The end result is that the first thing you ever do with a Unicode object is convert it to a string.
(While I want to see Python 1.6 expedited, I'd also not like to see it saddled with a system that proves to have been a mistake, or one that's a maintenance burden. If forced to choose between delaying and getting it right, the latter wins.)
Agreed. I thought this implementation stemmed from Guido's desire to do it this way in the 1.x family, and move towards Fredrik's proposal for Py3k.

As a general comment: I'm a little confused and disappointed here. We are all bickering like children while our parents are away. All we are doing is creating a _huge_ pile of garbage for Guido to ignore when he returns.

We are going to be presenting Guido with around 400 messages at my estimate. He can't possibly read them all. So the end result is that all the posturing and flapping going on here is for naught, and he is just going to do whatever he wants anyway - as he always has done, and as has worked so well for Python.

Sheesh - we should all consider how we can be the most effective, not the most loud or aggressive!

Mark.

Mark Hammond wrote:
I thought this implementation stemmed from Guido's desire to do it this way in the 1.x family, and move towards Fredrik's proposal for Py3k.
Right. Let's do this step by step and get some experience first. With that gained experience we can still polish up the design towards a compromise which best suits all our needs. The integration of Unicode into Python is comparable to the addition of floats to an interpreter which previously only understood integers -- things are obviously going to be a little different than before. Our goal should be to make it as painless as possible and at least IMHO this can only be achieved by gaining practical experience in this new field first. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
Right. Let's do this step by step and get some experience first. With that gained experience we can still polish up the design towards a compromise which best suits all our needs.
so practical experience from other languages, other designs, and playing with the python alphas doesn't count?
The integration of Unicode into Python is comparable to the addition of floats to an interpreter which previously only understood integers.
use "long integers" instead of "floats", and you'll get closer to the actual case. but where's the problem? python has solved this problem for numbers, and what's more important: the language reference tells us how strings are supposed to work: "The items of a string are characters." (see previous mail) "Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters." this solves most of the issues. to handle the rest, look at the language reference description of integer: [Integers] represent elements from the mathematical set of whole numbers. Borrowing the "elements from a single set" concept, define characters as Characters represent elements from the unicode character set. and let all mixed-string operations use string coercion, just like numbers. can it be much simpler? </F>

Fredrik Lundh wrote:
M.-A. Lemburg <mal@lemburg.com> wrote:
but they won't -- if you don't use an encoding directive, and don't use 8-bit characters in your string literals, everything works as before.
(that's why the default is "none" and not "utf-8")
if you use 8-bit characters in your source code and wish to add an encoding directive, you need to add the right encoding directive...
Fair enough, but this would render all the auto-coercion code currently in 1.6 useless -- all string to Unicode conversions would have to raise an exception.
I thought it was rather clear by now that I think the auto-conversion stuff *is* useless...
but no, that doesn't mean that all string to unicode conversions need to raise exceptions -- any 8-bit unicode character obviously fits into a 16-bit unicode character, just like any integer fits in a long integer.
if you convert the other way, you might get an OverflowError, just like converting from a long integer to an integer may give you an exception if the long integer is too large to be represented as an ordinary integer. after all,
i = int(long(v))
doesn't always raise an exception...
This is exactly the same as proposing to change the default encoding to Latin-1. I don't have anything against that (being a native Latin-1 user :), but I would assume that other native language writers surely do: e.g. all programmers not using Latin-1 as their native encoding (and there are lots of them).
why keep on pretending that strings and strings are two different things? it's an artificial distinction, and it only causes problems all over the place.
Sure. The point is that we can't just drop the old 8-bit strings... not until Py3K at least (and as Fred already said, all standard editors will have native Unicode support by then).
I discussed that in my original "all characters are unicode characters" proposal. in my proposal, the standard string type will have to roles: a string either contains unicode characters, or binary bytes.
-- if it contains unicode characters, python guarantees that methods like strip, lower (etc), and regular expressions work as expected.
-- if it contains binary data, you can still use indexing, slicing, find, split, etc. but they then work on bytes, not on chars.
it's still up to the programmer to keep track of what a certain string object is (a real string, a chunk of binary data, an encoded string, a jpeg image, etc). if the programmer wants to convert between a unicode string and an external encoding to use a certain unicode encoding, she needs to spell it out. the codecs are never called "under the hood".
(note that if you encode a unicode string into some other encoding, the result is binary buffer. operations like strip, lower et al does *not* work on encoded strings).
Huh ? If the programmer already knows that a certain string uses a certain encoding, then he can just as well convert it to Unicode by hand using the right encoding name.
I thought that was what I said, but the text was garbled. let's try again:
if the programmer wants to convert between a unicode string and a buffer containing encoded text, she needs to spell it out. the codecs are never called "under the hood"
Again and again... The original intent of the Unicode integration was to make Unicode and 8-bit strings interoperate without too much user intervention. At a cost (the UTF-8 encoding), but then if you do use this encoding (and this is not far fetched, since there are input sources which do return UTF-8, e.g. TCL), the Unicode implementation will apply all its knowledge in order to keep you satisfied. If you don't like this, you can always apply explicit conversion calls wherever needed.

Latin-1 and UTF-8 are not compatible; the conversion is very likely to cause an exception, so the user will indeed be informed about this failure.
The whole point we are talking about here is that when having the implementation convert a string to Unicode all by itself it needs to know which encoding to use. This is where we have decided long ago that UTF-8 should be used.
does "long ago" mean that the decision cannot be questioned? what's going on here?
face it, I don't want to guess when and how the interpreter will convert strings for me. after all, this is Python, not Perl.
if I want to convert from a "string of characters" to a byte buffer using a certain character encoding, let's make that explicit.
Hey, there's nothing which prevents you from doing so explicitly.
Python doesn't convert between other data types for me, so why should strings be a special case?
Sure it does: 1.5 + 2 == 3.5, 2L + 3 == 5L, etc...
The pragma discussion is about a totally different issue: pragmas could make it possible for the programmer to tell the *compiler* which encoding to use for literal u"unicode" strings -- nothing more. Since "8-bit" strings currently don't have an encoding attached to them we store them as-is.
what do I have to do to make you read my proposal?
shout?
okay, I'll try:
THERE SHOULD BE JUST ONE INTERNAL CHARACTER SET IN PYTHON 1.6: UNICODE.
Please don't shout... simply read on...

Note that you are again arguing for using Latin-1 as the default encoding -- why don't you simply make this fact explicit?
for consistency, let this be true for both 8-bit and 16-bit strings (as well as Py3K's 31-bit strings ;-).
there are many possible external string encodings, just like there are many possible external integer encodings. but for integers, that's not something that the core implementation cares much about. why are strings different?
I don't want to get into designing a completely new character container type here... this can all be done for Py3K, but not now -- it breaks things at too many ends (even though it would solve the issues with strings being used in different contexts).
you don't need to -- you only need to define how the *existing* string type should be used. in my proposal, it can be used in two ways:
-- as a string of unicode characters (restricted to the 0-255 subset, by obvious reasons). given a string 's', len(s) is always the number of characters, s[i] is the i'th character, etc.
or
-- as a buffer containing binary bytes. given a buffer 'b', len(b) is always the number of bytes, b[i] is the i'th byte, etc.
this is one flavour less than in the 1.6 alphas -- where strings sometimes contain UTF-8 (and methods like upper etc doesn't work), sometimes an 8-bit character set (and upper works), and sometimes binary buffers (for which upper doesn't work).
Strings always contain data -- there's no encoding attached to them. If the user calls .upper() on a binary string the output will most probably no longer be usable... but that's the programmer's fault, not the string type's fault.
(hmm. I've said all this before, haven't I?)
You know as well as I do that the existing string type is used for both binary and text data. You cannot simply change this by introducing some new definition of what should be stored in buffers and what in strings... not until we officially redefine these things, say in Py3K ;-)
frankly, I'm beginning to feel like John Skaller. do I have to write my own interpreter to get this done right? :-(
No, but you should have started this discussion in late November last year... not now, when everything has already been implemented and people are starting to use the code that's there with great success.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/

This is exactly the same as proposing to change the default encoding to Latin-1.
no, it isn't. here's what I'm proposing:

-- the internal character set is unicode, and nothing but unicode. in 1.6, this applies to strings. in 1.7 or later, it applies to source code as well.

-- the default source encoding is "unknown"

-- there is no other default encoding. all strings use the unicode character set.

to give you some background, let's look at section 3.2 of the existing language definition:

    [Sequences] represent finite ordered sets indexed by natural
    numbers. The built-in function len() returns the number of items
    of a sequence. When the length of a sequence is n, the index set
    contains the numbers 0, 1, ..., n-1. Item i of sequence a is
    selected by a[i].

    An object of an immutable sequence type cannot change once it is
    created.

    The items of a string are characters. There is no separate
    character type; a character is represented by a string of one
    item. Characters represent (at least) 8-bit bytes. The built-in
    functions chr() and ord() convert between characters and
    nonnegative integers representing the byte values. Bytes with the
    values 0-127 usually represent the corresponding ASCII values, but
    the interpretation of values is up to the program. The string data
    type is also used to represent arrays of bytes, e.g., to hold data
    read from a file.

(in other words, given a string s, len(s) is the number of characters in the string. s[i] is the i'th character. len(s[i]) is 1. etc. the existing string type doubles as byte arrays, where given an array b, len(b) is the number of bytes, b[i] is the i'th byte, etc).

my proposal boils down to a few small changes to the last three sentences in the definition. basically, change "byte value" to "character code" and "ascii" to "unicode":

    The built-in functions chr() and ord() convert between characters
    and nonnegative integers representing the character codes.
    Character codes usually represent the corresponding unicode
    values. The 8-bit string data type is also used to represent
    arrays of bytes, e.g., to hold data read from a file.

that's all. the rest follows from this.

...

just a few quickies to sort out common misconceptions:
I don't have anything against that (being a native Latin-1 user :), but I would assume that other native language writers surely do: e.g. all programmers not using Latin-1 as native encoding (and there are lots of them).
the unicode folks have already made that decision. I find it very strange that we should use *another* model for the first 256 characters, just to "equally annoy everyone". (if people have a problem with the first 256 unicode characters having the same internal representation as the ISO 8859-1 set, tell them to complain to the unicode folks).
(and this is not far fetched since there are input sources which do return UTF-8, e.g. TCL), the Unicode implementation will apply all its knowledge in order to get you satisfied.
there are all sorts of input sources. major platforms like windows and java use 16-bit unicode. and Tcl has an internal unicode string type, since they realized that storing UTF-8 in 8-bit strings was horridly inefficient (they tried to do it right, of course). the internal type looks like this:

    typedef unsigned short Tcl_UniChar;

    typedef struct String {
        int numChars;
        size_t allocated;
        size_t uallocated;
        Tcl_UniChar unicode[2];
    } String;

(Tcl uses dual-ported objects, where each object can have an UTF-8 string representation in addition to the internal representation. if you change one of them, the other is recalculated on demand)

in fact, it's Tkinter that converts the return value to UTF-8, not Tcl. that can be fixed.
Python doesn't convert between other data types for me, so why should strings be a special case?
Sure it does: 1.5 + 2 == 3.5, 2L + 3 == 5L, etc...
but that's the key point: 2L and 3 are both integers, from the same set of integers. if you convert a long integer to an integer, it still contains an integer from the same set. (maybe someone can fill me in here: what's the formally correct word here? set? domain? category? universe?) also, if you convert every item in a sequence of long integers to ordinary integers, all items are still members of the same integer set. in contrast, the UTF-8 design converts between strings of characters, and arrays of bytes. unless you change the 8-bit string type to know about UTF-8, that means that you change string items from one domain (characters) to another (bytes).
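A tiny sketch of that domain change, written with present-day Python types (str for character strings, bytes for byte arrays) purely as an illustration:

    s = "æøå"                # three characters
    print(len(s), s[0])      # 3 æ -- the items are characters

    b = s.encode("utf-8")    # storing the same text as UTF-8 bytes
    print(len(b), b[0])      # 6 195 -- the items are now bytes: a different
                             # domain, and even the length has changed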
Note that you are again arguing for using Latin-1 as default encoding -- why don't you simply make this fact explicit?
nope. I'm standardizing on a character set, not an encoding. character sets are mappings between integers and characters. in this case, we use the unicode character set. encodings are ways to store strings of text as bytes in a byte array.
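The distinction is easy to demonstrate (again in present-day Python spelling, as a sketch only): the character set fixes what chr() and ord() mean, while the encoding only comes into play once the text is stored as bytes:

    c = "é"
    print(ord(c))                # 233 -- the unicode character code
    print(c.encode("latin-1"))   # b'\xe9'     -- one byte in Latin-1
    print(c.encode("utf-8"))     # b'\xc3\xa9' -- two bytes in UTF-8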
not now, when everything has already been implemented and people are starting to use the code that's there with great success.
the positive reports I've seen all rave about the codec framework. that's a great piece of work. without that, it would have been impossible to do what I'm proposing. (so what are you complaining about? it's all your fault -- if you hadn't done such a great job on that part of the code, I wouldn't have noticed the warts ;-) if you look at my proposal from a little distance, you'll realize that it doesn't really change much. all that needs to be done is to change some of the conversion stuff. if we decide to do this, I can do the work for you, free of charge. </F>
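For what it's worth, the codec framework can be exercised in just a few lines. The calls below use the codecs module as it exists in current Python (codecs.lookup returning an object with encode/decode functions), so read it as a sketch of the idea rather than the exact 1.6 API:

    import codecs

    # every conversion names its encoding explicitly
    info = codecs.lookup("iso-8859-1")
    raw = "Marc-André".encode("iso-8859-1")   # text -> bytes
    text, consumed = info.decode(raw)         # bytes -> text via the codec
    print(text, consumed)                     # Marc-André 10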

Marc> We were originally talking about proposals to integrate #pragmas
Marc> ...

Minor nit... How about we lose the "#" during these discussions so we aren't all subliminally disposed to embed pragmas in comments or to add the C preprocessor to Python? ;-)

--
Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/

Skip Montanaro wrote:
Marc> We were originally talking about proposals to integrate #pragmas
Marc> ...
Minor nit... How about we lose the "#" during these discussions so we aren't all subliminally disposed to embed pragmas in comments or to add the C preprocessor to Python? ;-)
Hmm, anything else would introduce a new keyword, I guess. And new keywords cause new scripts to fail in old interpreters even when they don't use Unicode at all and only include <whatever the name is> per convention. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Marc> Skip Montanaro wrote:
>> Minor nit... How about we lose the "#" during these discussions so
>> we aren't all subliminally disposed to embed pragmas in comments or
>> to add the C preprocessor to Python? ;-)

Marc> Hmm, anything else would introduce a new keyword, I guess. And new
Marc> keywords cause new scripts to fail in old interpreters even when
Marc> they don't use Unicode at all and only include <whatever the name
Marc> is> per convention.

My point was only that using "#pragma" (or even "pragma") sort of implies we have our eye on a solution, but I don't think we're far enough down the path of answering what we want to have any concrete ideas about how to implement it. I think this thread started (more-or-less) when Guido posted an idea that originally surfaced on the idle-dev list about using "global ..." to implement functionality like this. It's not clear to me at this point what the best course might be.

Skip

M.-A. Lemburg writes:
Hmm, anything else would introduce a new keyword, I guess. And new keywords cause new scripts to fail in old interpreters even when they don't use Unicode at all and only include <whatever the name is> per convention.
Only if the new keyword is used in the script or anything it imports. This is exactly like using new syntax (u'...') or new library features (unicode('abc', 'iso-8859-1')). I can't think of anything that gets included "by convention" that breaks anything. I don't recall a proposal that we should casually add pragmas to our scripts if there's no need to do so. Adding pragmas to library modules is *not* part of the issue; they'd only be there if the version of Python they're part of supports the syntax. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
M.-A. Lemburg writes:
Hmm, anything else would introduce a new keyword, I guess. And new keywords cause new scripts to fail in old interpreters even when they don't use Unicode at all and only include <whatever the name is> per convention.
Only if the new keyword is used in the script or anything it imports. This is exactly like using new syntax (u'...') or new library features (unicode('abc', 'iso-8859-1')).
Right, but I would guess that people would then start using these keywords in all files per convention (so as not to trip over bugs due to wrong encodings). Perhaps I'm overcautious here...
I can't think of anything that gets included "by convention" that breaks anything. I don't recall a proposal that we should casually add pragmas to our scripts if there's no need to do so. Adding pragmas to library modules is *not* part of the issue; they'd only be there if the version of Python they're part of supports the syntax.
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
Right, but I would guess that people would then start using these keywords in all files per convention (so as not to trip over bugs due to wrong encodings).
I don't imagine the new keywords would be used by anyone that wasn't specifically interested in their effect. Code that isn't needed tends not to get written! -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives

"M.-A. Lemburg" wrote:
... The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings..."). The idea behind this is that programmers should be able to use other encodings here than the default "unicode-escape" one.
I'm totally confused about this. Are we going to allow UCS-2 sequences in the middle of Python programs that are otherwise ASCII? -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself

Paul Prescod wrote:
"M.-A. Lemburg" wrote:
... The current need for #pragmas is really very simple: to tell the compiler which encoding to assume for the characters in u"...strings..." (*not* "...8-bit strings..."). The idea behind this is that programmers should be able to use other encodings here than the default "unicode-escape" one.
I'm totally confused about this. Are we going to allow UCS-2 sequences in the middle of Python programs that are otherwise ASCII?
The idea is to make life a little easier for programmers whose native script is not easily writable using ASCII, e.g. the whole Asian world. While originally only the encoding used within the quotes of u"..." was targeted (on the i18n sig), there has now been some discussion on this list about whether to move forward in a whole new direction: that of allowing whole Python scripts to be encoded in many different encodings. The compiler will then convert the scripts first to Unicode and then to 8-bit strings as needed. Using this technique which was introduced by Fredrik Lundh we could in fact have Python scripts which are encoded in UTF-16 (two bytes per character) or other more obscure encodings. The Python interpreter would only see Unicode and Latin-1. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
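To make the u"..." case concrete: what the proposed pragma would buy is, roughly, the ability to tell the compiler which codec to run the bytes between the quotes through. The helper below is invented for illustration and written in present-day Python; no such pragma or function existed at the time:

    def build_unicode_literal(raw, declared_encoding="unicode-escape"):
        """Turn the raw source bytes of a u'...' literal into a text string."""
        return raw.decode(declared_encoding)

    # without a pragma, the default "unicode-escape" interpretation applies
    print(build_unicode_literal(b"abc\\u00e9"))           # abcé

    # a pragma naming Latin-1 would let native bytes pass straight through
    print(build_unicode_literal(b"abc\xe9", "latin-1"))   # abcé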

The idea is to make life a little easier for programmers whose native script is not easily writable using ASCII, e.g. the whole Asian world.
While originally only the encoding used within the quotes of u"..." was targeted (on the i18n sig), there has now been some discussion on this list about whether to move forward in a whole new direction: that of allowing whole Python scripts to be encoded in many different encodings. The compiler will then convert the scripts first to Unicode and then to 8-bit strings as needed.
Using this technique which was introduced by Fredrik Lundh we could in fact have Python scripts which are encoded in UTF-16 (two bytes per character) or other more obscure encodings. The Python interpreter would only see Unicode and Latin-1.
Wouldn't it make more sense to have the Python compiler *always* see UTF-8 and to use a simple preprocessor to deal with encodings? (Disclaimer: there are about 300 unread python-dev messages in my inbox still.) --Guido van Rossum (home page: http://www.python.org/~guido/)
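A rough sketch of that preprocessor idea, assuming (purely for illustration) that the encoding is declared somewhere on the first line of the file; the declaration syntax and the regular expression below are invented for this example:

    import re

    DECLARATION = re.compile(rb"coding[:=]\s*([-\w.]+)")

    def source_to_utf8(source_bytes):
        """Decode a script using its declared encoding and hand the
        compiler UTF-8; fall back to UTF-8 when nothing is declared."""
        first_line = source_bytes.split(b"\n", 1)[0]
        match = DECLARATION.search(first_line)
        encoding = match.group(1).decode("ascii") if match else "utf-8"
        return source_bytes.decode(encoding).encode("utf-8")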

Guido van Rossum wrote:
Using this technique which was introduced by Fredrik Lundh we could in fact have Python scripts which are encoded in UTF-16 (two bytes per character) or other more obscure encodings. The Python interpreter would only see Unicode and Latin-1.
Wouldn't it make more sense to have the Python compiler *always* see UTF-8 and to use a simple preprocessor to deal with encodings?
to some extent, this depends on what the "everybody" in CP4E means -- if you were to do user-testing on non-americans, I suspect "why cannot I use my own name as a variable name" might be as common as "why are SPAM and spam two different variables?". and if you're willing to address both issues in Py3K, it's much easier to use a simple internal representation, and handle encodings on the way in and out. and Py_UNICODE* strings are easier to process than UTF-8 encoded char* strings... </F>

M.-A. Lemburg writes:
The idea is to make life a little easier for programmers whose native script is not easily writable using ASCII, e.g. the whole Asian world.
While originally only the encoding used within the quotes of u"..." was targeted (on the i18n sig), there has now been some discussion on this list about whether to move forward in a whole new direction: that of allowing whole Python scripts
I had thought this was still an issue for interpretation of string contents, and really only meaningful when converting the source representations of Unicode strings to the internal representation. I see no need to change the language definition in general. Unless we *really* want to impose those evil trigraph sequences from C! ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives

Fred Drake wrote:
While originally only the encoding used within the quotes of u"..." was targeted (on the i18n sig), there has now been some discussion on this list about whether to move forward in a whole new direction: that of allowing whole Python scripts
I had thought this was still an issue for interpretation of string contents, and really only meaningful when converting the source representations of Unicode strings to the internal representation.
why restrict the set of possible source encodings to ASCII compatible 8-bit encodings? (or are there really authoring systems out there that can use different encodings for different parts of the file?)
I see no need to change the language definition in general. Unless we *really* want to impose those evil trigraph sequences from C! ;)
sorry, but I don't see the connection. </F>

Fredrik Lundh writes:
why restrict the set of possible source encodings to ASCII compatible 8-bit encodings?
I'm not suggesting that. I just don't see any call to change the language definition (such as allowing additional characters in NAME tokens). I don't mind whatsoever if the source is stored in UCS-2, and the tokenizer does need to understand that to create the right value for Unicode strings specified as u'...' literals.
(or are there really authoring systems out there that can use different encodings for different parts of the file?)
Not that I know of, and I doubt I'd want to see the result! -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives

My vote is all or nothing. Either the whole file is in UCS-2 (for example) or none of it is. I'm not sure if we really need to allow multiple file encodings in version 1.6, but we do need to allow that ultimately.

If we agree to allow the whole file to be in another encoding then we should use the XML trick of having a known start-sequence for encodings other than UTF-8 (see the sketch after this message). It doesn't matter much whether it is syntactically a comment or a pragma.

I am still in favor of compile-time pragmas but they can probably wait for Python 1.7.
Using this technique which was introduced by Fredrik Lundh we could in fact have Python scripts which are encoded in UTF-16 (two bytes per character) or other more obscure encodings. The Python interpreter would only see Unicode and Latin-1.
In what sense is Latin-1 not Unicode? Isn't it just the first 256 characters of Unicode or something like that?

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself

    [In retrospect] the story of a Cold War that was the scene of
    history's only nuclear arms race will be very different from the
    story of a Cold War that turned out to be only the first of many
    interlocking nuclear arms races in many parts of the world. The
    nuclear question, in sum, hangs like a giant question mark over
    our waning century.
        - The Unfinished Twentieth Century by Jonathan Schell
          Harper's Magazine, January 2000
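The "known start-sequence" trick mentioned above works roughly like this; the byte patterns checked here are just the common byte-order marks, not a full reimplementation of the XML detection rules:

    def sniff_encoding(head):
        """Guess an encoding family from the first bytes of a file."""
        if head.startswith(b"\xef\xbb\xbf"):
            return "utf-8"        # UTF-8 byte order mark
        if head.startswith(b"\xfe\xff"):
            return "utf-16-be"    # big-endian byte order mark
        if head.startswith(b"\xff\xfe"):
            return "utf-16-le"    # little-endian byte order mark
        return "utf-8"            # no start-sequence: assume the default

    print(sniff_encoding(b"\xff\xfev\x00o\x00t\x00e\x00"))   # utf-16-le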

Paul Prescod wrote:
My vote is all or nothing. Either the whole file is in UCS-2 (for example) or none of it is.
agreed.
In what sense is Latin-1 not Unicode? Isn't it just the first 256 characters of Unicode or something like that?
yes. ISO Latin-1 is unicode. what MAL really meant was that the interpreter would only deal with 8-bit (traditional) or 16-bit (unicode) strings. (in my string type proposals, the same applies to text strings manipulated by the user. if it's not unicode, it's a byte array, and methods expecting text don't work) </F>
participants (13)
- Andrew M. Kuchling
- Da Silva, Mike
- esr@thyrsus.com
- Fred L. Drake, Jr.
- Fredrik Lundh
- Guido van Rossum
- gvwilson@nevex.com
- Ka-Ping Yee
- M.-A. Lemburg
- Mark Hammond
- Moshe Zadka
- Paul Prescod
- Skip Montanaro