Pre-PEP: Python Character Model
I went to a very interesting talk about internationalization by Tim Bray, one of the editors of the XML spec and a real expert on i18n. It inspired me to wrestle one more time with the architectural issues in Python that are preventing us from saying that it is a really internationalized language. Those geek cruises aren't just about sun, surf and sand. There's a pretty high level of intellectual give and take also! Email me for more info...

Anyhow, we deferred many of these issues (probably out of exhaustion) the last time we talked about it, but we cannot and should not do so forever. In particular, I do not think that we should add more features for working with Unicode (e.g. unichr) before thinking through the issues.

-----

Abstract

Many of the world's written languages have more than 255 characters. Therefore Python is out of date in its insistence that "basic strings" are lists of characters with ordinals between 0 and 255. Python's basic character type must allow at least enough digits for Eastern languages.

Problem Description

Python's western bias stems from a variety of issues. The first problem is that Python's native character type is an 8-bit character. You can see that it is an 8-bit character by trying to insert a value with an ordinal higher than 255. Python should allow for ordinal numbers up to at least the size of a single Eastern language such as Chinese or Japanese.

Whenever a Python file object is "read", it returns one of these lists of 8-bit characters. The standard file object "read" method can never return a list of Chinese or Japanese characters. This is an unacceptable state of affairs in the 21st century.

Goals

1. Python should have a single string type. It should support Eastern characters as well as it does European characters. Operationally speaking:

   type("") == type(chr(150)) == type(chr(1500)) == type(file.read())

2. It should be easier and more efficient to encode and decode information being sent to and retrieved from devices.

3. It should remain possible to work with the byte-level representation. This is sometimes useful for performance reasons.

Definitions

Character Set

A character set is a mapping from integers to characters. Note that both integers and characters are abstractions. In other words, a decision to use a particular character set does not in any way mandate a particular implementation or representation for characters. In Python terms, a character set can be thought of as no more or less than a pair of functions: ord() and chr(). ASCII, for instance, is a pair of functions defined only for 0 through 127, and ISO Latin 1 is defined only for 0 through 255. Character sets typically also define a mapping from characters to names of those characters in some natural language (often English) and to a simple graphical representation that native language speakers would recognize.

It is not possible to have a concept of "character" without having a character set. After all, characters must be chosen from some repertoire and there must be a mapping from characters to integers (defined by ord).

Character Encoding

A character encoding is a mechanism for representing characters in terms of bits. Character encodings are only relevant when information is passed from Python to some system that works with the characters in terms of representation rather than abstraction. Just as a Python programmer would not care about the representation of a long integer, they should not care about the representation of a string. Understanding the distinction between an abstract character and its bit-level representation is essential to understanding this Python character model. A Python programmer does not need to know or care whether a long integer is represented as twos complement, ones complement or in terms of ASCII digits. Similarly, a Python programmer does not need to know or care how characters are represented in memory.
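The "pair of functions" view of a character set can be sketched directly in Python. This is an illustration only; the names ascii_chr and ascii_ord are invented here, not part of any proposed API.

```python
# A character set modeled as nothing more than a pair of functions, as
# the Definitions section describes. ASCII is defined only for 0..127,
# so both functions refuse anything outside that range.
def ascii_chr(n):
    if not 0 <= n <= 127:
        raise ValueError("ordinal not in the ASCII character set")
    return chr(n)

def ascii_ord(c):
    n = ord(c)
    if n > 127:
        raise ValueError("character not in the ASCII character set")
    return n
```

For example, ascii_chr(65) yields "A", while ascii_chr(300) raises an error, precisely because ASCII's mapping is undefined there.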
We might even change the representation over time to achieve higher performance.

Universal Character Set

There is only one standardized international character set that allows for mixed-language information. It is called the Universal Character Set and it is logically defined for characters 0 through 2^32 but practically is deployed for characters 0 through 2^16. The Universal Character Set is an international standard in the sense that it is standardized by ISO and has the force of law in international agreements.

A popular subset of the Universal Character Set is called Unicode. The most popular subset of Unicode is called the "Unicode Basic Multilingual Plane (Unicode BMP)". The Unicode BMP has space for all of the world's major languages including Chinese, Korean, Japanese and Vietnamese. There are 2^16 characters in the Unicode BMP.

The Unicode BMP subset of UCS is becoming a de facto standard on the Web. In any modern browser you can create an HTML or XML document with the character reference &#301; and get back a rendered version of Unicode character 301 (ĭ). In other words, Unicode is becoming the de facto character set for the Internet in addition to being the officially mandated character set for international commerce.

In addition to defining ord() and chr(), Unicode provides a database of information about characters. Each character has an English language name, a classification (letter, number, etc.), a "demonstration" glyph and so forth.

The Unicode Controversy

Unicode is not entirely uncontroversial. In particular, there are Japanese speakers who dislike the way Unicode merges characters from various languages that were considered "the same" by the experts that defined the specification. Nevertheless, Unicode is in use as the character set for important Japanese software such as the two most popular word processors, Ichitaro and Microsoft Word. Other programming languages have also moved to use Unicode as the basic character set instead of ASCII or ISO Latin 1.
From memory, I believe that this is the case for: Java, Perl, JavaScript, Visual Basic and TCL. XML is also Unicode based. Note that the difference between all of these languages and Python is that Unicode is the *basic* character type. Even when you type ASCII literals, they are immediately converted to Unicode.

It is the author's belief that this "running code" is evidence of Unicode's practical applicability. Arguments against it seem more rooted in theory than in practical problems. On the other hand, this belief is informed by those who have done heavy work with Asian characters and not based on my own direct experience.

Python Character Set

As discussed before, Python's native character set happens to consist of exactly 256 characters. If we increase the size of Python's character set, no existing code would break and there would be no cost in functionality. Given that Unicode is a standard character set and it is richer than Python's, Python should move to that character set. Once Python moves to that character set it will no longer be necessary to have a distinction between "Unicode string" and "regular string." This means that Unicode literals and escape codes can also be merged with ordinary literals and escape codes. unichr can be merged with chr.

Character Strings and Byte Arrays

Two of the most common constructs in computer science are strings of characters and strings of bytes. A string of bytes can be represented as a string of characters with ordinals between 0 and 255. Therefore the only reason to have a distinction between Unicode strings and byte strings is for implementation simplicity and performance purposes. This distinction should only be made visible to the average Python programmer in rare circumstances.

Advanced Python programmers will sometimes care about true "byte strings". They will sometimes want to build and parse information according to its representation instead of its abstract form. This should be done with byte arrays.
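The claim that a string of bytes can be represented as a string of characters between 0 and 255 is exactly the ISO Latin 1 identity mapping. A small illustration, using the str/bytes model of modern Python only as a stand-in:

```python
# Each byte maps to the character with the same ordinal and back again,
# which is the "byte string as character string" equivalence the text
# relies on. Latin-1 is precisely this identity mapping.
raw = bytes(range(256))
as_chars = raw.decode("latin-1")           # characters 0..255
assert [ord(c) for c in as_chars] == list(range(256))
assert as_chars.encode("latin-1") == raw   # lossless round trip
```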
It should be possible to read bytes from and write bytes to arrays. It should also be possible to use regular expressions on byte arrays.

Character Encodings for I/O

Information is typically read from devices such as file systems and network cards one byte at a time. Unicode BMP characters can have values up to 2^16 (or even higher, if you include all of UCS). There is a fundamental disconnect there. Each character cannot be represented as a single byte anymore. To solve this problem, there are several "encodings" for large characters that describe how to represent them as a series of bytes.

Unfortunately, there is not one, single, dominant encoding. There are at least a dozen popular ones including ASCII (which supports only 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO "extended ASCII" family (which support different European scripts), UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by Java and Windows), Shift-JIS (preferred in Japan) and so forth. This means that the only safe way to read data from a file into Python strings is to specify the encoding explicitly.

Python's current assumption is that each byte translates into a character of the same ordinal. This is only true for "ISO Latin 1". Python should require the user to specify this explicitly instead. Any code that does I/O should be changed to require the user to specify the encoding that the I/O should use. It is the opinion of the author that there should be no default encoding at all. If you want to read ASCII text, you should specify ASCII explicitly. If you want to read ISO Latin 1, you should specify it explicitly.

Once data is read into Python objects the original encoding is irrelevant. This is similar to reading an integer from a binary file, an ASCII file or a packed decimal file. The original bits-and-bytes representation of the integer is disconnected from the abstract representation of the integer object.
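The point that one abstract string has many concrete byte-level representations can be made concrete (modern Python shown purely for illustration):

```python
# One abstract four-character string, several concrete representations.
s = "abc\u0101"  # 'a', 'b', 'c', LATIN SMALL LETTER A WITH MACRON

assert s.encode("utf-8") == b"abc\xc4\x81"                   # variable width
assert s.encode("utf-16-be") == b"\x00a\x00b\x00c\x01\x01"   # 2 bytes each

# An encoding that cannot represent a character reports an error
# instead of silently corrupting the data.
try:
    s.encode("ascii")
except UnicodeEncodeError:
    pass
```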
Proposed I/O API

This encoding could be chosen at various levels. In some applications it may make sense to specify the encoding on every read or write as an extra argument to the read and write methods. In most applications it makes more sense to attach that information to the file object as an attribute and have the read and write methods default the encoding to the attribute's value. This attribute value could be initially set as an extra argument to the "open" function. Here is some Python code demonstrating a proposed API:

    fileobj = fopen("foo", "r", "ASCII")         # only accepts values < 128
    fileobj2 = fopen("bar", "r", "ISO Latin 1")  # byte-values "as is"
    fileobj3 = fopen("baz", "r", "UTF-8")
    fileobj2.encoding = "UTF-16"                 # changed my mind!
    data = fileobj2.read(1024, "UTF-8")          # changed my mind again

For efficiency, it should also be possible to read raw bytes into a memory buffer without doing any interpretation:

    moredata = fileobj2.readbytes(1024)

This will generate a byte array, not a character string. This is logically equivalent to reading the file as "ISO Latin 1" (which happens to map bytes to characters with the same ordinals) and generating a byte array by copying characters to bytes, but it is much more efficient.

Python File Encoding

It should be possible to create Python files in any of the common encodings that are backwards compatible with ASCII. This includes ASCII itself, all language-specific "extended ASCII" variants (e.g. ISO Latin 1), Shift-JIS and UTF-8, which can actually encode any UCS character value. The precise variant of "super-ASCII" must be declared with a specialized comment that precedes any lines other than the shebang line if present. It has a syntax like this:

    #?encoding="UTF-8"
    #?encoding="ISO-8859-1"
    ...
    #?encoding="ISO-8859-9"
    #?encoding="Shift_JIS"

For now, this is the complete list of legal encodings. Others may be added in the future.
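The proposed fopen() is close to what the codecs module already offers (as is pointed out later in this thread). A minimal sketch, where the wrapper itself is an assumption and only the name fopen comes from the proposal:

```python
import codecs

# Hypothetical sketch of the proposed fopen(): like open(), but with a
# required encoding argument, returning an encoding-aware file object.
# codecs.open() does the heavy lifting here.
def fopen(filename, mode, encoding):
    return codecs.open(filename, mode, encoding=encoding)
```

A file opened this way returns decoded character strings from read() rather than raw bytes, which is exactly the behavior the proposed API asks for.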
Python files which use non-ASCII characters without defining an encoding should be immediately deprecated and made illegal in some future version of Python.

C APIs

The only time representation matters is when data is being moved from Python's internal model to something outside of Python's control or vice versa. Reading and writing from a device is a special case discussed above. Sending information from Python to C code is also an issue. Python already has a rule that allows the automatic conversion of characters up to 255 into their C equivalents. Once the Python character type is expanded, characters outside of that range should trigger an exception (just as converting a large long integer to a C int triggers an exception).

Some might claim it is inappropriate to presume that the character-for-byte mapping is the correct "encoding" for information passing from Python to C. It is best not to think of it as an encoding. It is merely the most straightforward mapping from a Python type to a C type. In addition to being straightforward, I claim it is the best thing for several reasons:

* It is what Python already does with string objects (but not Unicode objects).
* Once I/O is handled "properly" (see above), it should be extremely rare to have characters in strings above 128 that mean anything OTHER than character values. Binary data should go into byte arrays.
* It preserves the length of the string so that the length C sees is the same as the length Python sees.
* It does not require us to make an arbitrary choice of UTF-8 versus UTF-16.
* It means that C extensions can be internationalized by switching from C's char type to wchar_t and switching from the string format code to the Unicode format code.

Python's built-in modules should migrate from char to wchar_t (aka Py_UNICODE) over time. That is, more and more functions should support characters greater than 255 over time.
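The character-for-byte mapping described here is observable in today's Python as the Latin-1 codec, used below only to illustrate the rule: ordinals 0..255 pass through unchanged, and anything larger triggers an exception, just as the text proposes for conversions to C:

```python
# The straightforward character-to-byte mapping: ordinals 0..255 pass
# through unchanged; larger ordinals raise instead of being guessed at.
assert "\xff".encode("latin-1") == b"\xff"   # ordinal 255 -> byte 255

try:
    "\u0100".encode("latin-1")               # ordinal 256: no byte fits
    raise AssertionError("should have raised")
except UnicodeEncodeError:
    pass
```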
Rough Implementation Requirements

Combine String and Unicode Types: The StringType and UnicodeType objects should be aliases for the same object. All PyString_* and PyUnicode_* functions should work with objects of this type.

Remove Unicode String Literals: Ordinary string literals should allow large character escape codes and generate Unicode string objects. Unicode objects should "repr" themselves as Python string objects. Unicode string literals should be deprecated.

Generalize C-level Unicode conversion: The format string "S" and the PyString_AsString functions should accept Unicode values and convert them to character arrays by converting each value to its equivalent byte-value. Values greater than 255 should generate an exception.

New function: fopen: fopen should be like Python's current open function except that it should allow and require an encoding parameter. The file objects returned by it should be encoding aware. fopen should be considered a replacement for open. open should eventually be deprecated.

Add byte arrays: The regular expression library should be generalized to handle byte arrays without converting them to Python strings. This will allow those who need to work with bytes to do so more efficiently. In general, it should be possible to use byte arrays wherever it is possible to use strings. Byte arrays could be thought of as a special kind of "limited but efficient" string. Arguably we could go so far as to call them "byte strings" and reuse Python's current string implementation. The primary differences would be in their "repr", "type" and literal syntax. In a sense we would have kept the existing distinction between Unicode strings and 8-bit strings but made Unicode the "default" and provided 8-bit strings as an efficient alternative.
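The "regular expressions over byte arrays" requirement was eventually met; a small illustration in modern Python, where the re module accepts bytes patterns directly:

```python
import re

# Searching byte-level data without ever decoding it into characters,
# as the "Add byte arrays" requirement asks for.
data = b"\x00\x01MAGIC\x02\x03MAGIC\x04"
assert re.findall(b"MAGIC", data) == [b"MAGIC", b"MAGIC"]
assert re.search(b"MAGIC", data).start() == 2
```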
Appendix: Using Non-Unicode character sets

Let's presume that a linguistics researcher objected to the unification of Han characters in Unicode and wanted to invent a character set that included separate characters for all Chinese, Japanese and Korean character sets. Perhaps they also want to support some non-standard character set like Klingon. Klingon is actually scheduled to become part of Unicode eventually, but let's presume it wasn't. This section will demonstrate that this researcher is no worse off under the new system than they were under historical Python. Adopting Unicode as a standard has no down-side for someone in this situation. They have several options under the new system:

1. Ignore Unicode

Read in the bytes using the encoding "RAW", which would mean that each byte would be translated into a character between 0 and 255. It would be a synonym for ISO Latin 1. Now you can process the data using exactly the same Python code that you would have used in Python 1.5 through Python 2.0. The only difference is that the in-memory representation of the data MIGHT be less space efficient because Unicode characters MIGHT be implemented internally as 16 or 32 bit integers. This solution is the simplest and easiest to code.

2. Use Byte Arrays

As discussed earlier, a byte array is like a string where the characters are restricted to characters between 0 and 255. The only virtues of byte arrays are that they enforce this rule and they can be implemented in a more memory-efficient manner. According to the proposal, it should be possible to load data into a byte array (or "byte string") using the "readbytes" method. This solution is the most efficient.

3. Use Unicode's Private Use Area (PUA)

Unicode is an extensible standard. There are certain character codes reserved for private use between consenting parties. You could map characters like Klingon or certain Korean ideographs into the private use area.
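Option 3 can be sketched concretely; the repertoire and its names below are invented purely for illustration:

```python
# Mapping a private character repertoire into the BMP Private Use Area
# (U+E000..U+F8FF). The repertoire and its names are hypothetical.
PUA_BASE = 0xE000
repertoire = ["klingon-kahk", "klingon-tlhong"]   # invented names
name_to_char = {name: chr(PUA_BASE + i) for i, name in enumerate(repertoire)}
char_to_name = {c: n for n, c in name_to_char.items()}

# Private characters mix freely with ordinary Unicode text.
text = "before " + name_to_char["klingon-kahk"] + " after"
assert ord(text[7]) == 0xE000
assert char_to_name[text[7]] == "klingon-kahk"
```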
Obviously the Unicode character database would not have meaningful information about these characters and rendering systems would not know how to render them. But this situation is no worse than in today's Python. There is no character database for arbitrary character sets and there is no automatic way to render them.

One limitation to this approach is that the Private Use Area can only handle so many characters. The BMP PUA can hold thousands, and if we step up to "full" Unicode support we have room for hundreds of thousands. This solution gets the maximum benefit from Unicode for the characters that are defined by Unicode without losing the ability to refer to characters outside of Unicode.

4. Use A Higher Level Encoding

You could wrap Korean characters in <KOREA>...</KOREA> tags. You could describe a character as \KLINGON-KAHK (i.e. 13 Unicode characters). You could use a special Unicode character as an "escape flag" to say that the next character should be interpreted specially. This solution is the most self-descriptive and extensible.

In summary, expanding Python's character type to support Unicode characters does not restrict even the most esoteric, Unicode-hostile types of text processing. Therefore there is no basis for objecting to Unicode as some form of restriction. Those who need to use another logical character set have as much ability to do so as they always have.

Conclusion

Python needs to support international characters. The "ASCII" of internationalized characters is Unicode. Most other languages have moved or are moving their basic character and string types to support Unicode. Python should also.
[pre-PEP]

You have a lot of good points in there (also some inaccuracies) and I agree that Python should move to using Unicode for text data and arrays for binary data.

Some things you may be missing, though: Python already has support for a few features you mention, e.g. codecs.open() provides more or less what you have in mind with fopen(), and the compiler can already unify Unicode and string literals using the -U command line option.

What you don't talk about in the PEP is that Python's stdlib isn't even Unicode aware yet, and whatever unification steps we take, this project will have to precede them. The problem with making the stdlib Unicode aware is that of deciding which parts deal with text data or binary data -- the code sometimes makes assumptions about the nature of the data and at other times it simply doesn't care.

In this light I think you ought to focus your PEP on Python 3k. This will also enable better merging techniques due to the lifting of the type/class difference.

-- Marc-Andre Lemburg
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:

> [pre-PEP]
>
> You have a lot of good points in there (also some inaccuracies) and I agree that Python should move to using Unicode for text data and arrays for binary data.

That's my primary goal. If we can all agree that is the goal then we can start to design new features with that in mind. I'm overjoyed to have you on board. I'm pretty sure Fredrik agrees with the goals (probably not every implementation detail). I'll send to the i18n sig and see if I can get buy-in from Andy Robinson et al. Then it's just Guido.

> Some things you may be missing, though: Python already has support for a few features you mention, e.g. codecs.open() provides more or less what you have in mind with fopen(), and the compiler can already unify Unicode and string literals using the -U command line option.

The problem with unifying string literals without unifying string *types* is that many functions probably check for type("") and not type(u"").

> What you don't talk about in the PEP is that Python's stdlib isn't even Unicode aware yet, and whatever unification steps we take, this project will have to precede them.

I'm not convinced that is true. We should be able to figure it out quickly though.

> The problem with making the stdlib Unicode aware is that of deciding which parts deal with text data or binary data -- the code sometimes makes assumptions about the nature of the data and at other times it simply doesn't care.

Can you give an example? If the new string type is 100% backwards compatible in every way with the old string type then the only code that should break is silly code that did stuff like:

    try:
        something = chr(somethingelse)
    except ValueError:
        print "Unicode is evil!"

Note that I expect types.StringType == type(chr(10000)) etc.

> In this light I think you ought to focus your PEP on Python 3k. This will also enable better merging techniques due to the lifting of the type/class difference.

Python3K is a beautiful dream but we have problems we need to solve today. We could start moving to a Unicode future in baby steps right now. Your "open" function could be moved into builtins as "fopen". Python's "binary" open function could be deprecated under its current name and perhaps renamed. The sooner we start the sooner we finish.

You and /F laid some beautiful groundwork. Now we just need to keep up the momentum. I think we can do this without a big backwards compatibility earthquake. VB and TCL figured out how to do it...

Paul Prescod
Paul Prescod wrote:

> "M.-A. Lemburg" wrote:
>
> > [pre-PEP]
> >
> > You have a lot of good points in there (also some inaccuracies) and I agree that Python should move to using Unicode for text data and arrays for binary data.
>
> That's my primary goal. If we can all agree that is the goal then we can start to design new features with that in mind. I'm overjoyed to have you on board. I'm pretty sure Fredrik agrees with the goals (probably not every implementation detail). I'll send to the i18n sig and see if I can get buy-in from Andy Robinson et al. Then it's just Guido.

Oh, I think that everybody agrees on moving to Unicode as the basic text storage container. The question is how to get there ;-) Today we are facing a problem in that strings are also used as containers for binary data and no distinction is made between the two. We also have to watch out for external interfaces which still use 8-bit character data, so there's a lot ahead.

> > Some things you may be missing, though: Python already has support for a few features you mention, e.g. codecs.open() provides more or less what you have in mind with fopen(), and the compiler can already unify Unicode and string literals using the -U command line option.
>
> The problem with unifying string literals without unifying string *types* is that many functions probably check for type("") and not type(u"").

Well, with -U on, Python will compile "" into u"", so you can already test Unicode compatibility today... last I tried, Python didn't even start up :-(

> > What you don't talk about in the PEP is that Python's stdlib isn't even Unicode aware yet, and whatever unification steps we take, this project will have to precede them.
>
> I'm not convinced that is true. We should be able to figure it out quickly though.

We can use that knowledge to base future design upon. The problem with many stdlib modules is that they don't make a distinction between text and binary data (and often can't, e.g. take sockets), so we'll have to figure out a way to differentiate between the two. We'll also need an easy-to-use binary data type -- as you mention in the PEP, we could take the old string implementation as a basis and then perhaps turn u"" into "" and use b"" to mean what "" does now (string object).

> > The problem with making the stdlib Unicode aware is that of deciding which parts deal with text data or binary data -- the code sometimes makes assumptions about the nature of the data and at other times it simply doesn't care.
>
> Can you give an example? If the new string type is 100% backwards compatible in every way with the old string type then the only code that should break is silly code that did stuff like:
>
>     try:
>         something = chr(somethingelse)
>     except ValueError:
>         print "Unicode is evil!"
>
> Note that I expect types.StringType == type(chr(10000)) etc.

Sure, but there are interfaces which don't differentiate between text and binary data, e.g. many IO-operations don't care about what exactly they are writing or reading. We'd probably define a new set of text data APIs (meaning methods) to make this difference clear and visible, e.g. .writetext() and .readtext().

> > In this light I think you ought to focus your PEP on Python 3k. This will also enable better merging techniques due to the lifting of the type/class difference.
>
> Python3K is a beautiful dream but we have problems we need to solve today. We could start moving to a Unicode future in baby steps right now. Your "open" function could be moved into builtins as "fopen". Python's "binary" open function could be deprecated under its current name and perhaps renamed.

Hmm, I'd prefer to keep things separate for a while and then switch over to new APIs once we get used to them.

> The sooner we start the sooner we finish. You and /F laid some beautiful groundwork. Now we just need to keep up the momentum. I think we can do this without a big backwards compatibility earthquake. VB and TCL figured out how to do it...

... and we should probably try to learn from them. They have put a considerable amount of work into getting the low-level interfacing issues straight. It would be nice if we could avoid adding more conversion magic...

-- Marc-Andre Lemburg
"M.-A. Lemburg" wrote:

> ...
> Oh, I think that everybody agrees on moving to Unicode as the basic text storage container.

The last time we went around there was an anti-Unicode faction who argued that adding Unicode support was fine but making it the default would inconvenience Japanese users.

> ...
> Well, with -U on, Python will compile "" into u"", so you can already test Unicode compatibility today... last I tried, Python didn't even start up :-(

I'm going to say again that I don't see that as a test of Unicode-compatibility. It is a test of compatibility with our existing Unicode object. If we simply allowed string objects to support higher character numbers I *cannot see* how that could break existing code.

> ...
> We can use that knowledge to base future design upon. The problem with many stdlib modules is that they don't make a distinction between text and binary data (and often can't, e.g. take sockets), so we'll have to figure out a way to differentiate between the two. We'll also need an easy-to-use binary data type -- as you mention in the PEP, we could take the old string implementation as a basis and then perhaps turn u"" into "" and use b"" to mean what "" does now (string object).

I agree that we need all of this but I strongly disagree that there is any dependency relationship between improving the Unicode-awareness of I/O routines (sockets and files) and allowing string objects to support higher character numbers. I claim that allowing higher character numbers in strings will not break socket objects. It might simply be the case that for a while socket objects never create these higher characters. Similarly, we could improve socket objects so that they have different readtext/readbinary and writetext/writebinary methods without unifying the string objects. There are lots of small changes we can make without breaking anything. One I would like to see right now is a unification of chr() and unichr().

We are just making life harder for ourselves by walking further and further down one path when "everyone agrees" that we are eventually going to end up on another path.

> ...
> It would be nice if we could avoid adding more conversion magic...

We already have more "magic" in our conversions than we need. I don't think I'm proposing any new conversions.

Paul Prescod
> If we simply allowed string objects to support higher character numbers I *cannot see* how that could break existing code.

To take a specific example: What would you change about imp and py_compile.py? What is the type of imp.get_magic()? If character string, what about this fragment?

    import imp
    MAGIC = imp.get_magic()

    def wr_long(f, x):
        """Internal; write a 32-bit int to a file in little-endian order."""
        f.write(chr( x        & 0xff))
        f.write(chr((x >> 8)  & 0xff))
        f.write(chr((x >> 16) & 0xff))
        f.write(chr((x >> 24) & 0xff))
    ...
    fc = open(cfile, 'wb')
    fc.write('\0\0\0\0')
    wr_long(fc, timestamp)
    fc.write(MAGIC)

Would that continue to write the same file that the current version writes?

> We are just making life harder for ourselves by walking further and further down one path when "everyone agrees" that we are eventually going to end up on another path.

I think a problem of discussing on a theoretical level is that the impact of changes is not clear. You seem to claim that you want changes that have zero impact on existing programs. Can you provide a patch implementing these changes, so that others can experiment and find out whether their application would break?

Regards,
Martin
Let me say one more thing. Unicode and string types are *already widely interoperable*. You run into problems:

a) when you try to convert a character greater than 128. In my opinion this is just a poor design decision that can be easily reversed.

b) some code does an explicit check for types.StringType, which of course is not compatible with types.UnicodeType. This can only be fixed by merging the features of types.StringType and types.UnicodeType so that they can be the same object. This is not as trivial as the other fix in terms of lines of code that must change, but conceptually it doesn't seem complicated at all.

I think a lot of Unicode interoperability problems would just go away if "a" was fixed...

Paul Prescod
> a) when you try to convert a character greater than 128. In my opinion this is just a poor design decision that can be easily reversed.

Technically, you can easily expand it to 256; not that easily beyond. Then, people who put KOI8-R into their Python source code will complain that the strings come out incorrectly, even though they set their language to Russian, and even though it worked that way in earlier Python versions. Or, if they then tag their sources as KOI8-R, writing strings to a "plain" file will fail, as they have characters > 255 in the string.
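The mismatch described here can be made concrete (modern Python used purely for illustration): the same bytes decode to entirely different characters under Latin-1 and KOI8-R, and the KOI8-R reading yields ordinals above 255 that no byte-per-character output can hold.

```python
# Six bytes that spell Cyrillic "privet" under KOI8-R but decode to
# unrelated accented Latin characters under Latin-1.
raw = b"\xd0\xd2\xc9\xd7\xc5\xd4"

as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8_r")

assert as_koi8r == "\u043f\u0440\u0438\u0432\u0435\u0442"
assert all(ord(c) > 255 for c in as_koi8r)    # beyond any single byte
assert all(ord(c) <= 255 for c in as_latin1)  # fits, but wrong text
```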
> I think a lot of Unicode interoperability problems would just go away if "a" was fixed...

No, that would just open a new can of worms. Again, provide a specific patch, and I can tell you specific problems.

Regards,
Martin
Martin wrote:
To take a specific example: what would you change about imp and py_compile.py? What is the type of imp.get_magic()? If it is a character string, what about this fragment?
    import imp
    MAGIC = imp.get_magic()

    def wr_long(f, x):
        """Internal; write a 32-bit int to a file in little-endian order."""
        f.write(chr( x         & 0xff))
        f.write(chr((x >>  8)  & 0xff))
        f.write(chr((x >> 16)  & 0xff))
        f.write(chr((x >> 24)  & 0xff))
    ...
    fc = open(cfile, 'wb')
    fc.write('\0\0\0\0')
    wr_long(fc, timestamp)
    fc.write(MAGIC)
Would that continue to write the same file that the current version writes?
yes (file opened in binary mode, no encoding, no code points above 255) Cheers /F
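[Editor's note: for the curious, wr_long's byte layout can be checked against the struct module -- a quick sketch added here, not from the thread:]

```python
import struct

def wr_long_bytes(x):
    # Mirrors py_compile's wr_long: four bytes, little-endian order.
    return bytes([(x >> shift) & 0xff for shift in (0, 8, 16, 24)])

timestamp = 981234567
# For any 0 <= x < 2**32 this matches struct's little-endian unsigned int.
assert wr_long_bytes(timestamp) == struct.pack("<I", timestamp)
```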
The last time we went around there was an anti-Unicode faction who argued that adding Unicode support was fine but making it the default would inconvenience Japanese users.

Whoops, I nearly missed the biggest debate of the year!
I guess the faction was Brian and me, and our concerns were misunderstood. We can lay this to rest forever now, as the current implementation and forward direction incorporate everything I originally hoped for:

(1) Frequently you need to work with byte arrays but need a rich bunch of string-like routines: search and replace, regex etc. This applies both to non-natural-language data and also to the special case of corrupt native encodings that need repair. We loosely defined the 'string interface' in UserString, so that other people could define string-like types if they wished and so that users can expect to find certain methods and operations in both Unicode and Byte Array types. I'd be really happy one day to explicitly type x = ByteArray('some raw data') as long as I had my old friends split, join, find etc.

(2) Japanese projects often need small extensions to codecs to deal with user-defined characters. Java and VB give you some canned codecs but no way to extend them. All the Python Asian codec drafts involve 'open' code you can hack and use simple dictionaries for mapping tables; so it will be really easy to roll your own "Shift-JIS-plus" with 20 extra characters mapping to a private use area. This will be a huge win over other languages.

(3) The Unicode conversion was based on a more general notion of 'stream conversion filters' which work with bytes. This leaves the door open to writing, for example, a direct Shift-JIS-to-EUC filter which adds nothing in the case of clean data but is much more robust in the case of user-defined characters, or which can handle cleanup of misencoded data. We could also write image manipulation or crypto codecs. Some of us hope to provide general machinery for fast handling of byte-stream filters which could be useful in image processing and crypto as well as encodings.
This might need an extended or different lookup function (after all, neither end of the filter need be Unicode) but could be cleanly layered on top of the codec mechanism we have built in.

(4) I agree 100% on being explicit whenever you do I/O or conversion, and on generally using Unicode characters where possible. Defaults are evil. But we needed a compatibility route to get there. Guido has said that long term there will be Unicode strings and Byte Arrays. That's the time to require arguments to open().
Similarly, we could improve socket objects so that they have distinct readtext/readbinary and writetext/writebinary methods without unifying the string objects. There are lots of small changes we can make without breaking anything. One I would like to see right now is a unification of chr() and unichr().
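[Editor's note: the chr()/unichr() unification asked for here is exactly what Python 3 eventually did -- a Python 3 illustration added for reference, not part of the thread:]

```python
# One chr() covering the full range of ordinals, no separate unichr():
assert chr(65) == "A"             # ASCII still works
assert ord(chr(1500)) == 1500     # ordinals above 255 are fine
assert chr(0x6C34) == "\u6c34"    # a CJK character, well beyond 8 bits
```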
Here's a thought. How about BinaryFile/BinarySocket/ByteArray, which do not need an encoding, and File/Socket/String, which require explicit encodings on opening. We keep broad parity between their methods. That seems more straightforward to me than having text/binary methods, and it also provides a cleaner upgrade path for existing code.

- Andy
Here's a thought. How about BinaryFile/BinarySocket/ByteArray which do
Files and sockets often contain both string and binary data, so having StringFile and BinaryFile seems the wrong split. I'd think being able to write string and binary data to the same object is more useful, for example having methods on file and socket like file.writetext and file.writebinary. Now I can use writetext to write the HTTP headers and writebinary to write the JPEG image, say.

Barry
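[Editor's note: Barry's writetext/writebinary split could be sketched as a thin wrapper over a binary stream. This is a hypothetical API added for illustration -- the names MixedFile, writetext and writebinary are invented here, not from any Python release:]

```python
import io

class MixedFile:
    """A binary stream with separate text and binary write methods."""
    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # any binary file-like object
        self.encoding = encoding

    def writetext(self, s):
        # Text goes through an explicit encoding step...
        self.raw.write(s.encode(self.encoding))

    def writebinary(self, b):
        # ...while binary data is passed through untouched.
        self.raw.write(b)

# HTTP-style usage: text headers followed by raw JPEG bytes.
buf = io.BytesIO()
f = MixedFile(buf, encoding="ascii")
f.writetext("Content-Type: image/jpeg\r\n\r\n")
f.writebinary(b"\xff\xd8\xff")    # JPEG magic bytes
assert buf.getvalue().endswith(b"\xff\xd8\xff")
```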
[Paul Prescod discusses Unicode enhancements to Python]

Another approach being pursued, mostly in Japan, is Multilingualization (M17N), http://www.m17n.org/ This is supported by the appropriate government department (MITI) and is being worked on in some open source projects, most notably Ruby. For some messages from Yukihiro Matsumoto search deja for M17N in comp.lang.ruby.

Matz: "We don't believe there can be any single character-encoding that encompasses all the world's languages. We want to handle multiple encodings at the same time (if you want to)."

The approach taken in the next version of Ruby is for all string and regex objects to have an encoding attribute and for there to be infrastructure to handle operations that combine encodings.

One of the things that is needed in a project that tries to fulfill the needs of large character set users is to have some of those users involved in the process. When I first saw proposals to use Unicode in products at Reuters back in 1994, it looked to me (and the proposal originators) as if it could do everything anyone ever needed. It was only after strenuous and persistent argument from the Japanese and Hong Kong offices that it became apparent that Unicode just wasn't enough. A partial solution then was to include language IDs encoded in the Private Use Area. This was still being discussed when I left, but while it went some way to satisfying needs, there was still some unhappiness.

If Python could cooperate with Ruby here, then not only could code be shared but Python would gain access to developers with large character set /needs/ and experience.

Neil
Neil Hodgson wrote:
Matz: "We don't believe there can be any single character-encoding that encompasses all the world's languages. We want to handle multiple encodings at the same time (if you want to)."
neither do the unicode designers, of course: the point is that unicode only deals with glyphs, not languages. most existing japanese encodings also include language info, and if you don't understand the difference, it's easy to think that unicode sucks... I'd say we need support for *languages*, not more internal encodings.

Cheers /F
Fredrik Lundh wrote:
Neil Hodgson wrote:
Matz: "We don't believe there can be any single character-encoding that encompasses all the world's languages. We want to handle multiple encodings at the same time (if you want to)."
neither do the unicode designers, of course: the point is that unicode only deals with glyphs, not languages.
most existing japanese encodings also include language info, and if you don't understand the difference, it's easy to think that unicode sucks...
I'd say we need support for *languages*, not more internal encodings.
    print "Hello World!".encode('ascii', 'German')
    Hallo Welt!
Nice thought ;-)

Seriously, do you think that these issues are solvable at the programming language level? I think that the information needed to fully support language-specific notations is much too complicated to go into the Python core. This should be left to applications and add-on packages to figure out.

-- Marc-Andre Lemburg
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Neil Hodgson wrote:
...
Matz: "We don't believe there can be any single character-encoding that encompasses all the world's languages. We want to handle multiple encodings at the same time (if you want to)."
The approach taken in the next version of Ruby is for all string and regex objects to have an encoding attribute and for there to be infrastructure to handle operations that combine encodings.
I think Python should support as many encodings as people invent. Conceptually it doesn't cost me anything, but I'll leave the implementation to you. :)

But an encoding is only a way of *representing a character in memory or on disk*. Asking for Python to support multiple encodings in memory is like asking for it to support both two's complement and one's complement long integers. Multiple encodings can only be interesting as a performance issue, because the in-memory encoding is *transparent* to the *Python programmer*. We could support a thousand encodings internally, but a Python programmer should never know or care which one they are dealing with. Which leads me to ask "what's the point"? Would the small performance gains be worth it?
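[Editor's note: Paul's "encoding is only a representation" point is the same property that makes round-tripping safe -- whatever bytes are used on disk or in memory, decoding recovers the identical string value. A small modern-Python illustration, added here and not part of the thread:]

```python
s = "水"    # U+6C34, representable in several encodings

# Three very different byte representations...
reprs = {enc: s.encode(enc) for enc in ("utf-8", "utf-16-le", "shift_jis")}

# ...but all decode back to the very same Python string value.
assert all(b.decode(enc) == s for enc, b in reprs.items())
assert len(set(reprs.values())) == 3   # the byte forms really do differ
```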
One of the things that is needed in a project that tries to fulfill the needs of large character set users is to have some of those users involved in the process. When I first saw proposals to use Unicode in products at Reuters back in 1994, it looked to me (and the proposal originators) as if it could do everything anyone ever needed. It was only after strenuous and persistent argument from the Japanese and Hong Kong offices that it became apparent that Unicode just wasn't enough. A partial solution then was to include language IDs encoded in the Private Use Area. This was still being discussed when I left, but while it went some way to satisfying needs, there was still some unhappiness.
I think that Unicode has changed quite a bit since 1994. Nevertheless, language IDs are a fine solution. Unicode is not about distinguishing between languages -- only characters. There is no better "non-Unicode" solution that I've ever heard of.
If Python could cooperate with Ruby here, then not only could code be shared but Python would gain access to developers with large character set /needs/ and experience.
I don't see how we could meaningfully cooperate on such a core language issue. We could of course share codecs but that has nothing to do with Python's internal representation. Paul Prescod
On Wed, Feb 07, 2001 at 12:49:15PM -0800, Paul Prescod quoted:
The approach taken in the next version of Ruby is for all string and regex objects to have an encoding attribute and for there to be infrastructure to handle operations that combine encodings.
Any idea if this next version of Ruby is available in its current state, or if it's vaporware? It might be worth looking at what exactly it implements, but I wonder if this is just Matz's idea and he hasn't yet tried implementing it.
We could support a thousand encodings internally but a Python programmer should never know or care which one they are dealing with. Which leads me to ask "what's the point"? Would the small performance gains be worth it?
I'd worry that implementing a regex engine for multiple encodings would be impossible or, if possible, it would be quite slow, because you'd need to abstract every single character retrieval into a function call that decodes a single character for a given encoding. Massive surgery was required to make Perl handle UTF-8, for example, and I don't know that Perl's engine is actually fully operational with UTF-8 yet.

--amk
Andrew Kuchling:
Any idea if this next version of Ruby is available in its current state, or if it's vaporware? It might be worth looking at what exactly it implements, but I wonder if this is just Matz's idea and he hasn't yet tried implementing it.
AFAIK, 1.7 is still vaporware although the impression that I got was this was being implemented by Matz when he mentioned it in mid December. Some code may be available from CVS but I haven't been following that closely.
I'd worry that implementing a regex engine for multiple encodings would be impossible or, if possible, it would be quite slow because you'd need to abstract every single character retrieval into a function call that decodes a single character for a given encoding.
<speculation> I'd guess at some sort of type promotion system with caching to avoid extra conversions. Say you want to search a Shift-JIS string for a KOI8 string (unlikely, but they do share many characters). The infrastructure checks the character sets representable in the encodings and chooses a super-type that can include all possibilities in the expression, then promotes both arguments by reencoding and performs the operation. The super-type would likely be Unicode-based, although given Matz's desire for larger-than-Unicode character sets, it may be something else. </speculation>

Neil
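[Editor's note: Neil's promotion idea can be sketched in a few lines. This is speculative code added for illustration, using modern Python's str as the "super-type"; the promote helper is invented here:]

```python
def promote(a, a_enc, b, b_enc):
    """Re-encode two byte strings into a common super-type (here, Unicode)
    so that a mixed-encoding operation can proceed."""
    return a.decode(a_enc), b.decode(b_enc)

# Search a UTF-8 haystack for a KOI8-R needle:
hay = "Москва и 東京".encode("utf-8")
needle = "Москва".encode("koi8_r")

h, n = promote(hay, "utf-8", needle, "koi8_r")
assert n in h   # the match works once both sides share a representation
```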
participants (9)

- Andrew Kuchling
- Andy Robinson
- Barry Scott
- Fredrik Lundh
- Fredrik Lundh
- M.-A. Lemburg
- Martin v. Loewis
- Neil Hodgson
- Paul Prescod