Removing the implicit str() call from the printing API
So far, no one has commented on this idea.
I would like to go ahead and check in a patch which passes Unicode objects through to the file object's .write() method, while leaving the standard str() call in place for all other objects.
I'm behind this in principle. Here's an example of why:
    >>> import codecs
    >>> tokyo_utf8 = "??"   # the kanji for Tokyo, trust me...
    >>> print tokyo_utf8    # this is 8-bit and prints fine
    東京
    >>> tokyo_uni = codecs.utf_8_decode(tokyo_utf8)[0]
    >>> print tokyo_uni     # try to print the kanji
    Traceback (innermost last):
      File "<interactive input>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)
Let's say I am generating HTML pages and working with Unicode strings containing data > 127. It is far more natural to write a lot of print statements than to have to (a) concatenate all my strings or (b) do this on every line that prints something:

    print tokyo_uni.encode(my_encoding)

We could trivially make a file object which knows to convert its output to, say, Shift-JIS, or even redirect sys.stdout to such an object. Then we could just print Unicode strings to it. Effectively, the decision on whether a string is printable is deferred to the printing device. I think this is a good pattern, and one which encourages people to work in Unicode.

I know nothing of the Python internals and cannot help weigh up how serious the breakage would be, but it would be a logical feature to add.

- Andy Robinson
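[To make the wrapper idea above concrete, here is a minimal sketch in the Python 2 idiom of the thread. The class name EncodedWriter and its details are illustrative only, not part of any proposal; note that a plain print through it still depends on the pass-through change under discussion, since print otherwise calls str() before write() ever sees the Unicode object.]

    import sys, codecs

    class EncodedWriter:
        """Hypothetical wrapper: encodes Unicode objects before writing."""
        def __init__(self, stream, encoding):
            self.stream = stream
            # look up the encoder once, at wrap time, not on every write
            self.encode = codecs.lookup(encoding)[0]
        def write(self, data):
            if type(data) is type(u''):
                data = self.encode(data)[0]   # the encoder returns (bytes, length)
            self.stream.write(data)

    out = EncodedWriter(sys.stdout, 'utf-8')
    out.write(tokyo_uni)          # works today via an explicit .write() call
    sys.stdout = EncodedWriter(sys.__stdout__, 'utf-8')
    print tokyo_uni               # would work once print passes Unicode through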
On Sat, 10 Feb 2001, Andy Robinson wrote:
So far, no one has commented on this idea.
I would like to go ahead and check in a patch which passes Unicode objects through to the file object's .write() method, while leaving the standard str() call in place for all other objects.
I'm behind this in principle. Here's an example of why:
    >>> import codecs
    >>> tokyo_utf8 = "??"   # the kanji for Tokyo, trust me...
    >>> print tokyo_utf8    # this is 8-bit and prints fine
    東京
    >>> tokyo_uni = codecs.utf_8_decode(tokyo_utf8)[0]
    >>> print tokyo_uni     # try to print the kanji
    Traceback (innermost last):
      File "<interactive input>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)
Something like the following looks reasonable to me; the added complexity is that the file object now remembers an encoder/decoder pair in its state (the API might give the appearance of remembering just the codec name, but we want to avoid doing codecs.lookup() on every write), and uses it whenever write() is passed a Unicode object.

    >>> file = open('outputfile', 'w', 'utf-8')
    >>> file.encoding
    'utf-8'
    >>> file.write(tokyo_uni)   # tokyo_utf8 gets written to the file
    >>> file.close()

Open questions:

- If an encoding is specified, should file.read() then always return Unicode objects?
- If an encoding is specified, should file.write() only accept Unicode objects and not bytestrings?
- Is the encoding attribute mutable? (I would prefer not, but then how to apply an encoding to sys.stdout?)

Side question: i noticed that the Lib/encodings directory supports quite a few code pages, including Greek and Russian, but there are no ISO-2022 CJK or JIS codecs. Is this just because no one felt like writing one, or is there a reason not to include one? It seems to me it might be nice to include some codecs for the most common CJK encodings -- that recent note on the popularity of Python in Korea comes to mind.

-- ?!ng

Happiness comes more from loving than being loved; and often when our affection seems wounded it is only our vanity bleeding. To love, and to be hurt often, and to love again -- this is the brave and happy life.
    -- J. E. Buchrose
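[For illustration only, the semantics sketched above could be emulated today with a wrapper class rather than a change to the builtin open(). The class below is my own sketch with a hypothetical name, and it answers the open questions one particular way: read() always returns Unicode, write() accepts both kinds of string, and the encoding attribute is fixed at construction. The codec pair lives in the object's state, so codecs.lookup() runs exactly once.]

    import codecs

    class EncodingFile:
        """Hypothetical emulation of the proposed open(name, mode, encoding)."""
        def __init__(self, filename, mode, encoding):
            self.file = open(filename, mode)
            self.encoding = encoding                  # the visible attribute
            # the real state: the encoder/decoder pair, looked up once
            self._encode, self._decode = codecs.lookup(encoding)[:2]
        def write(self, data):
            if type(data) is type(u''):
                data = self._encode(data)[0]
            self.file.write(data)
        def read(self, size=-1):
            # naive: a real reader must buffer partial multibyte sequences
            return self._decode(self.file.read(size))[0]
        def close(self):
            self.file.close()

    f = EncodingFile('outputfile', 'w', 'utf-8')
    f.write(tokyo_uni)    # tokyo_utf8 lands in the file
    f.close()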
Ka-Ping Yee wrote:
Open questions:
- If an encoding is specified, should file.read() then always return Unicode objects?
- If an encoding is specified, should file.write() only accept Unicode objects and not bytestrings?
- Is the encoding attribute mutable? (I would prefer not, but then how to apply an encoding to sys.stdout?)
Right now, codecs.open returns an instance of codecs.StreamReaderWriter, not a native file object. It has methods that look like the ones on a file, but they typically accept or return Unicode strings instead of binary ones. This feels right to me, and it is what Java does: if you want to switch the encoding on sys.stdout, you are not really doing anything to the file object, just switching the wrapper you use. There is much discussion on the i18n-sig at the moment about 'unifying' binary and Unicode strings.
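[Concretely, with the codecs API as it already stands -- tokyo_uni as in the earlier example:]

    import sys, codecs

    # codecs.open returns a StreamReaderWriter around the underlying file:
    # its write() accepts Unicode objects and its read() returns them.
    f = codecs.open('outputfile', 'w', 'utf-8')
    f.write(tokyo_uni)
    f.close()

    f = codecs.open('outputfile', 'r', 'utf-8')
    data = f.read()       # a Unicode string, not a byte string
    f.close()

    # Applying an encoding to sys.stdout likewise means wrapping, not
    # mutating: lookup()'s fourth element is the StreamWriter factory.
    # (A plain print through it still needs the proposed pass-through.)
    writer = codecs.lookup('utf-8')[3]
    sys.stdout = writer(sys.stdout)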
Ka-Ping Yee wrote:
Side question: i noticed that the Lib/encodings directory supports quite a few code pages, including Greek and Russian, but there are no ISO-2022 CJK or JIS codecs. Is this just because no one felt like writing one, or is there a reason not to include one? It seems to me it might be nice to include some codecs for the most common CJK encodings -- that recent note on the popularity of Python in Korea comes to mind.
There have been 3 contributions of Asian codecs on the i18n-sig in the last six months (pythoncodecs.sourceforge.net): one Chinese, two Japanese and one Korean -- but some authors are uncomfortable with Python-style licenses. They need tying together into one integrated package with a test suite. After a five-month project which tied me up, I have finally started looking at this. The general feeling was that the Asian codecs package should be an optional download, but if we can get them fully tested and do some compression magic, it would be nice to get them in the box one day.

- Andy Robinson
participants (2)
- Andy Robinson
- Ka-Ping Yee