[issue20686] Confusing statement

New submission from Daniel U. Thibault:

Near the end of 3.1.3 http://docs.python.org/2/tutorial/introduction.html#unicode-strings you can read:

"When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."

This can be interpreted as stating that printing a Unicode string (using the print function or the shell's default print behaviour) results in ASCII printout. It can likewise be interpreted as stating that any write of a Unicode string to a file converts the string to ASCII. Experimentation shows this is not true.

Perhaps you meant something like this:

"When a Unicode string is converted with str() in order to be printed or written to a file, conversion takes place using this default encoding."

Grammatical comments: In the statement "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding.", the ", or" puts the three elements of the enumeration on the same level (respectively "printed", "written to a file", and "converted with str()"). The confusion seems to arise because "with str()" was meant to apply to the list as a whole, not just its last element.

----------
assignee: docs@python
components: Documentation
messages: 211627
nosy: Daniel.U..Thibault, docs@python
priority: normal
severity: normal
status: open
title: Confusing statement
type: enhancement

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue20686>
_______________________________________

R. David Murray added the comment:

It seems to me the statement is correct as written. What experiments indicate otherwise?

----------
nosy: +r.david.murray

Georg Brandl added the comment:

The only problem I can see is that "print" uses the console encoding. For files and str(), the comment is correct for Python 2.

----------
nosy: +georg.brandl
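
The two encodings Georg distinguishes can be inspected directly. The following is a minimal sketch, not part of the original thread; the variable names are illustrative, and the values printed depend on the interpreter version and the terminal.

```python
import sys

# Encoding used for implicit text/bytes conversions such as str(u"...") on
# Python 2: 'ascii' on Python 2, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()

# Encoding of the stream that print writes to. It can be None (for example
# when Python 2 output is piped) or absent if sys.stdout has been replaced.
console_encoding = getattr(sys.stdout, "encoding", None)

print("default encoding: %s" % default_encoding)
print("console encoding: %s" % console_encoding)
```

On a UTF-8 terminal under Python 2 the two values typically differ, which is why print can show guillemets while str() raises an error.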

Daniel U. Thibault added the comment:

"It seems to me the statement is correct as written. What experiments indicate otherwise?"

Here's a simple one:

print «1»

The guillemets are certainly not ASCII (Unicode AB and BB, well outside ASCII's 7F upper limit) but are rendered as guillemets. (Guillemets are easy for me 'cause I use a French keyboard.)

I haven't actually checked yet what happens when writing to a file. If Python is unable to write anything but ASCII to file, it becomes nearly useless.

----------

R. David Murray added the comment:

Thanks, yes, Georg already pointed out the issue with print. I suppose that this is something that changed at some point in Python 2's history but this bit of the docs was not updated.

Python can write anything to a file; you just have to tell it what encoding to use, either by explicitly encoding the unicode to binary before writing it to the file, or by using codecs.open and specifying an encoding for the file. (This is all much easier in Python 3, where the unicode support is part of the core of the language.)

----------
versions: +Python 2.7
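
The two approaches mentioned above can be sketched as follows. This is an illustrative example rather than code from the thread: the file names and the temporary directory are assumptions, and UTF-8 is chosen only as an example encoding. It is written to run on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# The test string from the thread, with the guillemets written as escapes so
# the example does not depend on the source file encoding.
text = u"This is a \xabtest\xbb\n"

tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "explicit.txt")   # hypothetical file name
path_b = os.path.join(tmpdir, "codecs.txt")     # hypothetical file name

# Option 1: encode the unicode string to bytes yourself and write in binary mode.
with open(path_a, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let codecs.open do the encoding for every write.
with codecs.open(path_b, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes to compare what ended up on disk.
with open(path_a, "rb") as f:
    raw_a = f.read()
with open(path_b, "rb") as f:
    raw_b = f.read()

print(raw_a == raw_b)
```

Both files contain the same UTF-8 bytes; the only difference is where the encoding step happens.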

Daniel U. Thibault added the comment:

"The default encoding is normally set to ASCII [...]. When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."

>>> u"äöü"
u'\xe4\xf6\xfc'

Printing a Unicode string uses ASCII encoding: false (the characters are not converted to their ASCII equivalents) (compare with str(), below)

>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Converting a Unicode string with str() uses ASCII encoding: true (if print (see above) behaved like str(), you'd get an error too)

>>> f = open('workfile', 'w')
>>> f.write('This is a «test»\n')
>>> f.close()

Writing a Unicode string to a file uses ASCII encoding: false (examination of the file reveals UTF-8 characters (hex dump: 54 68 69 73 20 69 73 20 61 20 C2 AB 74 65 73 74 C2 BB 0A))

----------

R. David Murray added the comment:

re: file. You forgot the 'u' in front of the string:

>>> f.write(u'This is a «test»\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 10: ordinal not in range(128)

So you were actually writing binary in your console encoding, which must have been utf-8. (This kind of confusion is the main reason python3 exists).

----------
title: Confusing statement -> Confusing statement about unicode strings in tutorial introduction
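
The byte values in the earlier hex dump and the position reported in the traceback can both be reproduced from the same string. This is a small sketch assuming UTF-8 as the console encoding, as suggested above; the variable names are illustrative.

```python
# The string from the thread, with the guillemets written as escapes.
text = u"This is a \xabtest\xbb\n"

# Encoding as UTF-8 yields the bytes that ended up in 'workfile'.
utf8_bytes = text.encode("utf-8")
hex_dump = " ".join("%02X" % b for b in bytearray(utf8_bytes))
print(hex_dump)

# Encoding as ASCII fails at the first guillemet, as in the traceback.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(ascii_error.start)
```

The failing position is 10 because the first guillemet follows the ten ASCII characters of "This is a ".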

Daniel U. Thibault added the comment:

>>> mystring="äöü"
>>> myustring=u"äöü"
>>> mystring
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> myustring
u'\xe4\xf6\xfc'
>>> str(mystring)
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> str(myustring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> f = open('workfile', 'w')
>>> f.write(mystring)
>>> f.close()
>>> f = open('workufile', 'w')
>>> f.write(myustring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> f.close()

workfile contains C3 A4 C3 B6 C3 BC

So the Unicode string (myustring) does indeed try to convert to ASCII when written to file. But not when just printed. It seems really strange that non-Unicode strings (mystring) should actually be more flexible than Unicode strings...

----------

Georg Brandl added the comment:

First, entering a string at the command prompt like this is not considered "printing"; it's invoking the repr().

Then, when you say flexible, you say it as if it's a good thing. In this context "flexible" means as much as "easy to produce mojibake" and is not desirable.

For all these use cases, there are ways to do the right thing with Unicode strings in Python 2 (e.g. using io.open instead of builtin open). But making these the builtin case was the big gain of Python 3.

----------
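
A minimal sketch of the io.open approach, which behaves the same on Python 2.7 and Python 3. It is not code from the thread: the temporary directory is an assumption added to keep the example self-contained, and UTF-8 is an example encoding.

```python
import io
import os
import tempfile

# Same three characters as myustring in the thread, written as escapes.
text = u"\xe4\xf6\xfc"

path = os.path.join(tempfile.mkdtemp(), "workufile")

# io.open in text mode accepts unicode and encodes it with the given encoding.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Raw bytes on disk: the UTF-8 encoding of the three characters.
with io.open(path, "rb") as f:
    raw = f.read()

# Reading in text mode with the same encoding returns the original string.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)
```

Because the encoding is stated when the file is opened, the write that failed for myustring with the builtin open succeeds here, and the file holds the same bytes (C3 A4 C3 B6 C3 BC) reported for workfile.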

Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

Python 2.7 is no longer supported.

----------
nosy: +serhiy.storchaka
resolution: -> out of date
stage: -> resolved
status: open -> closed
participants (4)
- Daniel U. Thibault
- Georg Brandl
- R. David Murray
- Serhiy Storchaka