[Python-Dev] Unicode print/stdoutput exceptions are not nice
Robert Kiendl
rkiendl at gmx.net
Thu Jan 12 11:18:43 CET 2006
<1137059888.538391.119110 at o13g2000cwo.googlegroups.com>
Martin v. Löwis wrote:
> Robert wrote:
> > is in a PythonWin Interactive session - ok results for cyrillic chars
> > (tolerant mbcs/utf-8 encoding!).
> > But if I do this on the Windows console (as you probably mean), I also get
> > encoding errors - no matter whether I "chcp 1251", because the cyrillic
> > chars still raise the encoding errors.
>
> If you do "chcp 1251" (not "chcp1251") in the console, and then
> run python.exe in the same console, what is the value of
> sys.stdout.encoding?
Done correctly, it is 'cp1252' in my case - and cyrillic chars still break
"print". (On a PC Linux 2.2 tty with py24, sys.stdout.encoding is None (?);
is only locale.getdefaultlocale() useful there?)
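(For reference, this is roughly the check I do - a minimal sketch; the
values in the comments are just what I happen to see on my machines:)

import sys, locale
print getattr(sys.stdout, 'encoding', None)  # 'cp1252' on my Win console; None on the linux tty / a pipe
print sys.getdefaultencoding()               # normally 'ascii'
print locale.getdefaultlocale()[1]           # locale charset; may also be None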
I live with this in site(customize):

import sys, codecs
_stdout = sys.stdout
if sys.platform == 'win32' and not sys.modules.has_key('pywin.framework.startup'):
    # plain Windows console: use its encoding if present and known
    _stdoutenc = getattr(_stdout, 'encoding', None) or sys.getdefaultencoding()
    try: codecs.lookup(_stdoutenc)
    except LookupError: _stdoutenc = sys.getdefaultencoding()
    class StdOut:
        def write(self, s):
            # never raise on unknown chars - escape them instead
            _stdout.write(s.encode(_stdoutenc, 'backslashreplace'))
    sys.stdout = StdOut()
elif sys.platform.startswith('linux'):
    import locale
    # the locale charset may be None (e.g. POSIX locale), so fall back
    _stdoutenc = locale.getdefaultlocale()[1] or sys.getdefaultencoding()
    try: codecs.lookup(_stdoutenc)
    except LookupError: _stdoutenc = sys.getdefaultencoding()
    class StdOut:
        def write(self, s):
            _stdout.write(s.encode(_stdoutenc, 'backslashreplace'))
    sys.stdout = StdOut()
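With that in place, printing unicode that the console encoding cannot
represent degrades instead of raising - roughly (assuming a cp1252 console):

print u'\u0431\u0443\u043a\u0432\u044b'
# prints the \uXXXX escapes instead of raising UnicodeEncodeError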
> > I think this is not a good behaviour of python to be so picky.
>
> I think it is good.
>
> Errors should never pass silently.
> Unless explicitly silenced.
A political question. Arguments:
* Web browsers, for example, have to display defective HTML as well as
possible, render unknown unicode chars as "?", and so on. Users got very
angry in the early days of browsers when 'strict' programmers popped up
their exception error boxes ...
* at least the "print" statement has to go through - the costs (for
angry users and developers; e.g.
http://blog.ianbicking.org/do-i-hate-unicode-or-do-i-hate-ascii.html)
are much higher when apps suddenly break on simple print/display output
because the system picks up alien unicode chars somewhere (e.g.
occasionally in filenames, ...); see the small example after this list.
No one is really angry when chinese chars are occasionally displayed
cryptically on non-chinese computers. One can investigate, add fonts, ...
to improve things, or do nothing in most cases - but apps should not break
on every print statement. This is true not only for tty output, but also
for log-file redirects and almost any common situation of print / normal
stdout / file-(write) output.
* anything is nicely printable in python by default - why not unicode
strings!? If the decision for a default 'strict' encoding on stdout
stands, we at least have to discuss a print-repr for unicode.
* the need for having technical strings 'strict' is much rarer, and
programmers are very aware of it in such situations anyway, e.g. via
asciifile.write( us.encode(xy,'strict') ).
* on Windows, for example, the (good) mbcs encoding is tolerant anyway:
unknown chars are mapped to '?'. I never had any objection to this.
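A minimal illustration of the point about breaking print (assuming a
target encoding like cp1252 that has no cyrillic chars):

s = u'report_\u0431\u0443\u043a\u0432\u044b.txt'  # e.g. a filename picked up from the system
s.encode('cp1252')             # raises UnicodeEncodeError - and so does "print s"
s.encode('cp1252', 'replace')  # -> 'report_?????.txt' - degraded, but the app keeps running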
Some recommendations - soft to hard:
* make the print-repr for unicode strings tolerant (and in PythonWin
always use the tolerant 'mbcs' encoding)
* make stdout/files have 'replace'-mode encoding by default (similar
to what my code above does)
* let site.py's encoding setting accept a tuple, e.g.
encoding = ('ascii', 'replace')  # if not utf-8/mbcs/locale
* keep sys.setdefaultencoding available by default (I save a reference
before site.py deletes it)
* I would also live perfectly well with .encode(enc) using 'replace' by
default and 'strict' on demand (see the sketch after this list). None of
my apps and scripts would break because of this; they would only win. A
programmer is naturally very aware when he wants 'strict'. Can you name
realistic cases where 'replace' behaviour would be so critical that a
program damages something?
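As a sketch of that last recommendation (encode_tolerant is only a
hypothetical name, not an existing API):

import sys, codecs

def encode_tolerant(u, enc, errors='replace'):
    # like u.encode(enc, errors), but with 'replace' as the default and a
    # fallback when the requested codec is unknown; 'strict' only on demand
    try:
        codecs.lookup(enc)
    except LookupError:
        enc = sys.getdefaultencoding()
    return u.encode(enc, errors)

encode_tolerant(u'\u0431\u0443\u043a\u0432\u044b', 'cp1252')            # -> '?????'
encode_tolerant(u'\u0431\u0443\u043a\u0432\u044b', 'cp1252', 'strict')  # still raises, explicitly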
> > In
> > 1136925967.990106.299760 at g44g2000cwa.googlegroups.com I showed how I
> > solved this so far. Any better/portable idea?
>
> Not sure why you aren't using sys.stdout.encoding on Linux. I would do
>
> try:
>     c = codecs.getwriter(sys.stdout.encoding)
> except:
>     c = codecs.getwriter('ascii')
> sys.stdout = c(sys.stdout, 'replace')
>
> Also, I wouldn't edit site.py, but instead add sitecustomize.py.
I have more problems with the shape of sys.path in different situations,
with multiple sitecustomize.py files from other apps, environments,
OSes/users, cx_Freeze, py2exe, ... sitecustomize is not easily stackable:
a horror solution. What is needed is a callable _function_, or a general
change in python's behaviour.
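Something like the following callable would be enough for my case
(set_tolerant_stdout is only a sketch of what such a function could look
like, not an existing API):

import sys, codecs

def set_tolerant_stdout(errors='backslashreplace'):
    # wrap sys.stdout so that printing unicode never raises; fall back to
    # the default encoding if stdout has none or an unknown one
    enc = getattr(sys.stdout, 'encoding', None) or sys.getdefaultencoding()
    try:
        codecs.lookup(enc)
    except LookupError:
        enc = sys.getdefaultencoding()
    sys.stdout = codecs.getwriter(enc)(sys.stdout, errors)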
Modifying site.py is better and more stable for me (I have my
patch/module todo list handy each time I install a new python), as I
always want the tolerant behaviour. In code I check for
site.encoding/_setdefaultencoding (I save this). Thus I get one central
error if the setup is not correct, and not evil unicode errors somewhere
deep in the app once it runs on a russian computer in the future...
> > Yes. But the original problem is that occasionally unicode strings
> > (filenames in my case) arise which are not representable in the local
> > platform encoding, but have to be displayed (in 'replace' encoding mode)
> > without breaking the app flow. That's the pain of the default behaviour
> > of current python - and there is no simple switch. Why should "print
> > xy" not print something _always_, as well and as far as possible?
>
> Because the author of the application wouldn't know that there
> is a bug in the application, and that information was silently
> discarded. Users might only find out much later that they have
> question marks in places where users originally entered data,
> and they would have no way of retrieving the original data.
>
> If you can accept that data loss: fine, but you should silence
> the errors explicitly.
This is black-and-white theory - not real and practical (as python wants
to be). See above.
Robert