<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 2 June 2014 11:14, Steven D'Aprano <span dir="ltr">&lt;<a href="mailto:steve+comp.lang.python@pearwood.info" target="_blank">steve+comp.lang.python@pearwood.info</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="">On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:<br>

</div><div class="">

> I'm currently working on a product that interacts with lots of other<br>

> products. These other products can be using any encoding - but most of<br>

> the functions that interact with I/O assume the system default encoding<br>

> of the machine that is collecting the data. The product has been in<br>

> production for nearly a decade, so there's a lot of pushback against<br>

> changes deep in the code for fear that it will break working systems.<br>

> The fact that they are working largely by accident appears to escape<br>

> them ...<br>

><br>

> FWIW, changing to use iso-latin-1 by default would be the most sensible<br>

> option (effectively treating everything as bytes), with the option for<br>

> another encoding if/when more information is known (e.g. there's often a<br>

> call to return the encoding, and the output of that call is guaranteed<br>

> to be ASCII).<br>

<br>

</div>Python 2 does what you suggest, and it is *broken*. Python 2.7 creates<br>

moji-bake, while Python 3 gets it right:<br></blockquote><div><br></div><div>The purpose of my example was to show a case where no thought was put into encodings - the assumption was that the system encoding and the remote system's encoding would be the same. A lot of the time that is most definitely not the case.</div>

<div><br></div><div>I should also have been clearer that iso-latin-1 as the default would be the right thing to do *in the particular situation I was talking about*, not in the general case. Quite often we won't know the correct encoding until we've executed a command via ssh. Decoding as iso-latin-1 lets us extract the info we need (which will generally be 7-bit ASCII) without any possibility of a decoding error, because every byte value is valid in that encoding. Sure, we may get mojibake, but that's better than the alternative when we don't yet know the correct encoding.</div>
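<div><br></div><div>A minimal sketch of that two-step approach, in Python 3. The sample bytes, the "encoding:" line format and the sniff_encoding helper are all made up for illustration; they are assumptions, not the product's actual code or protocol.</div>

```python
# Hypothetical raw output captured from a remote command. The bytes and
# the "encoding:" line format are assumptions made for this example.
raw = b"encoding: utf-8\nname: caf\xc3\xa9\n"


def sniff_encoding(data):
    """Return the encoding name reported in the data, or None.

    latin-1 maps every byte value 0-255 to a code point, so this
    decode step cannot raise, whatever the real encoding is.
    """
    text = data.decode("latin-1")
    for line in text.splitlines():
        if line.startswith("encoding:"):
            # The reported name is plain ASCII, so it survives intact.
            return line.split(":", 1)[1].strip()
    return None


encoding = sniff_encoding(raw)

# Once the real encoding is known, decode the original bytes again.
# Fall back to latin-1 if the remote side reported nothing.
decoded = raw.decode(encoding or "latin-1")
```

<div>The first pass sees mojibake for the non-ASCII name, but the ASCII encoding name comes through unharmed, which is all that pass needs.</div>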

<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Latin-1 is one of those legacy encodings which needs to die, not to be<br>

entrenched as the default. My terminal uses UTF-8 by default (as it<br>should), and if I use the terminal to input "δжç", Python ought to see<br>what I input, not Latin-1 moji-bake.<br></blockquote><div><br></div>

<div>For some purposes, there needs to be a way to treat an arbitrary stream of bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a convenient way to do that. It's not the only way, but settling on it and being consistent is better than not having a way.</div>
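<div><br></div><div>To illustrate why iso-latin-1 works for this, here is a small Python 3 sketch. The decodes_as_utf8 helper is a hypothetical name used only for the comparison.</div>

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# latin-1 assigns byte N to code point N, so decoding never fails...
text = all_bytes.decode("latin-1")

# ...and encoding the result gives back exactly the original bytes.
round_tripped = text.encode("latin-1")


def decodes_as_utf8(data):
    """Report whether the data is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(round_tripped == all_bytes)
print(decodes_as_utf8(all_bytes))
```

<div>The round trip is lossless for arbitrary bytes, whereas a strict UTF-8 decode of the same data raises an error.</div>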

<div><br></div><div>Tim Delaney </div></div></div></div>