<div dir="ltr"><div class="gmail_quote">On Sun, Feb 12, 2012 at 2:54 PM, Paul Moore <span dir="ltr"><<a href="mailto:p.f.moore@gmail.com">p.f.moore@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On 12 February 2012 05:03, Steven D'Aprano <<a href="mailto:steve@pearwood.info">steve@pearwood.info</a>> wrote:<br>

> Paul Moore wrote:<br>

><br>

>> My concern about Unicode in Python 3 is that the principle is, you<br>

>> specify the right encoding. But often, I don't *know* the encoding ;-(<br>

>> Text files, like changelogs as a good example, generally have no<br>

>> marker specifying the encoding, and they can have all sorts (depending<br>

>> on where the package came from). Worse, I am on Windows and changelogs<br>

>> usually come from Unix developers - so I'm not familiar with the<br>

>> common conventions ("well, of course it's in UTF-8, that's what<br>

>> everyone uses"...)<br>

><br>

><br>

> &lt;raises eyebrow&gt;<br>

><br>

> But you obviously do know the convention -- use UTF-8.<br>

<br>

</div>No. I know that a lot of Unix people advocate UTF-8, and I gather it's<br>

rapidly becoming standard in the Unix world. But I work on Windows,<br>

and UTF-8 is not the standard there. I have no idea if UTF-8 is<br>

accepted cross-platform, or if it's just what has grown as most<br>

ChangeLog files are written on Unix and Unix users don't worry about<br>

what's convenient on Windows (no criticism there, just acknowledgement<br>

of a fact). And I have seen ChangeLog files with non-UTF-8 encodings<br>

of names in them. I have no idea if that's a bug or just a preference<br>

- and anyway, "be permissive in what you accept" applies...<br><br></blockquote><div><br></div><div>Windows NT started with UCS-2, and from Windows 2000 on it has been UTF-16 internally. It was an uplifting thought that Unicode is just 2 bytes per character, so they did a huge refactoring of the entire Windows API (CreateFileA/CreateFileW etc.), thinking they wouldn't have to worry about it again. Nowadays Windows internals have the worst of all worlds - a variable-length encoding, an uncommon Unicode format, and twice the API to maintain.</div>
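<div><br></div><div>To make the "variable char-length" point concrete, here is a small Python 3 sketch. The helper name utf16_units is just something I made up for illustration; it counts how many 16-bit code units a string needs in UTF-16.</div>

```python
def utf16_units(text):
    """Return the number of 16-bit code units needed to encode text as UTF-16."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "A"               # inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # outside the BMP, needs a surrogate pair

print(utf16_units(bmp_char))      # 1
print(utf16_units(astral_char))   # 2
```

<div>So a single character can take either one or two code units, which is exactly the variable-length property that the fixed 2-byte design was supposed to avoid.</div>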

<div><br></div><div>Notepad can open and save UTF-8 files perfectly well, as can most other Windows programs.</div><div><br></div><div>UTF-8 is the internet standard, and I suggest we keep that fact crystal clear. UTF-8 is the go-to codec; it is the convention.</div>
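<div><br></div><div>For the ChangeLog case, one way to follow the convention and still be permissive in what you accept might look like the sketch below. The function name decode_changelog and the cp1252 fallback are my own assumptions for illustration, not anything standard; pick whatever legacy codec fits your files.</div>

```python
def decode_changelog(data, fallback="cp1252"):
    """Decode bytes as UTF-8 first; use a legacy codec only if that fails."""
    try:
        # The convention: assume UTF-8 unless the bytes prove otherwise.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # The exception: a hypothetical legacy codec (cp1252 is an assumption).
        # errors="replace" keeps the call from failing on unmapped bytes.
        return data.decode(fallback, errors="replace")


utf8_bytes = "Andr\u00e9".encode("utf-8")
legacy_bytes = "Andr\u00e9".encode("cp1252")

print(decode_changelog(utf8_bytes) == "Andr\u00e9")    # True
print(decode_changelog(legacy_bytes) == "Andr\u00e9")  # True
```

<div>This works because bytes written in a legacy 8-bit codec with non-ASCII characters are rarely valid UTF-8, so the strict UTF-8 attempt usually fails cleanly instead of silently producing garbage.</div>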

<div><br></div><div>It's OK to use other codecs for whatever reasons, constraints, use cases, etc. But these are all exceptions to the convention - UTF-8.</div><div><br></div><div><br></div><div>Yuval (also a Windows dev)</div>

</div></div>