[Python-ideas] Py3 unicode impositions
ubershmekel at gmail.com
Sun Feb 12 14:33:42 CET 2012
On Sun, Feb 12, 2012 at 2:54 PM, Paul Moore <p.f.moore at gmail.com> wrote:
> On 12 February 2012 05:03, Steven D'Aprano <steve at pearwood.info> wrote:
> > Paul Moore wrote:
> >> My concern about Unicode in Python 3 is that the principle is, you
> >> specify the right encoding. But often, I don't *know* the encoding ;-(
> >> Text files, like changelogs as a good example, generally have no
> >> marker specifying the encoding, and they can have all sorts (depending
> >> on where the package came from). Worse, I am on Windows and changelogs
> >> usually come from Unix developers - so I'm not familiar with the
> >> common conventions ("well, of course it's in UTF-8, that's what
> >> everyone uses"...)
> > <raises eyebrow>
> > But you obviously do know the convention -- use UTF-8.
> No. I know that a lot of Unix people advocate UTF-8, and I gather it's
> rapidly becoming standard in the Unix world. But I work on Windows,
> and UTF-8 is not the standard there. I have no idea if UTF-8 is
> accepted cross-platform, or if it's just what has grown as most
> ChangeLog files are written on Unix and Unix users don't worry about
> what's convenient on Windows (no criticism there, just acknowledgement
> of a fact). And I have seen ChangeLog files with non-UTF-8 encodings
> of names in them. I have no idea if that's a bug or just a preference
> - and anyway, "be permissive in what you accept" applies...
Windows NT started out with UCS-2, and since Windows 2000 it has used UTF-16
internally. It was an uplifting thought that Unicode is just 2 bytes per
character, so they did a huge refactoring of the entire Windows API
(CreateFileA/CreateFileW etc.), thinking they wouldn't have to worry about it
again. Nowadays Windows internals have the worst of all worlds: a variable
character length, an uncommon Unicode format, and twice the API to maintain.
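To make the variable-length point concrete, here is a quick Python 3 sketch. The helper name utf16_units and the sample characters are my own picks for illustration; nothing here is a Windows API. It counts how many 16-bit code units each character needs in UTF-16, next to its byte count in UTF-8:

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    # "utf-16-le" is used so no byte-order mark is prepended to the output.
    return len(ch.encode("utf-16-le")) // 2

# Sample characters (arbitrary choices): Latin A, e-acute, euro sign,
# and a musical G clef, which lies outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    print("U+%04X  utf16_units=%d  utf8_bytes=%d"
          % (ord(ch), utf16_units(ch), len(ch.encode("utf-8"))))
```

Anything outside the Basic Multilingual Plane takes a surrogate pair, so code that assumes one 16-bit unit per character breaks on it.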
Notepad can open and save UTF-8 files perfectly, much like most other
UTF-8 is the internet standard and I suggest we keep that fact crystal
clear. UTF-8 is the go-to codec; it is the convention.
It's OK to use other codecs for whatever reasons, constraints, use cases,
etc., but these are all exceptions to the convention: UTF-8.
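For the ChangeLog case, a minimal sketch of "UTF-8 first, everything else is the exception" could look like the following in Python 3. The function names decode_guess and read_text, and the choice of cp1252 as the fallback, are assumptions made for illustration, not an established recipe:

```python
def decode_guess(data, fallbacks=("cp1252",)):
    """Decode bytes as UTF-8 if possible, else try the fallback codecs.

    Returns (text, codec_label). The fallback list is an assumption;
    pick whatever legacy encodings you actually expect to meet.
    """
    for codec in ("utf-8",) + tuple(fallbacks):
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: be permissive and replace undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def read_text(path, fallbacks=("cp1252",)):
    """Read a file in binary mode and decode it with decode_guess."""
    with open(path, "rb") as f:
        return decode_guess(f.read(), fallbacks)
```

Trying UTF-8 first is reasonably safe because non-ASCII text in a legacy 8-bit encoding rarely happens to be valid UTF-8, so a successful strict decode is a strong hint. The fallback is still a guess, since an 8-bit codec will accept most byte sequences whether or not it is the right one.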
Yuval (also a Windows dev)