I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.
That sloppiness that you mention is precisely the "unknown encoding" problem.
Exactly -- but from a practicality-beats-purity perspective, there is a difference between "I have no idea whatsoever" and "I know it is mostly ascii, and European, but there are some extra characters in there". Latin-1 has proven very useful for that case. I suppose in most cases ascii with errors='replace' would be a good choice, but I'd still rather not throw out potentially useful information -- see the quick sketch below.
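To make that concrete, here's a rough sketch (the bytes are made up, but typical of "mostly ascii" data):

```python
# Decoding unknown mostly-ascii bytes with ascii/errors='replace' keeps the
# code running but destroys the non-ASCII bytes; Latin-1 keeps them, even if
# their meaning is uncertain.
raw = b"caf\xe9 au lait"   # one stray non-ASCII byte in otherwise ASCII data

print(raw.decode("ascii", errors="replace"))   # 'caf\ufffd au lait' -- byte lost
print(raw.decode("latin-1"))                   # 'café au lait' -- byte preserved

# Latin-1 round-trips any byte string exactly, so nothing is thrown away:
assert raw.decode("latin-1").encode("latin-1") == raw
```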
Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-)
Yeah, I've been very unfocused in this discussion -- sorry about that.
Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
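Here's a small sketch of what I mean (the names and the 'S6' width are just made up for illustration):

```python
import numpy as np

# Latin-1 is a single-byte encoding, so "mostly ASCII plus a few accented
# characters" still fits a fixed-width bytes array, one byte per character.
names = ["Jose", "José", "Müller"]

# Encoding with ASCII fails on the accented characters:
# names[1].encode("ascii")  # -> UnicodeEncodeError

# ...while Latin-1 stores each character in exactly one byte.
arr = np.array([n.encode("latin-1") for n in names], dtype="S6")
print(arr)                                  # [b'Jose' b'Jos\xe9' b'M\xfcller']

# Round-trip back to text with no loss:
print([b.decode("latin-1") for b in arr])   # ['Jose', 'José', 'Müller']
```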
For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-)
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model Python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy-to-access numpy string dtypes, and particularly the defaults, should match it.

It's become clear in this discussion that there is a strong desire to support a numpy dtype that stores text in particular binary formats (i.e. encodings). Rather than choose one or two, we might as well support all encodings supported by Python. In that case, we'll have utf-8 for those that know they want that, and we'll have latin-1 for those that incorrectly think they want that :-)

So what remains to be decided is implementation, syntax, and defaults. Let's keep in mind that most of us on this list, and in this discussion, are the folks who write interface code and the like. But most numpy users are not as tuned in to the internals, so defaults should be set to best support the more "naive" user.
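Coming back to the text-model point above, a toy illustration of the mismatch (the string is made up):

```python
# Python's str is indexed by character, while UTF-8 storage is indexed by
# byte, and the two disagree as soon as any non-ASCII character appears.
s = "naïve"

print(len(s))                    # 5 characters
print(len(s.encode("utf-8")))    # 6 bytes -- 'ï' takes two bytes in UTF-8
print(len(s.encode("latin-1")))  # 5 bytes -- one byte per character

# So a fixed-width UTF-8 field of N bytes does not hold N characters,
# whereas a Latin-1 field does.
```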
If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings.
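For instance (the bytes here are invented, but the pattern is common with data that is actually cp1252):

```python
# Latin-1 silently accepts any bytes -- including bytes that were never
# Latin-1 -- while UTF-8 simply blows up on them.
mystery = b"\x93smart quotes\x94"   # actually cp1252, but the reader doesn't know that

# mystery.decode("utf-8")          # -> UnicodeDecodeError
print(mystery.decode("latin-1"))   # "succeeds", but yields C1 control characters
print(mystery.decode("cp1252"))    # the intended text: “smart quotes”
```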
If we add every encoding known to man, someone is going to use Latin-1 to read unknown encodings. Indeed, as we've all pointed out, there is no correct encoding with which to read unknown encodings.

Frankly, if we have UTF-8 under the hood, I think people are even MORE likely to use it inappropriately -- it's quite scary how many people think UTF-8 == Unicode, and think all you need to do is "use utf-8" and you don't need to change any of the rest of your code. Oh, and once you've done that, you can use your existing ASCII-only tests and think you have a working application :-)

-CHB