<div dir="ltr">On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer <<a href="mailto:shoyer@gmail.com">shoyer@gmail.com</a>> wrote:<br>><br>> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <<a href="mailto:chris.barker@noaa.gov">chris.barker@noaa.gov</a>> wrote:<br>>>><br>>>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.<br>>><br>>> I _think_ that is what using latin-1 (or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through.<br>><br>> I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters."<br>><br>> My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data?<br><br>I don't know that we can reasonably make that accounting relevant. Number of such characters per byte of text? Number of files with such characters out of all existing files?<div><br></div><div>What I can say with assurance is that every time I have decided, as a developer, to write code that just hardcodes latin-1 for such cases, I have regretted it. While it's just a personal anecdote, I think it's at least measuring the right thing. :-)</div><div><br>--<br>Robert Kern</div></div>