On Fri, May 29, 2015 at 12:57:16PM -0700, Andrew Barnert via Python-ideas wrote:
Before anyone else engages too deeply in this off-topic discussion, some background: Anatoly wrote to python-list asking for help dealing with a problem where he has a bunch of bytes (file names) which probably represent Russian text but in an unknown legacy encoding, and he wants to round-trip it from bytes to Unicode and back again losslessly.
(Russian is a particularly nasty example, because there are multiple mutually-incompatible Russian encodings in widespread use.)
As far as I can tell, he has been given a solution, or at least a potential one, on python-list, but he either hasn't read it or doesn't like the solutions offered and so is ignoring them.
So there's a real problem hidden here, buried beneath the dramatic presentation of imaginary journalists processing text, but I don't think it's a problem that needs discussing here (at least not unless somebody comes up with a concrete proposal or idea to be discussed).
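For reference, here is a minimal sketch of one way to get that kind of lossless round trip in Python 3. It is not necessarily the solution that was offered on python-list; it assumes the `surrogateescape` error handler (PEP 383) is acceptable, and the sample bytes and the eventual guess of KOI8-R are made up for illustration.

```python
# Hypothetical file name: "Привет.txt" in KOI8-R, but pretend we don't know that.
raw = b"\xf0\xd2\xc9\xd7\xc5\xd4.txt"

# Bytes that are not valid UTF-8 are carried through as lone surrogates
# (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == raw

# Once the real encoding is known (or guessed), decode the original bytes properly.
readable = restored.decode("koi8-r")
```

The intermediate `text` is only safe to pass around inside the program: writing it out with a strict encoder will fail on the surrogates, which is a deliberate reminder that the encoding question is still open.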
A couple more comments follow:
On May 29, 2015, at 01:56, anatoly techtonik email@example.com wrote:
In Python 2 he just copy pastes regular expressions to match the letter and is happy. In Python 3 he needs to convert that text to unicode.
No he doesn't. In Python 3, unless he goes out of his way to open the file in binary mode, or use binary string literals for his regexps, that text is unicode from the moment his code sees it. So he doesn't have to read the docs.
This is not the case when you have to deal with unknown encodings. And from the perspective of people who only have ASCII (or at worst, Latin-1) text, or who don't care about mojibake, Python 2 appears easier to work with. To quote Chris Smith:
"I find it amusing when novice programmers believe their main job is preventing programs from crashing. More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn’t crash is a horrible nightmare."
Python 2's string handling is designed to minimize the chance of getting an exception when dealing with text in an unknown encoding, but the consequence is that it also minimizes the chance of doing the right thing except by accident. In Python 2, you can give me a bunch of arbitrary bytes as a string, and I can read them as text in a sort of ASCII-ish pseudo-encoding, regardless of how inappropriate that is or how much mojibake it generates. It won't raise an exception, which for some people is all that matters.
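The "doesn't crash but is wrong" failure mode is easy to reproduce, even in Python 3, by decoding with the wrong legacy codec. A small sketch; the sample word and the choice of KOI8-R and Windows-1251 are mine, picked only for illustration:

```python
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"
raw = original.encode("koi8-r")

# Wrong guess: every one of these bytes is valid in cp1251, so no exception
# is raised, but the result is mojibake rather than the original text.
garbled = raw.decode("cp1251")
print(garbled == original)  # False

# A stricter codec at least fails loudly on the same bytes.
try:
    raw.decode("utf-8")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True
```

The silent path is the dangerous one: nothing tells you that `garbled` is wrong until a human looks at it.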
Moving to Unicode (in Python 2 or 3) can come as a shock to users who have never had to think about this before. Mojibake is ubiquitous on the Internet, so there is a real problem to be solved. Python 2's string model is not the way to solve it. I don't think there is any "no-brainer" solution which doesn't involve thinking about bytes and encodings, but if Anatoly or anyone else wants to suggest one, we can discuss it.
Now, all those things _are_ still problems for people who use Python
These same issues occur in Python 2 if you exclusively use unicode strings (u"...") instead of the default string type.
So, unless you have a better solution than Python 3's and also have a time machine to go back to 2007, what could you possibly have to propose?
Surely you would have to go back to 1963, when the ASCII standard was first published, so we can skip over the whole mess of dozens of mutually incompatible "extended ASCII" code pages?