Re: [Python-ideas] Py3 unicode impositions

On Fri, Feb 10, 2012 at 3:41 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
> Nor is there in 3.x.
    f = open('text-file.txt')
    for line in f:
        pass
is an imposition. That doesn't happen in 2.x (for the wrong reasons, but it's very convenient 95% of the time).
I may be missing something, but as best I can tell:

(1) That uses an implicit encoding of None.
(2) encoding=None is documented as being platform-dependent.

Are you saying that some (many? all?) platforms make a bad choice there? Does that only happen when sys.getdefaultencoding() != sys.getfilesystemencoding(), or when one of them gives bad information? (FWIW, on a mostly ASCII windows machine, the default is utf-8 but the filesystem encoding is mbcs, so merely being different doesn't always provoke problems.)

Would it cause problems to make the default be whatever locale returns, or whatever it returns the first time open is called?

-jJ
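For reference, a minimal sketch of how to inspect the defaults in question on a given machine (assuming CPython 3.x):

    # open(..., encoding=None) picks locale.getpreferredencoding(False) for
    # file *contents*; sys.getfilesystemencoding() only governs file *names*.
    import locale
    import sys

    print("preferred (content) encoding:", locale.getpreferredencoding(False))
    print("filesystem (name) encoding:  ", sys.getfilesystemencoding())
    print("default str<->bytes codec:   ", sys.getdefaultencoding())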

Jim Jewett writes:
Are you saying that some (many? all?) platforms make a bad choice there?
No. I'm saying that whatever choice is made (except for 'latin-1', because it accepts all bytes regardless of the actual encoding of the data, or PEP 383 "errors='surrogateescape'" for the same reason, both of which are unacceptable defaults for production code *for the same reason*), there is data that will cause that idiom to fail on Python 3 where it would not on Python 2. This is especially the case if you work with older text data on Mac or modern Linux where UTF-8 is used, because you're almost certain to run into Latin-1-encoded files.

My favorite example is ChangeLogs, which broke my Gentoo package manager when I experimented with using Python 3 as the default Python. Most packages would work fine, but for some reason some Python program in the PMS was actually reading the ChangeLogs, and sometimes they'd be impure ASCII (I don't recall whether it was utf-8 or latin-1), giving a fatal UnicodeError and grinding everything to a halt. That is reason enough for the naive to embrace fear, uncertainty, and doubt about Python 3's use of Unicode.

The fact is that with a little bit of knowledge, you can almost certainly get more reliable (and in case of failure, more debuggable) results from Python 3 than from Python 2. But people are happy to deal with the devil they know, even though it's more noxious than the devil they don't. Counteracting FUD with words generally doesn't work IME, unless the words are a "magic spell" that reduces the unknown to the known.

On 11 February 2012 04:12, Stephen J. Turnbull <stephen@xemacs.org> wrote:
My concern about Unicode in Python 3 is that the principle is, you specify the right encoding. But often, I don't *know* the encoding ;-( Text files, like changelogs as a good example, generally have no marker specifying the encoding, and they can have all sorts (depending on where the package came from). Worse, I am on Windows and changelogs usually come from Unix developers - so I'm not familiar with the common conventions ("well, of course it's in UTF-8, that's what everyone uses"...)

In Python 2, I can ignore the issue. Sure, I can end up with mojibake, but for my uses, that's not a disaster. Mostly-readable works. But in Python 3, I get an error and can't process the file. I can just use latin-1, or surrogateescape. But that doesn't come naturally to me yet. Maybe it will in time... Or maybe there's a better solution I don't know about yet.

To be clear - I am fully in favour of the Python 3 approach, and I completely support the idea that people should know the encodings of the stuff they are working with (I've seen others naively make encoding mistakes often enough to know that when it matters, it really does matter). But having to worry, not so much about the encoding to use, but rather about the fact that Python is asking you a question you can't answer, is a genuine stumbling block. And from what I've seen, it's at the root of the problems many people have with Unicode in Python 3.

I'm not arguing for changes to the default behaviour of Python 3. But if we had a good place to put it, a FAQ entry about "what to do if I need to process a file whose encoding I don't know" would be useful. And certainly having a standard answer that people could give when the question comes up (something practical, not a purist answer like "all files have an encoding, so you should find out") would help.

Paul.

On Feb 11, 2012 12:41 PM, "Paul Moore" <p.f.moore@gmail.com> wrote
I think if the bytes type behaved exactly like python2's string it would have been the best option. When you work with "wb" or "rb" you get quite a hint that you're doing it wrong. But devs would have a viable ambiguous *string* type (vs bytes and their integer cells). Yuval

On Feb 11, 2012, at 12:40 AM, Paul Moore wrote:
I'm confused what you're asking for. Setting errors to surrogateescape or encoding to Latin-1 causes Python 3 to behave the exact same way as Python 2: it's doing the "wrong" thing and may result in mojibake, but at least it isn't screwing up anything new so long as the stuff you add to the file is in ASCII. The only way to make Python 3 slightly more like Python 2 would be to set errors="surrogateescape" by default instead of asking the programmer to know to use it. I think that would be going too far, but it could be done. I think it would be simpler though to just publicize errors="surrogateescape" more. "Dear people who don't care about encodings and don't want to take the time to get them right, just put errors='surrogateescape' into your open commands and Python 3 will behave almost exactly like Python 2. The end." Is that really so hard? I'm confused about what else people want.

On Sun, Feb 12, 2012 at 1:19 PM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
An open_ascii() builtin isn't as crazy as it may initially sound - it's not at all uncommon to have a file that's almost certainly in some ASCII compatible encoding like utf-8, latin-1 or one of the other extended ASCII encodings, but you don't know which one specifically. By offering open_ascii(), we'd be making it trivial to process such files without blowing up (or having to figure out exactly *which* ASCII compatible encoding you have). When you wrote them back to disk, if you'd added any non-ASCII chars of your own, you'd get a UnicodeEncodeError, but any encoded data from the original would be reproduced in the original encoding. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
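A minimal sketch of the round trip described above, assuming Python 3 and a file 'mystery.txt' in some unknown ASCII-compatible encoding (the file name is illustrative only):

    # Read: undecodable bytes become lone surrogate code points instead of
    # raising UnicodeDecodeError.  newline='' keeps line endings untouched.
    with open('mystery.txt', encoding='ascii', errors='surrogateescape',
              newline='') as f:
        text = f.read()

    text += "appended ASCII only\n"   # adding non-ASCII here would fail below

    # Write: the smuggled surrogates turn back into the original bytes, so the
    # untouched parts of the file come back out in their original encoding.
    with open('mystery.txt', 'w', encoding='ascii', errors='surrogateescape',
              newline='') as f:
        f.write(text)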

Nick Coghlan wrote:
To me, "open_ascii" suggests either:

- it opens ASCII files, and raises an error if they are not ASCII; or
- it opens non-ASCII files, and magically translates their content to ASCII using some variant of "The Unicode Hammer" recipe: http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hamme...

We should not be discouraging developers from learning even the most trivial basics of Unicode. I'm not suggesting that we try to force people to become Unicode experts (they wouldn't, even if we tried) but making this a built-in is dumbing things down too much. I don't believe that it is an imposition for people to explicitly use open(filename, encoding='ascii', errors='surrogateescape') if that's what they want. If they want open_ascii, let them define this at the top of their modules:

    open_ascii = (lambda name: open(name, encoding='ascii', errors='surrogateescape'))

A one liner, if you don't mind long lines.

I'm not entirely happy with the surrogateescape solution, but I can see it's possibly the least worst *simple* solution for the case where you don't know the source encoding. (Encoding guessing heuristics are awesome but hardly simple.) So put the recipe in the FAQs, in the docs, and the docstring for open[1], and let people copy and paste the recipe. That's a pretty gentle introduction to Unicode.

[1] Which is awfully big and complex in Python 3.1, but that's another story.

-- Steven

On Sun, Feb 12, 2012 at 3:26 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yeah, it didn't take long for me to come back around to that point of view, so I morphed http://bugs.python.org/issue13997 into a docs bug about clearly articulating the absolute bare minimum knowledge of Unicode needed to process text in a robust cross-platform manner in Python 3 instead. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
+1

I think (as I've said more verbosely elsewhere) that there are two common use cases, corresponding to two different definitions of "robust text processing".

(1) Use cases where you would rather risk occasionally corrupting non-ASCII text than risk *any* UnicodeErrors at all *anywhere*. They use encoding='latin-1'.

(2) Use cases where you do not want to deal with encodings just to "pass through" non-ASCII text, but do want that text preserved enough to be willing to risk (rare) UnicodeErrors or validation errors from pedantic Unicode-oriented modules. They use encoding='ascii', errors='surrogateescape'.
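A small sketch contrasting the two recipes on the same data (the byte string below is latin-1 for "café" and is purely illustrative):

    data = b'caf\xe9'

    # (1) Never raises, but silently relabels every byte as a Latin-1 character.
    text1 = data.decode('latin-1')             # 'café'
    assert text1.encode('latin-1') == data     # round-trips exactly

    # (2) Preserves the bytes as lone surrogates; fails loudly only if they
    #     reach a strict codec or a pedantic Unicode-oriented consumer.
    text2 = data.decode('ascii', errors='surrogateescape')    # 'caf\udce9'
    assert text2.encode('ascii', errors='surrogateescape') == data
    try:
        text2.encode('utf-8')                  # strict codec rejects surrogates
    except UnicodeEncodeError:
        pass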

Paul Moore wrote:
<raises eyebrow> But you obviously do know the convention -- use UTF-8.
So why don't you use UTF-8? As far as those who actually don't know the convention, isn't it better to teach them the convention "use UTF-8, unless dealing with legacy data" rather than to avoid dealing with the issue by using errors='surrogateescape'? I'd hate for "surrogateescape" to become the One Obvious Way for dealing with unknown encodings, because this is 2012 and people should be more savvy about non-ASCII characters by now. I suppose it's marginally better than just throwing them away with errors='ignore', but still. I recently bought a book from Amazon UK. It was £12 not \udcc2\udca312. This isn't entirely a rhetorical question. I'm not on Windows, so perhaps there's a problem I'm unaware of. -- Steven

On 12 February 2012 05:03, Steven D'Aprano <steve@pearwood.info> wrote:
No. I know that a lot of Unix people advocate UTF-8, and I gather it's rapidly becoming standard in the Unix world. But I work on Windows, and UTF-8 is not the standard there. I have no idea if UTF-8 is accepted cross-platform, or if it's just what has grown as most ChangeLog files are written on Unix and Unix users don't worry about what's convenient on Windows (no criticism there, just acknowledgement of a fact). And I have seen ChangeLog files with non-UTF-8 encodings of names in them. I have no idea if that's a bug or just a preference - and anyway, "be permissive in what you accept" applies...

Get beyond ChangeLog files and it's anybody's guess. My PC has text files from many, many places (some created on my PC, some created by others on various flavours and ages of Unix, and some downloaded from who-knows-where on the internet). Not one of them comes with an encoding declaration. Of course every file is encoded in some way. But it's incredibly naive to assume the user knows that encoding. Hey, I still have to dump out the content of files to check the line ending convention when working in languages other than Python - universal newlines saves me needing to care about that, why is it so disastrous to consider having something similar for encodings?
Decoding errors.
Fair comment. My point here is that I *am* dealing with "legacy" data in your sense. And I do so on a day to day basis. UTF-8 is very, very rare in my world (Windows). Latin-1 (or something close) is common. There is no cross-platform standard yet. And probably won't be until Windows moves to UTF-8 as the standard encoding. Which ain't happening soon.
I think people are much more aware of the issues, but cross-platform handling remains a hard problem. I don't wish to make assumptions, but your insistence that UTF-8 is a viable solution suggests to me that you don't know much about the handling of Unicode on Windows. I wish I had that luxury...
I recently bought a book from Amazon UK. It was £12 not \udcc2\udca312.
£12 in what encoding? :-)
This isn't entirely a rhetorical question. I'm not on Windows, so perhaps there's a problem I'm unaware of.
I think that's the key here. Even excluding places that don't use the Roman alphabet, Windows encoding handling is complex. CP1252, CP850, Latin-1, Latin-14 (Euro zone), UTF-16, BOMs. All are in use on my PC to some extent. And that's even without all this foreign UTF-8 I get from the Unix guys :-) Apart from the blasted UTF-16, all of it's "ASCII most of the time". Paul.

Paul Moore, 12.02.2012 13:54:
Latin-1, Latin-14 (Euro zone)
OT-remark: I assume you meant ISO8859-15 (aka. Latin-9) here. However, that's not for the "Euro zone", it's just Latin-1 with the Euro character wangled in and a couple of other changes. It still lacks characters that are commonly used by languages within the Euro zone, e.g. the Slovenian language (a Slavic descendant), but also Gaelic or Welsh. https://en.wikipedia.org/wiki/ISO/IEC_8859-15#Coverage https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Languages_commonly_supported_bu... Stefan

On Sun, Feb 12, 2012 at 2:54 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Windows NT started with UCS-2 and from Windows 2000 it's UTF-16 internally. It was an uplifting thought that unicode is just 2 bytes per letter, so they did a huge refactoring of the entire windows API (ReadFileA/ReadFileW etc) thinking they wouldn't have to worry about it again. Nowadays windows INTERNALS have the worst of all worlds - a variable char-length, uncommon unicode format, and twice the API to maintain. Notepad can open and save utf-8 files perfectly, much like most other windows programs.

UTF-8 is the internet standard and I suggest we keep that fact crystal clear. UTF-8 is the go-to codec, it is the convention. It's ok to use other codecs for whatever reasons, constraints, use cases, etc. But these are all exceptions to the convention - UTF8.

Yuval (Also a windows dev)

On 2/12/2012 7:54 AM, Paul Moore wrote:
No. I know that a lot of Unix people advocate UTF-8, and I gather it's rapidly becoming standard in the Unix world. But I work on Windows,
Unicode and utf-8 are a standard for the world, not Unix. UTF-8 surpassed us-ascii as the most used character encoding for the WWW about 4 years ago. https://en.wikipedia.org/wiki/ASCII

XML is unicode based. I think it fair to say that UTF-8 (and UTF-16) are preferred encodings, as 'Encodings other than UTF-8 and UTF-16 will not necessarily be recognized by every XML parser' https://en.wikipedia.org/wiki/Xml#Encoding_detection OpenDocument is one of many xml-based formats. Any modern database program that intends to store arbitrary text must store unicode (or at least the BMP subset). So any text-oriented Windows program that gets input from the rest of the world has to handle unicode and at least the utf-8 encoding thereof.

My impression is that Windows itself now uses unicode for text storage. It is a shame that it still somewhat hides that by using limited subset codepage facades.

None of this minimizes the problem of dealing with text in the multiplicity of national and language encodings. But that is not the fault of unicode, and unicode makes dealing with multiple encodings at the same time much easier. It is too bad that unicode was only developed in the 1990s instead of the 1960s.

-- Terry Jan Reedy

Paul Moore writes:
It is. All of Microsoft's programs (and I suppose most third-party software, too) that I know of will happily import UTF-8-encoded text, and produce it as well. Most Microsoft-specific file formats (eg, Word) use UTF-16 internally, but they can't be read by most text-oriented programs, so in practice they're application/octet-stream. The problem is the one you point out: files you receive from third parties are still fairly likely to be in a non-Unicode encoding.
True. But for personal use, and for communicating with people you have some influence over, you can use/recommend UTF-8 safely as far as I know. I occasionally get asked by Japanese people why files I send in UTF-8 are broken; it invariably turns out that they sent me a file in Shift JIS that contained a non-JIS (!) character and my software translated it to REPLACEMENT CHARACTER before sending as UTF-8.
I don't understand what you mean by that. Windows doesn't make handling any non-Unicode encodings easy, in my experience, except for the local code page. So, OK, if you're in a monolingual Windows environment (eg, the typical Japanese office), everybody uses a common legacy encoding for file exchange (including URLs and MIME filename= :-(, in particular Shift JIS), and only that encoding works well (ie, without the assistance of senior tech support personnel). Handling Unicode, though, isn't really an issue; all of Microsoft's programs happily deal with UTF-8 and UTF-16 (in its several varieties).
Indeed. Do you really see UTF-16 in files that you process with Python?

Sorry for the self-reply, but this should be clarified. Stephen J. Turnbull writes:
Ie, the breakage that you're likely to encounter in using UTF-8 wherever possible is *very* minor, and typically related to somebody else failing to conform to standards.

On 13 February 2012 05:42, Stephen J. Turnbull <stephen@xemacs.org> wrote:
If I create a new text file in Notepad or Vim on my PC, it's not created in UTF-8 by default. Vim uses Latin-1, and Notepad uses "ANSI" (which I'm pretty sure translates to CP1252, but there are so few differences between this and latin-1 that I can't easily test this at the moment). If I do "chcp" on a console window, I get codepage 850, and in CMD, "echo a£b >file.txt" encodes the file in CP850. "echo a£b >file.txt" in Powershell creates little-endian UTF-16 with a BOM.

The out-file cmdlet in Powershell (which lets me specify an encoding to override the UTF-16 of the standard redirection) says this about the encoding parameter:

    -Encoding <string>
        Specifies the type of character encoding used in the file. Valid values
        are "Unicode", "UTF7", "UTF8", "UTF32", "ASCII", "BigEndianUnicode",
        "Default", and "OEM". "Unicode" is the default. "Default" uses the
        encoding of the system's current ANSI code page. "OEM" uses the current
        original equipment manufacturer code page identifier for the operating
        system.

With this I can at least get UTF-8 (with BOM). But it's a long way from simple to do so...

Basically, in my experience, Windows users are not likely to produce UTF-8 formatted files unless they make specific efforts to do so. I have heard anecdotal evidence that attempts to set the configuration on Windows to produce UTF-8 by default hit significant issues. So don't expect to see Windows users producing UTF-8 by default anytime soon.
The problem is the one you point out: files you receive from third parties are still fairly likely to be in a non-Unicode encoding.
And, if I don't concentrate, I produce non-UTF8 files myself. The good news is that Python 3 generally works fine with files I produce myself, as it follows the system encoding.
Near enough, as the only character I tend to use is £, and latin-1 and cp1252 concur on that (and I know what CP850 £ signs look like in latin-1/cp1252, so I can spot that particular error). Of course, that means that processing UTF-8 always needs me to explicitly set the encoding. Which in turn means that (if I care - back to the original point) I need to go checking for non-ASCII characters, do a quick hex dump to check they look like utf-8, and set the encoding. Or go with the default and risk mojibake (cp1252 is not latin-1 AIUI, so won't roundtrip bytes). Or go the "don't care" route. All of this simply because I feel that it's impolite to corrupt someone's name in my output just because they have an accented letter in their name :-)

As I say:
- I know what to do
- It can be a lot of work
- Frankly, the damage is minor (these are usually personal or low-risk scripts)
- The temptation to say "stuff it" and get on with my life is high
- It frustrates me that Python by default tempts me to *not* do the right thing

Maybe the answer is to have some form of encoding-detection function in the standard library. It doesn't have to be 100% accurate, and it certainly shouldn't be used anywhere by default, but it would be available for people who want to do the right thing without over-engineering things totally.
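One rough way to automate that manual check, sketched here with an assumed file name and an assumed cp1252 fallback: try UTF-8 first (it fails loudly on most non-UTF-8 data) and fall back only if that raises.

    def read_text_guessing(path, fallback='cp1252'):
        # Not a real encoding detector -- just "UTF-8 if it decodes, else the
        # local legacy code page", which matches the hex-dump-and-eyeball test.
        with open(path, 'rb') as f:
            raw = f.read()
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return raw.decode(fallback)

    content = read_text_guessing('ChangeLog')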
Maybe it's different in Japan, where character sets are more of a common knowledge issue? But if I tried to say to one of my colleagues that the spooled output of a SQL query they sent me (from a database with one encoding, through a client with no real encoding handling beyond global OS-level defaults) didn't use UTF-8, I'd get a blank look at best. I've had to debug encoding issues for database programmers only to find that they don't even know what encodings are about - and they are writing multilingual applications! (Before someone says, yes, of course this is terrible, and shouldn't happen - but it does, and these are the places I get weirdly-encoded text files from...)
What I was trying to say was that typical Windows environments (where people don't interact often with Unix utilities, or if they do it's with ASCII characters almost exclusively) hide the details of Unicode from the end user to the extent that they don't know what's going on under the hood, and don't need to care. Much like Python 2, I guess :-)
Indeed. Do you really see UTF-16 in files that you process with Python?
Powershell generates it. See above. But no, not often, and it's easy to fix. Meh, for easy read

    cmd /c "iconv -f utf-16 -t utf-8 u1 >u2"

or

    set-content u2 (get-content u1) -encoding utf8

if I don't mind a BOM. No, Unicode on Windows isn't easy :-(

Paul

Paul Moore wrote:
Encoding guessers have their place, but they should only be used by those who know what they're getting themselves into. http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx Note that even Raymond Chen makes the classic error of conflating encodings (UTF-16) with Unicode. +0 on providing an encoding guesser, but -1 on making it operate by default. -- Steven

Paul Moore writes:
Basically, In my experience, Windows users are not likely to produce UTF-8 formatted files unless they make specific efforts to do so.
Agreed. All I meant was that if you make the effort to do so, your Windows-based correspondents will be able to read it, and vice versa.
Please don't blame it on Python. Python tempts you because it offers the choice to do it right. There is no way that Python can do it right *for* you, not even all the resources Microsoft or Apple can bring to bear have managed to do it right (you can't get 100% even within an all-Windows or all-Mac shop, let alone cross-platform). Not yet; it requires your help. Thanks for caring!<wink/>
Maybe it's different in Japan, where character sets are more of a common knowledge issue?
Mojibake is common knowledge in Japan; what to do about it requires a specialized technical background.
Again, this is not the direction I have in mind (I'm thinking more in terms of the RightThinkingAmongUs using UTF-8 as much as possible, and whether the recipients will be able to read it -- AFAICT/IME they can), and you certainly shouldn't presume that your correspondents "should" "already" be using UTF-8. That would be seriously rude on Windows, where as you point out one has to do something rather contorted to produce UTF-8 in most applications.
Ah. If you're in a monolingual environment, yes, it works that way. But it works just as well on Unix if you set LANG appropriately in your environment.
Much like Python 2, I guess :-)
No, Python 2 is better and worse. Many protocols use magic numbers that look like ASCII-encoded English (eg, HTML tags). Python 2 is quite happy to process those magic numbers and the intervening content (as long as each stretch of non-ASCII is treated as an atomic unit), regardless of whether actual encoding matches local convention. (This is why the WSGI guys love Python 2 -- it can be multilingual without knowing the encoding!) On the other hand, the Windows environment will be more seamless (and allow useful processing of the "intervening content") as long as you stick to the local convention for encoding.

Web browsers can parse pages with multiple encodings seemingly perfectly into the correct display characters. A quick copy and paste produces UTF-8 encoded text in the clipboard (on linux). HOW DO THEY DO IT.. can we have their libraries? :) Some of the web pages I tried decoding made me pull my hair out. One I just cancelled the client.

On Tue, Feb 14, 2012 at 15:38, Chris Rebert <pyideas@rebertia.com> wrote:
The "chardet" package is in fact a port of Mozilla's encoding guessing code.
I thought at some point that it would be useful to have in the stdlib (I still do). It's already fairly successful on PyPI, after all, and it's very helpful when dealing with text of unknown character encoding. However, there are licensing issues. At one point I asked Van Lindberg to look into that... He forwarded me some email between him and Mozilla guys about this, but it was not yet conclusive. Cheers, Dirkjan
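For the curious, a hedged example of what using the third-party chardet package looks like (installed separately, e.g. via pip; the result is a guess with a confidence value, not a certainty):

    import chardet   # third-party: a port of Mozilla's detection code

    with open('mystery.txt', 'rb') as f:
        raw = f.read()

    guess = chardet.detect(raw)
    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    text = raw.decode(guess['encoding'] or 'latin-1')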

oh good, this long thread has already started talking about encoding detection packages. now I don't have to bring it up. :)

I suggest we link to one or more of these from the Python docs to their pypi project pages, as a suggestion for users that need to deal with the real world of legacy data files in a variety of undeclared formats rather than the internet world of utf-8 or bust.

At some point it might be interesting to have a library like this in the stdlib along with a common API for other compatible libraries, but I'm not sure any are ready for such a consideration. Is their behavior stable or still learning based on new inputs?

-gps

On 14 February 2012 09:36, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Point taken. I think my point is that I wish there was a more obvious way for me to tell Python that I just want to do it nearly right on this occasion (like "everything else" does) because I really don't need to care for now. I'm getting a lot closer to knowing how to do that as this thread progresses, though, which is why I think of this as more of an educational issue than anything else.

Thinking about how I'd code something like "cat" naively in C

    while ((i = getchar()) != EOF) { putchar(i); }

I guess encoding=latin1 is the way for Python to "work like everything else" in this context.

So I suppose there's a question. Do we really want to document how to "do it wrong"? At first glance, obviously not. But if we don't, it seems that the "Python 3 forces you to know Unicode" meme thrives, and we keep getting bad press. Maybe we could add a note to the open() documentation, something like the following:

"""To open a file, you need to know its encoding. This is not always obvious, depending on where the file came from, among other things. Other tools can process files without knowing the encoding by assuming the bytes of the file map 1-1 to the first 256 Unicode characters. This can cause issues such as mojibake or corrupted data, but for casual use is sometimes sufficient. To get this behaviour in Python (with all the same risks and problems) you can use the "latin1" encoding, which maps bytes to unicode as described above. It is far, far better to use the correct encoding declaration, if at all possible, however."""

I have no real opinion on whether this is the right thing to do. Unfortunately (in a sense :-)) it doesn't matter much to me any more, as I now have the benefit of learning from this thread, so I'm no longer in the target audience of the comment :-)
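For comparison, a sketch of that naive C "cat" written the latin-1 way in Python 3 (file name taken from the command line; newline='' keeps the pass-through byte-for-byte):

    import sys

    # Every byte decodes to exactly one code point, and encoding back to
    # latin-1 reproduces the input bytes unchanged -- mojibake risk included.
    with open(sys.argv[1], encoding='latin-1', newline='') as src:
        for line in src:
            sys.stdout.buffer.write(line.encode('latin-1'))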
Thanks for caring!<wink/>
Thanks for helping me learn! Paul

On 14/02/2012 22:08, Paul Moore wrote:
Hi, The Python equivalent to your C program is to use bytes without decoding at all: open a file with 'rb' mode, use sys.stdin.buffer, ... I think this is the right thing to do if you want to pass through unmodified text without knowing the encoding. Regards, -- Simon Sapin
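A sketch of that bytes-only approach, for contrast with the latin-1 version above:

    import sys

    # No decoding at all: read and write raw bytes, so the content passes
    # through untouched regardless of its encoding.
    with open(sys.argv[1], 'rb') as src:
        for chunk in iter(lambda: src.read(64 * 1024), b''):
            sys.stdout.buffer.write(chunk)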

On 13 February 2012 16:42, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Paul Moore writes:
[Lots of stuff from Stephen that I agree with].
I've only had one real use-case (and it was Java, but could easily be Python). We wanted to be able to export settings as a CSV file to be opened in Excel, modified and then re-imported. Turns out that if you want to open non-ascii CSV files in Excel, they must be encoded as (IIRC) UTF-16LE (i.e. without a BOM). I think you can save as other encodings, but that's the only one you can reliably open. Tim Delaney
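The original code was Java, but a Python sketch of the same idea might look like this, taking Tim's recollection of the required encoding at face value (verify against your own Excel version before relying on it):

    import csv

    # 'utf-16-le' writes UTF-16 little-endian without a BOM, per the
    # recollection above; newline='' is required when handing files to csv.
    with open('settings.csv', 'w', encoding='utf-16-le', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'value'])
        writer.writerow(['greeting', 'héllo'])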

On 11Feb2012 13:12, Stephen J. Turnbull <stephen@xemacs.org> wrote:
| Jim Jewett writes:
| > Are you saying that some (many? all?) platforms make a bad choice there?
|
| No. I'm saying that whatever choice is made (except for 'latin-1'
| because it accepts all bytes regardless of the actual encoding of the
| data, or PEP 383 "errors='surrogateescape'" for the same reason, both
| of which are unacceptable defaults for production code *for the same
| reason*), there is data that will cause that idiom to fail on Python 3
| where it would not on Python 2.

But... By your own argument here, the failing is on the part of Python 2, because it is passing when it should fail, because it is effectively using the equivalent of 'latin-1'. And you say right there that that is unacceptable. At least with Python 3 you find out early that you're doing something dodgy.

Disclaimer: I may be talking out my arse here; my personal code is all Python 2 at present because I haven't found an idle weekend (or, more likely, week) to spend getting it Python 3 ready (meaning parsing ok but probably failing a bunch of tests to start with). I do know that I've tripped over a heap of unicode versus latin-1/maybe-ascii text issues and python unicode-vs-str issues just recently in Python 2, and a lot of the ambiguity I've been juggling would be absent in Python 3 (because at least all the strings will be unicode and I can concentrate on the encoding/decode stuff instead).

[...snip...]
| The fact is that with a little bit of knowledge, you can almost
| certainly get more reliable (and in case of failure, more debuggable)
| results from Python 3 than from Python 2.

That's my hope.

| But people are happy to
| deal with the devil they know, even though it's more noxious than the
| devil they don't.

Not me :-) I speak as one who once moved to MH mail folders and vi-with-a-few-macros as a mail reader just to break my use of the mail reader I had been using :-(

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/
No system, regardless of how sophisticated, can repeal the laws of physics or overcome careless driving actions. - Mercedes Benz

Cameron Simpson writes:
At least with Python 3 you find out early that you're doing something dodgy.
The point is that there is a use case for "doing something dodgy." See Paul Moore's subthread for an example and discussion. However, I think people who do something dodgy should be forced to make it explicit in their code.

On 13Feb2012 15:23, Stephen J. Turnbull <stephen@xemacs.org> wrote:
| Cameron Simpson writes:
| > At least with Python 3 you find out early that you're doing something
| > dodgy.
|
| The point is that there is a use case for "doing something dodgy."
| See Paul Moore's subthread for an example and discussion.

Yes.

| However, I think people who do something dodgy should be forced to
| make it explicit in their code.

I think I agree here, too.
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/
There are old climbers, and there are bold climbers; but there are no old bold climbers.

participants (19):
- Barry Warsaw
- Cameron Simpson
- Carl M. Johnson
- Chris Rebert
- Christopher Reay
- Dirkjan Ochtman
- Eric Snow
- Gregory P. Smith
- Jim Jewett
- Nick Coghlan
- Niki Spahiev
- Paul Moore
- Simon Sapin
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Delaney
- Yuval Greenfield