Fall back to encoding unicode strings in utf-8 if latin-1 fails in http.client

Hi, I hope python-ideas is the right place to post this; I'm very new to this and would appreciate a pointer in the right direction if it's not. The requests project is getting multiple bug reports about a problem in the stdlib http.client, so I thought I'd raise the issue here. The bug reports concern people posting HTTP requests with unicode strings when they should be using utf-8 encoded strings. Since RFC 2616 says latin-1 is the default encoding, http.client tries that and fails with a UnicodeEncodeError.

My idea is NOT to change from latin-1 to something else, as that would break compliance with the spec, but instead to catch that exception and try encoding with utf-8 instead. That would avoid breaking backward compatibility, unless someone specifically relied on that exception, which I think is very unlikely. This also seems to be how HTTP libraries in other languages deal with it; sending in unicode just works.

In cURL (works fine):

curl http://example.com -d "Celebrate 🎉"

In Ruby with http.rb (works fine):

require 'http'
r = HTTP.post("http://example.com", :body => "Celebrate 🎉")

In Node with request (works fine):

var request = require('request');
request.post({url: 'http://example.com', body: "Celebrate 🎉"}, function (error, response, body) { console.log(body) })

But Python 3 with requests crashes instead:

import requests
r = requests.post("http://localhost:8000/tag", data="Celebrate 🎉")

...with the following stacktrace:

...
  File "../lib/python3.4/http/client.py", line 1127, in _send_request
    body = body.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 14-15: ordinal not in range(256)

----

So the rationale for this idea is:

* http.client doesn't work the way beginners expect for very basic use cases (posting unicode strings).
* Libraries in other languages behave the way beginners expect, which magnifies the problem.
* Changing the default latin-1 encoding probably isn't possible, because it would break the spec...
* But catching the exception and trying utf-8 instead wouldn't break the spec, and solves the problem.

----

Here are a couple of issues where people expect things to work differently:

https://github.com/kennethreitz/requests/issues/1926
https://github.com/kennethreitz/requests/issues/2838
https://github.com/kennethreitz/requests/issues/1822

----

Does this make sense?

/Emil
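For concreteness, here is a minimal sketch of the proposed fallback (a hypothetical encode_body helper, not the actual _send_request code):

def encode_body(body):
    try:
        # Keep latin-1 as the default, preserving current behaviour.
        return body.encode('iso-8859-1')
    except UnicodeEncodeError:
        # Proposed: fall back to utf-8 instead of crashing.
        return body.encode('utf-8')

print(encode_body("Celebrate 🎉"))  # b'Celebrate \xf0\x9f\x8e\x89'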

On Thu, Jan 7, 2016 at 8:20 PM, Emil Stenström <em@kth.se> wrote:
It makes sense, but I disagree with the suggestion. Having "Latin-1 or UTF-8" as the effective default encoding is not a good idea, IMO; sometimes I've *de*coded text using such heuristics (the other order, of course: attempt UTF-8 decode, and if that fails, decode as Latin-1 or possibly CP-1252) as a means of coping with broken systems, but I would much prefer the default to simply be one or the other.

As the 'requests' module is not part of Python's standard library, it would be free to change its own default, regardless of the behaviour of http.client; whether that's a good idea or not is for the requests community to decide (unless there's something specifically binding it to http.client). But whether you're asking for a change in http.client or in requests, I would disagree with the "either-or" approach; change to a UTF-8 default, perhaps, but not to the hybrid.

ChrisA

On Thu, Jan 07, 2016 at 08:49:55PM +1100, Chris Angelico wrote:
It makes sense, but I disagree with the suggestion. Having "Latin-1 or UTF-8" as the effective default encoding is not a good idea, IMO;
I'm curious what your reasoning is. That seems to be fairly common behaviour with some email clients; for example, I seem to recall that Thunderbird will try encoding emails as US-ASCII, if that fails, Latin-1, and only send UTF-8 if the other two don't work. I'm not defending this tactic, but wondering what you have against it.

-- Steve

On 2016-01-07 13:59, Steven D'Aprano wrote:
I'm fine with either tactic: either defaulting to utf-8 or trying them one after the other. The important thing for me is that the API works as many expect. My main reason for not changing the default was that it would break backwards compatibility, but only for the case where people sent latin-1 strings as if they were unicode strings. If the reading of the spec that led to using latin-1 is incorrect, that really makes me question whether having latin-1 there was a good idea in the first place. So I'm definitely pro switching to utf-8 as the default, as it would make the API work like many (including me) would expect.

/Emil

On Thu, Jan 7, 2016 at 11:59 PM, Steven D'Aprano <steve@pearwood.info> wrote:
An application is free to do that if it likes, although personally I wouldn't bother. For a library, I'd much rather the rules be as simple as possible. Maybe "ASCII or UTF-8" (since one is a strict subset of the other), but not "ASCII or Latin-1 or UTF-7". I'd prefer something extremely simple: if you don't specify an encoding, it has one default. That corresponds to a function signature that says encoding="UTF-8", and you can be 100% confident that omitting the encoding parameter will do the same thing as passing "UTF-8". ChrisA
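As a sketch of the simple rule Chris describes (hypothetical helper name, not an existing API):

def prepare_body(body, encoding="utf-8"):
    # One default, visible in the signature: omitting encoding=
    # always behaves exactly like passing "utf-8".
    return body.encode(encoding)

assert prepare_body("Celebrate 🎉") == prepare_body("Celebrate 🎉", "utf-8")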

Thanks especially to Cory for digging into the source and the RFCs here!

Personally I'm perplexed that Requests, which claims to be "HTTP for Humans", doesn't take care of this but just lets http/client.py blow up. (However, IIUC both 2838 and 1822 are about the body.encode() call in Python 3's http/client.py at _send_request(). 1926 seems to originate in Requests itself; it's also Python 2.7.)

Anyways, if we were to follow the Python 3 philosophy regarding Unicode to the letter we would have to reject the str type altogether here, and insist on bytes. The error message could tell the caller what to do, e.g. "use data.encode('utf-8') if you want the data to be encoded in UTF-8". (Then of course the server might not like it.) An alternative could be to look at the Content-Type header (if one is given) and use the charset from there, or the default from the RFC for that content type. But all of these are rather painfully backwards incompatible, which is a big concern here.

Maybe the best solution (most backward compatible *and* most likely to stem the flood of bug reports) is to just catch the UnicodeError and replace its message with something more human-friendly, explaining that the data must be encoded before sending it. Then the user can figure out what encoding to use (though yes, most likely UTF-8 is it, so the message could suggest trying that first).

--Guido van Rossum (python.org/~guido)
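A sketch of that last option, as a hypothetical wrapper around the existing encode call (the message wording here is illustrative only):

def encode_request_body(body):
    try:
        return body.encode('iso-8859-1')  # unchanged default
    except UnicodeEncodeError as err:
        # Same exception type, friendlier reason text.
        raise UnicodeEncodeError(
            err.encoding, err.object, err.start, err.end,
            "HTTP body strings default to latin-1; encode the data "
            "yourself before sending it, e.g. data.encode('utf-8')"
        ) from None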

On 7 Jan 2016, at 16:32, Guido van Rossum <guido@python.org> wrote:
Personally I'm perplexed that Requests, which claims to be "HTTP for Humans" doesn't take care of this but just lets http/client.py blow up. (However, IIUC both 2838 and 1822 are about the body.encode() call in Python 3's http/client.py at _send_request(). 1926 seems to originate in Requests itself; it's also Python 2.7.)
The main reason is historical: this was missed in the original (substantial) rewrite in requests 2.0, and as a result we can’t change it without a backward compat break, just the same as Python. We’ll probably fix it in 3.0. Cory

On 2016-01-07 at 17:46, Cory Benfield wrote:
So as things stand:

* The general consensus seems to be that the raised error should be changed to something like: TypeError("Unicode string supplied without an explicit encoding")
* Python would like to change http.client to reject unicode input with an exception, but won't because of backwards compatibility
* Requests would like to do the same but won't because of backwards compatibility

I think it will be very hard to find code that breaks because of a type change in the exception when sending invalid data. On the other hand, it's VERY easy to find people that are affected by the confusing error currently in use everywhere. When a backward compatible change makes life easier for 99.9% of users, and 0.1% of users need to debug a TypeError with a very clear error message (which was probably a bug in their code to begin with), I'm starting to question having a policy that strict.

/Emil

On Thu, Jan 7, 2016 at 10:50 AM, Emil Stenström <em@kth.se> wrote:
What policy are you referring to? I don't think anyone objects to making the error message clearer. The objection is against rejecting unicode strings that in the past would have been successfully encoded using Latin-1.

I'm not sure whether it's a good idea to change the exception type from UnicodeError to TypeError -- the exception is really related to Unicode, so keeping UnicodeError but changing the message sounds like the right thing to do. And this can be done independently in both Requests and the stdlib.

--Guido van Rossum (python.org/~guido)

On Thursday, January 7, 2016 11:05 AM, Guido van Rossum <guido@python.org> wrote:
I'm not sure whether it's a good idea to change the exception type from UnicodeError to TypeError -- the exception is really related to Unicode, so keeping UnicodeError but changing the message sounds like the right thing to do. And this can be done independently in both Requests and the stdlib.

That sounds like a good idea. A UnicodeEncodeError (or subclass of it?) with text like "HTTP body without encoding defaults to 'latin-1', which can't encode character '\u5555' in position 30: ordinal not in range(256)" would be pretty simple to implement, and would help a lot more than the current text. (And, for those who still can't figure it out, being a unique error message means that within a few days of the change, googling it should get a relevant StackOverflow answer, which isn't true for the generic encoding error message.)
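Building that message from the original error would be straightforward; a rough sketch (hypothetical helper, wording as proposed above):

def friendly_reason(err):
    # err is the UnicodeEncodeError from body.encode('iso-8859-1')
    bad = err.object[err.start]
    return ("HTTP body without encoding defaults to 'latin-1', which "
            "can't encode character %r in position %d: ordinal not in "
            "range(256)" % (bad, err.start))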
Requests could get fancier. For example, if the string starts with "{", make the error message ask if maybe they wanted to use json=obj instead of data=json.dumps(obj). But I think that wouldn't be appropriate for the stdlib. (Especially since http.client doesn't have a json parameter...) But then it sounds like Requests is planning to remove implicitly-Latin-1 strings via data= anyway in 3.0, which would solve the problem more simply.
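A sketch of that Requests-side hint (hypothetical, and the check is deliberately naive):

def add_json_hint(body, message):
    # If the str body looks like serialized JSON, point at json=obj.
    if body.lstrip().startswith("{"):
        message += ("; this looks like JSON -- consider passing the "
                    "object via json=obj instead of data=json.dumps(obj)")
    return message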

On 2016-01-07 at 20:04, Guido van Rossum wrote:
What policy are you referring to?
I was reading https://www.python.org/dev/peps/pep-0387/#backwards-compatibility-rules, which specifies "raised exceptions", but I see now that it's only a draft.
Then I misunderstood, sorry.
Agreed. I would also suggest adding a suggestion to encode as "utf-8" specifically, since that is most likely what will fix the problem. As time goes by and more and more legacy systems disappear, this advice will become truer each year. /Emil

On Jan 7, 2016, at 11:40, Emil Stenström <em@kth.se> wrote:
Agreed. I would also suggest adding a suggestion to encode as "utf-8" specifically, since that is most likely what will fix the problem. As time goes by and more and more legacy systems disappear, this advice will become truer each year.
I disagree. Services that take raw, unformatted text as HTTP bodies and do something useful with it are disappearing in general, not changing the encoding they use for that raw, unformatted text from Latin-1 to UTF-8. And they were never that common in the first place. So we shouldn't be making it easier to send raw, unformatted text as UTF-8; we should be making it easier to send JSON, form-encoded, multipart, XML, etc. Which, again, Requests already does.

On 2016-01-07 at 21:24, Guido van Rossum wrote:
It's time that someone files a tracker issue so we can move the remaining discussion there.
Here is the relevant issue: http://bugs.python.org/issue26045 /Emil

On 2016-01-07 at 21:04, Andrew Barnert wrote:
I just wrote a service like this last week. It takes raw unformatted text and returns part-of-speech tags for the text as JSON. That's common for NLP services that structure unstructured text. The rationale for accepting a raw POST body is simply that it makes the service very easy to call from curl:

curl http://example.com -d "string here"

So there's no reason these kinds of services would be disappearing. Let's continue the discussion in the bug tracker: http://bugs.python.org/issue26045

/Emil

On Thu, Jan 7, 2016, at 07:59, Steven D'Aprano wrote:
Sure, but it includes a content-type header with a charset parameter. I think the behavior of encoding text but not including a charset parameter is fundamentally broken. If the user supplies a charset parameter, it should try to use the matching encoding, otherwise it should pick an encoding (whether that is "always UTF-8" or some other rule) and add the charset parameter.
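A sketch of that rule (hypothetical helper; cgi.parse_header was the usual way to split out the charset parameter in Python 3 of this era):

import cgi

def encode_with_charset(body, content_type=None):
    if content_type:
        _, params = cgi.parse_header(content_type)
        if 'charset' in params:
            # Honour the caller's explicit charset.
            return body.encode(params['charset']), content_type
    # No charset given: pick one and advertise it in the header.
    content_type = (content_type or 'text/plain') + '; charset=utf-8'
    return body.encode('utf-8'), content_type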

On 7 Jan 2016, at 09:20, Emil Stenström <em@kth.se> wrote:
Since RFC 2616 says latin-1 is the default encoding http.client tries that and fails with a UnicodeEncodeError.
I cannot stress this enough: there is *no* default encoding for HTTP bodies! This conversation is very confused, and it all starts because of a thoroughly misleading comment in http.client.

Firstly, let's all remember that RFC 2616 is dead (hurrah!), superseded by RFCs 7230 through 7235. However, http.client blames its decision on RFC 2616. Note the comment here[0]. This is (in my view) a *misreading* of RFC 2616 Section 3.7.1, which says:

    When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
The thing is, this paragraph is referring to MIME types: that is, when the Content-Type header reads "text/<something>" and specifies no charset parameter, the body should be assumed to be encoded in ISO-8859-1. That, of course, is not the invariant this code enforces. Instead, this code spots the *only* explicit reference to a text encoding in the RFC and chooses to use it for any unicode string sent by the user. That's a somewhat defensible decision, though it's not the one I'd have made.

*However*, that fallback was removed in RFC 7231. In appendix B of that RFC, we see this note:

    The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.
This means there is no longer a default content encoding for HTTP; instead, the default encoding varies based on media type. The relevant RFC for this is RFC 6657, which specifies the following things:

- The default encoding for text/plain is US-ASCII.
- All other text subtypes either MUST provide a charset parameter that explicitly indicates their encoding, or MUST NOT provide one under any circumstances and instead carry that information in their contents (e.g. HTML, XML).

That is to say, there are no defaults for text/* encodings: only explicit encoding choices! This whole thing was really very confusing from the beginning. IMO, the only safe decision is for http.client to simply refuse to accept unicode strings *at all* as request bodies: the ambiguity over what they mean is simply too great. Requests has had a large number of bug reports from people who claimed that something "didn't work", when in practice there was just a disagreement over what the correct encoding of something was. And having written both an HTTP/1.1 and an HTTP/2 client myself, in both cases I restricted the arguments of HTTPConnection.send() to bytestrings.

For what it's worth, I don't believe it's a good idea to change the default body encoding of unicode strings. This is the kind of really perplexing change that takes working code that implicitly relies on this behaviour and breaks it. In my experience, breakage of this manner is particularly tricky to catch because anything that can be validly encoded as Latin-1 can also be validly encoded as UTF-8, so the failure will manifest as request failures rather than tracebacks. In this instance I believe the http.client module has made its bed and will need to lie in it. If this *did* change, Requests would (at least for the remainder of the 2.X release cycle) need to enforce the Latin-1 behaviour itself for the very same backward compatibility reasons, which removes any benefit we'd get from this anyway.

The really correct behaviour would be to tell users they cannot send unicode strings, because it makes no sense. That's a change I could get behind. But moving from one guess to another, even though the new guess is more likely to be right, seems to me to be misunderstanding the problem.

Cory

N.B.: I should note that only one of the linked requests issues, #2838, is actually about the request body. Of the others, one is about unicode in the request URI and one is about unicode in header values. This set of related issues demonstrates an ongoing confusion amongst users about what unicode strings are and how they work, but that's a separate discussion from this one.

[0]: https://github.com/python/cpython/blob/master/Lib/http/client.py#L1173-L1176
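The behaviour Cory argues for would look something like this (a hypothetical guard, not a patch anyone has actually proposed):

def require_bytes(body):
    if isinstance(body, str):
        raise TypeError(
            "HTTP request bodies must be bytes; encode your text "
            "explicitly first, e.g. body.encode('utf-8')")
    return body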

On 7 January 2016 at 09:20, Emil Stenström <em@kth.se> wrote:
curl http://example.com -d "Celebrate 🎉"

In a Unix shell, this is supplying a bytestring argument to the curl exe, encoded in whatever locale the user has configured (likely UTF-8). In Windows Powershell (the only Windows shell I can think of that would support Unicode), what would happen depends on how curl accesses its command line, which probably comes down to which specific CRT the code was built with.

require 'http'
r = HTTP.post("http://example.com", :body => "Celebrate 🎉")

I don't know how Ruby handles Unicode, but would that body argument *actually* be Unicode, or would it be a UTF-8 encoded bytestring? I have a vague recollection that Ruby uses a "utf-8 for internal string encodings" model, which may mean it's not as strict as Python 3 is about separating bytestrings and Unicode strings...

var request = require('request');
request.post({url: 'http://example.com', body: "Celebrate 🎉"}, function (error, response, body) { console.log(body) })

Same response here as for Ruby. It depends on the semantics of the language regarding Unicode support as to what's happening here.

What does the requests documentation say it'll do with a Unicode string passed as POST data to a request where there's no encoding? If it says it'll encode as latin-1, then that error is entirely correct. If it says it'll encode in some other encoding, then it isn't doing so (and that's a requests bug). If it doesn't explain what it's doing, then the requests documentation is doing its users a disservice by not explaining the realities of sending Unicode over a byte-oriented protocol - and it's also leaving a huge "undefined behaviour" hole that people are falling into.

I understand that beginners are confused by the apparent problem that other environments "just work", but they really don't - and the problems will hit the user further down the line, when the issue is harder to debug. For example, you're completely ignoring the potential issue of what the target server will do when faced with UTF-8 data - there's no guarantee that it will work in general.

So IMO, this needs to be addressed as a documentation (and possibly code) fix in requests. It's something of a shame that http.client doesn't reject Unicode strings rather than making a silent assumption about the encoding, but that's something we have to live with for backward compatibility reasons. But there's no reason requests has to expose that behaviour to the user.

Paul

On Thu, Jan 7, 2016 at 10:37 PM, Paul Moore <p.f.moore@gmail.com> wrote:
So IMO, this needs to be addressed as a documentation (and possibly code) fix in requests. It's something of a shame that http.client doesn't reject Unicode strings rather than making a silent assumption about the encoding, but that's something we have to live with for backward compatibility reasons. But there's no reason requests has to expose that behaviour to the user.
Personally, I would be happy with any of three behaviours:

1) Raise TypeError and demand that byte strings be used
2) Encode as UTF-8, since that's most likely to "just work", and is also consistent
3) Encode as ASCII, and let any errors bubble up.

But, backward compat.

ChrisA

On 7 January 2016 at 11:53, Chris Angelico <rosuav@gmail.com> wrote:
3) Encode as ASCII, and let any errors bubble up.
4) Encode as ASCII and catch UnicodeEncodeError and re-raise as a TypeError "Unicode string supplied without an explicit encoding". IMO, the underlying encoding errors are very user-unfriendly, and should nearly always be caught internally and replaced with something more user friendly. Most of the user confusion I see from Unicode issues could probably be significantly alleviated if the user was presented with something better than a raw (en/de)coding error and traceback. Paul
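A minimal sketch of option 4 (hypothetical helper, message as suggested):

def encode_ascii_or_explain(body):
    try:
        return body.encode('ascii')
    except UnicodeEncodeError:
        raise TypeError(
            "Unicode string supplied without an explicit encoding"
        ) from None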

On Thu, Jan 7, 2016, at 06:53, Chris Angelico wrote:
What about: 4) Silently add a content type (default text/plain; charset=UTF-8) or charset (if the user has specified a content type without one) if a unicode string is used. If a byte string is used, use application/octet-stream for the default content type and don't add a charset in any case (even if the user-specified content type is text/*)
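A sketch of that variant (hypothetical helper; the charset extraction is naive, for illustration only):

def body_and_content_type(body, content_type=None):
    if isinstance(body, bytes):
        # bytes: never guess a charset, even for text/* content types.
        return body, content_type or 'application/octet-stream'
    if content_type is None:
        content_type = 'text/plain; charset=UTF-8'
    elif 'charset=' not in content_type:
        content_type += '; charset=UTF-8'
    # Encode with whatever charset the header now declares.
    charset = content_type.rsplit('charset=', 1)[1]
    return body.encode(charset), content_type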

On Jan 7, 2016, at 01:20, Emil Stenström <em@kth.se> wrote:
This is also how other languages http libraries seem to deal with this, sending in unicode just works:
No, sending Unicode as UTF-8 doesn't "just work", except when the server is expecting UTF-8. Otherwise, it just makes the problem harder to debug.

Most commonly, people who run into this problem with requests are trying to send JSON or form-encoded data. In either case, the solution is simple: just pass the object to the json= or data= parameter. It's only if you try to do it half-way yourself, calling json.dumps but then not calling .encode, that you run into a problem.

I've also seen people run into this uploading files. Again, if you let requests just take care of it for you (by passing it the filename or file object), it just works. But if you try to do it half-way, reading the whole file into memory as a string but not encoding it, that's when you have problems.

The solution in every case is simple: don't make things harder for yourself by doing extra work and then trying to use the lower-level API; just let requests do it for you. Of course if you're using http.client or urllib instead of requests, you don't have that option. But if http.client is too low-level for you, the solution isn't to hack up http.client to be more magical when used by people who don't know what they're doing, in hopes that it'll work more often than it'll cause further and harder-to-debug problems; it's to tell them to use requests if they don't want to learn what they're doing.
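To make the "don't do it half-way" point concrete (json= and data= are real requests parameters; the URL is the example one from earlier in the thread):

import json
import requests

obj = {"message": "Celebrate 🎉"}

# Let requests serialize, encode, and set Content-Type itself:
r = requests.post("http://localhost:8000/tag", json=obj)

# Or do the whole job yourself -- bytes in, no guessing:
r = requests.post("http://localhost:8000/tag",
                  data=json.dumps(obj).encode('utf-8'),
                  headers={"Content-Type": "application/json"})

# The half-way version that can break: json.dumps(obj, ensure_ascii=False)
# hands http.client a str containing 🎉, which latin-1 can't encode.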

participants (8):

- Andrew Barnert
- Chris Angelico
- Cory Benfield
- Emil Stenström
- Guido van Rossum
- Paul Moore
- Random832
- Steven D'Aprano