On 7 January 2016 at 09:20, Emil Stenström <em@kth.se> wrote:
This is also how other languages' HTTP libraries seem to deal with this; sending in Unicode just works:

In cURL (works fine):
curl http://example.com -d "Celebrate 🎉"

In a Unix shell, this supplies a bytestring argument to the curl executable, encoded according to whatever locale the user has set (likely UTF-8).

In Windows PowerShell (the only Windows shell I can think of that supports Unicode), what happens would depend on how curl accesses its command line, which probably comes down to which specific CRT the binary was built with.
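
To be concrete about what "just works" means here: under a UTF-8 locale the shell hands curl plain bytes, not characters. A quick Python sketch of that same encoding step (the byte values are simply what UTF-8 produces for that string):

arg = "Celebrate 🎉"
utf8_bytes = arg.encode("utf-8")   # the bytes a UTF-8 locale would pass to curl
print(utf8_bytes)                  # b'Celebrate \xf0\x9f\x8e\x89'
print(len(arg), len(utf8_bytes))   # 11 characters, but 14 bytes on the wire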
 
In Ruby with http.rb (works fine):
require 'http'
r = HTTP.post("http://example.com", :body => "Celebrate 🎉")

I don't know how Ruby handles Unicode, but would that body argument *actually* be Unicode, or would it be a UTF-8 encoded bytestring? I have a vague recollection that Ruby uses a "utf-8 for internal string encodings" model, which may mean it's not as strict as Python 3 is about separating bytestrings and Unicode strings...
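
For comparison, here's the distinction Python 3 insists on making explicit - a minimal sketch, nothing requests-specific about it:

text = "Celebrate 🎉"          # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: one particular encoding, chosen explicitly
print(type(text), type(data))  # <class 'str'> <class 'bytes'>
print(text == data)            # False - the two are not interchangeable

Any library that wants to put that text on the wire has to pick an encoding somewhere; the only question is whether the choice is visible to the user or hidden.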
 
In Node with request (works fine):
var request = require('request');
request.post({url: 'http://example.com', body: "Celebrate 🎉"}, function (error, response, body) {
    console.log(body)
})

Same comment here as for Ruby: what's actually happening depends on the language's semantics regarding Unicode support.
 
But Python 3 with requests crashes instead:
import requests
r = requests.post("http://localhost:8000/tag", data="Celebrate 🎉")
...with the following stacktrace:
...
  File "../lib/python3.4/http/client.py", line 1127, in _send_request
    body = body.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 14-15: ordinal not in range(256)

What does the requests documentation say it'll do with a Unicode string passed as POST data when no encoding is specified? If it says it'll encode as latin-1, then that error is entirely correct. If it says it'll encode in some other encoding, then it isn't doing so (and that's a requests bug). If it doesn't explain what it's doing, then the requests documentation is doing its users a disservice by not explaining the realities of sending Unicode over a byte-oriented protocol - and it's also leaving a huge "undefined behaviour" hole that people are falling into.
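
For what it's worth, the way to get defined behaviour today is for the caller to do the encoding explicitly, so that requests (and http.client underneath it) only ever sees bytes. A sketch, assuming the endpoint genuinely expects UTF-8 text (the URL is just Emil's example):

import requests

body = "Celebrate 🎉".encode("utf-8")   # choose the encoding explicitly
r = requests.post(
    "http://localhost:8000/tag",
    data=body,                          # bytes are passed through untouched
    headers={"Content-Type": "text/plain; charset=utf-8"},  # tell the server how to decode them
)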

I understand that beginners are confused when other environments appear to "just work", but they really don't - the problems just hit the user further down the line, when the issue is harder to debug. For example, you're completely ignoring the question of what the target server will do when faced with UTF-8 data - there's no guarantee that it will work in general.
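
A minimal sketch of that failure mode - nothing here is specific to requests, it's just what happens when the two ends assume different encodings:

sent = "Fête".encode("utf-8")     # what the client puts on the wire
seen = sent.decode("latin-1")     # what a server assuming Latin-1 reads
print(seen)                       # 'FÃªte' - classic mojibake, and no error anywhere

try:
    "Celebrate 🎉".encode("utf-8").decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)                    # an ASCII-only server rejects the body outright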

So IMO, this needs to be addressed as a documentation (and possibly code) fix in requests. It's something of a shame that http.client silently assumes an encoding rather than rejecting Unicode strings, but that's something we have to live with for backward compatibility reasons. But there's no reason requests has to expose that behaviour to the user.
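
To illustrate why the silent assumption is so nasty: whether http.client's fallback (body.encode('iso-8859-1'), as in the traceback above) succeeds depends entirely on which characters happen to be in the string, not on anything the caller chose. A sketch of the effect:

for text in ("hello", "Fête", "Celebrate 🎉"):
    try:
        encoded = text.encode("iso-8859-1")
        print("%r silently sent as %r" % (text, encoded))
    except UnicodeEncodeError:
        print("%r raises UnicodeEncodeError" % (text,))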

Paul