Fall back to encoding unicode strings in utf-8 if latin-1 fails in http.client
Hi,

I hope python-ideas is the right place to post this. I'm very new to this and would appreciate a pointer in the right direction if it is not.

The requests project is getting multiple bug reports about a problem in the stdlib http.client, so I thought I'd raise it here. The bug reports concern people posting HTTP requests with unicode strings when they should be passing UTF-8 encoded byte strings. Since RFC 2616 says latin-1 is the default encoding, http.client tries that and fails with a UnicodeEncodeError.

My idea is NOT to change from latin-1 to something else, since that would break compliance with the spec, but instead to catch that exception and try encoding with utf-8. That would avoid breaking backward compatibility, unless someone specifically relied on that exception, which I think is very unlikely.

This is also how HTTP libraries in other languages seem to deal with this; sending in unicode just works.

In cURL (works fine):

    curl http://example.com -d "Celebrate 🎉"

In Ruby with http.rb (works fine):

    require 'http'
    r = HTTP.post("http://example.com", :body => "Celebrate 🎉")

In Node with request (works fine):

    var request = require('request');
    request.post({url: 'http://example.com', body: "Celebrate 🎉"}, function (error, response, body) {
        console.log(body)
    })

But Python 3 with requests crashes instead:

    import requests
    r = requests.post("http://localhost:8000/tag", data="Celebrate 🎉")

...with the following stacktrace:

    ...
      File "../lib/python3.4/http/client.py", line 1127, in _send_request
        body = body.encode('iso-8859-1')
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 14-15: ordinal not in range(256)

----

So the rationale for this idea is:

* http.client doesn't work the way beginners expect for very basic use cases (posting unicode strings).
* Libraries in other languages behave the way beginners expect, which magnifies the problem.
* Changing the default latin-1 encoding probably isn't possible, because it would break the spec...
* But catching the exception and trying to encode in utf-8 instead wouldn't break the spec, and it solves the problem.

----

Here are a few issues where people expect things to work differently:

https://github.com/kennethreitz/requests/issues/1926
https://github.com/kennethreitz/requests/issues/2838
https://github.com/kennethreitz/requests/issues/1822

----

Does this make sense?

/Emil
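
For illustration, the proposed fallback could look roughly like the sketch below. The helper name `encode_body` is hypothetical and is not an actual http.client function; it only mirrors the `body.encode('iso-8859-1')` step seen in the stacktrace and adds the utf-8 retry on top of it.

```python
def encode_body(body):
    """Hypothetical helper (not part of http.client) sketching the idea.

    Try latin-1 first, as the spec default requires, and only fall back
    to utf-8 when latin-1 cannot represent the text.
    """
    if isinstance(body, str):
        try:
            # Current behaviour: encode with the RFC 2616 default.
            return body.encode("iso-8859-1")
        except UnicodeEncodeError:
            # Proposed behaviour: retry with utf-8 instead of raising.
            return body.encode("utf-8")
    # Bytes (or other body types) are passed through unchanged.
    return body


if __name__ == "__main__":
    print(encode_body("café"))          # representable in latin-1
    print(encode_body("Celebrate 🎉"))  # falls back to utf-8
```

With this sketch, text that latin-1 can represent is encoded exactly as it is today, so only the inputs that currently raise would behave differently.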