[Web-SIG] Unicode in Python 3
renesd at gmail.com
Sat Sep 19 15:10:29 CEST 2009
On Sat, Sep 19, 2009 at 1:34 PM, Armin Ronacher
<armin.ronacher at active-4.com> wrote:
> René Dudfield schrieb:
>> I think that shows that they are being handled differently depending
>> on type. Which is against polymorphism... but some people prefer to
>> have separate functions for different types(in and out). I don't
>> think other python functions do this though. So maybe this is a one
>> off, and could be considered a bug... I'm not sure why they did it
>> this way.
> The fact that urldecode and urlparse do not provide a byte-only
> implementation is something I would consider a bug. After all that
> module is called "urlparse" and not "iriparse".
I think they should work on buffers too, since that's one of the
types sockets support.
>> Here is a snippet from the compat.py we used to port pygame to support
>> python2.3 through 3.1
> How is that related?
It shows an alternative to using a 2to3 tool, which leaves you with two
versions of your code: make the same code work in python 2.x and 3.x.
2to3 outputs python2.x incompatible code when it doesn't have to.
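As a rough sketch of the idea (this is not the actual pygame compat.py;
the helper name as_bytes is only illustrative), a small shim can stand in
for b'' literals so that one source file runs on both:

```python
import sys

# Illustrative helper, assumed for this sketch; not copied from pygame.
if sys.version_info[0] >= 3:
    def as_bytes(text):
        """On python 3 a plain '' literal is unicode, so encode it."""
        return text.encode('latin-1')
else:
    def as_bytes(text):
        """On python 2 a plain '' literal is already a byte string."""
        return text

# Written with plain '' literals, so the syntax is valid on 2.x and 3.x.
CRLF = as_bytes('\r\n')
header = as_bytes('Content-Type: text/plain') + CRLF
```

latin-1 is used because it maps code points 0-255 straight to the same
byte values, so the literal means the same thing on both versions.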
>> Arguments against using bytes (and using unicode instead).
>> So I'm -1 on using b'' all over the place since it's not in both
>> versions of python, and makes it impossible for code bases to share
>> the same code for multiple versions of python.
> That would not matter much because the high-level applications never see
> what's under the hood. Besides web2py all frameworks and libraries I
> know about are using unicode internally anyways.
It would mean code bases need to support b'' - which is not compatible
with python2. This makes it harder to port, as it restricts people to
having separate code bases for each python version. This is not possible
for some code bases since it doubles the maintenance burden.
Convincing people to port to python3 is already hard enough.
>> Argument for using bytes:
> There are many more. It's supposed to be byte based everywhere because
> that's how these protocols work. There is no magic unicode layer in
> HTTP that solves all of our problems.
> - URLs are byte based, URLs are untrusted
> - WSGI 1.0 was byte based, API wise that means the smallest change
> - Frameworks don't have to be totally rewritten because they already
> have their own unicode conversion functions.
> - Except the application, nothing knows about the real encoding
I'm advocating having two keys... one unicode and a raw buffer version of keys.
- unicode because everyone is using unicode these days anyway (the web
browsers, and most upper layer frameworks)
- buffer for raw data as you need it sometimes and writing performant
wsgi apps becomes a lot more possible. This raw buffer can be marked
with any relevant encoding if needed (eg, what the browser suggests it
is, and what the server suggests it is).
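A minimal sketch of what I mean, for python 3 only. The 'wsgi.raw.' key
prefix, the '.encoding' suffix and the add_key helper are all made up for
illustration and are not in any WSGI spec:

```python
from urllib.parse import unquote_to_bytes

RAW_PREFIX = 'wsgi.raw.'  # hypothetical prefix, not part of WSGI


def add_key(environ, name, raw, encoding='utf-8'):
    """Store both a unicode and a raw bytes version of one value."""
    # Raw bytes exactly as they came off the wire.
    environ[RAW_PREFIX + name] = raw
    # The encoding the raw value is marked with (a guess, not a fact).
    environ[RAW_PREFIX + name + '.encoding'] = encoding
    # Unicode version for the upper layers; 'replace' stops untrusted
    # input from raising, and apps that care can use the raw key.
    environ[name] = raw.decode(encoding, 'replace')
    return environ


# Percent-decode to bytes first, then let add_key do the text decoding.
environ = add_key({}, 'PATH_INFO', unquote_to_bytes('/caf%C3%A9'))
```

An app that trusts the suggested encoding reads environ['PATH_INFO'];
one that doesn't can decode environ['wsgi.raw.PATH_INFO'] itself.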
> Graham's suggestion for URL encodings means that the URL encoding would
> have to be passed to the WSGI server from outside (he proposed the
> apache config as an example). This means that the application behavior
> will change based on the server configuration, causing even more confusion.
I'm not sure what this particular suggestion is. Having wsgi
apps behave the same with different servers is one of its main points
- so if that's the case that's not a good idea.
> Let us ignore 2to3 and syntax problem for a minute. These are a lot
> less complex than the actual encoding problems. Also it is very, very
> unlikely that applications will be able to go through 2to3 and continue
> to work because there is just too much stuff that changes. b'' vs '' is
> really the smallest issue we have with WSGI currently. Changed behavior
> of the bytes object and a semi-unicode aware standard library are the
> biggest problems in my opinion.
Well, this thread is about python3 issues. I think there are enough
people who want to consider the python3 issues not to ignore them.