[Web-SIG] Unicode in Python 3
armin.ronacher at active-4.com
Sat Sep 19 14:34:06 CEST 2009
René Dudfield wrote:
> I think that shows that they are being handled differently depending
> on type. Which is against polymorphism... but some people prefer to
> have separate functions for different types(in and out). I don't
> think other python functions do this though. So maybe this is a one
> off, and could be considered a bug... I'm not sure why they did it
> this way.
The fact that urldecode and urlparse do not provide a byte-only
implementation is something I would consider a bug. After all, that
module is called "urlparse" and not "iriparse".
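To make "byte-only" concrete, here is a minimal sketch of percent-decoding
that never leaves the bytes domain. The helper name unquote_bytes is
hypothetical and only for illustration; it is not the standard library's
API.

```python
import re

# Matches one percent-escape such as %C3 in a byte string.
_PERCENT = re.compile(rb"%([0-9A-Fa-f]{2})")


def unquote_bytes(raw: bytes) -> bytes:
    """Percent-decode raw without assuming any character encoding.

    Hypothetical helper for illustration: bytes in, bytes out.
    Malformed escapes such as %zz are left untouched.
    """
    def _replace(match):
        # The two hex digits are ASCII, so decoding them is always safe.
        return bytes.fromhex(match.group(1).decode("ascii"))

    return _PERCENT.sub(_replace, raw)


path = unquote_bytes(b"/caf%C3%A9")
print(path)  # still bytes; no charset has been guessed yet
```

Whether those bytes are UTF-8, Latin-1 or something else is left to the
caller.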
> Here is a snippet from the compat.py we used to port pygame to support
> python2.3 through 3.1
How is that related?
> Arguments against using bytes (and using unicode instead).
> So I'm -1 on using b'' all over the place since it's not in both
> versions of python, and makes it impossible for code bases to share
> the same code for multiple versions of python.
That would not matter much because the high-level applications never see
what's under the hood. Apart from web2py, all frameworks and libraries I
know about are using unicode internally anyway.
> Argument for using bytes:
There are many more. It's supposed to be byte based everywhere because
that's how these protocols work. There is no magic unicode layer in
HTTP that solves all of our problems.
- URLs are byte based, and URLs are untrusted
- WSGI 1.0 was byte based, so API-wise staying with bytes is the
  smallest change
- Frameworks don't have to be totally rewritten because they already
  have their own unicode conversion functions
- Except for the application, nothing knows about the real encoding
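As a rough sketch of the points above: the server hands over raw bytes
and the application (or its framework) decodes them with the charset only
it knows. The bytes-valued PATH_INFO, the APP_CHARSET constant and the
get_path helper are all assumptions made for this example, not part of
any WSGI specification.

```python
# Assumption: only the application knows which charset its URLs use.
APP_CHARSET = "utf-8"


def get_path(environ):
    """Decode the request path at the application boundary.

    Hypothetical helper: assumes the server stored PATH_INFO as bytes.
    Untrusted input may not be valid in APP_CHARSET, so undecodable
    bytes are replaced instead of raising.
    """
    raw = environ["PATH_INFO"]
    return raw.decode(APP_CHARSET, "replace")


environ = {"PATH_INFO": b"/caf\xc3\xa9"}  # bytes, exactly as received
print(get_path(environ))  # '/café'
```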
Graham's suggestion for URL encodings means that the URL encoding would
have to be passed to the WSGI server from outside (he proposed the
Apache config as an example). This means that the application behavior
will change based on the server configuration, causing even more confusion.
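A small illustration of that confusion, assuming two hypothetical server
configurations that differ only in the URL encoding they apply before
the application sees the path:

```python
# The same request bytes, as sent by the client.
raw = b"/caf\xc3\xa9"

# Hypothetical server A is configured for UTF-8,
# hypothetical server B for Latin-1.
seen_on_server_a = raw.decode("utf-8")    # '/café'
seen_on_server_b = raw.decode("latin-1")  # '/cafÃ©'

# The application receives different strings for the identical request.
print(seen_on_server_a == seen_on_server_b)  # False
```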
Let us ignore 2to3 and syntax problems for a minute. These are a lot
less complex than the actual encoding problems. Also, it is very, very
unlikely that applications will be able to go through 2to3 and continue
to work because there is just too much stuff that changes. b'' vs '' is
really the smallest issue we have with WSGI currently. The changed
behavior of the bytes object and a semi-unicode-aware standard library
are the biggest problems in my opinion.
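A few examples of what "changed behavior" means in practice. This is only
a sketch of some differences in Python 3, not a complete list:

```python
data = b"GET"

# Indexing a bytes object yields an int, not a one-character string.
first = data[0]        # 71
# Slicing is needed to get a length-one bytes object.
first_slice = data[:1]  # b'G'

# Iterating yields ints as well.
codes = list(data)     # [71, 69, 84]

# Bytes and unicode strings no longer mix implicitly.
try:
    data + " /index.html"
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(first, first_slice, codes, mixed_ok)
```

Code written for Python 2 that indexes or concatenates byte strings
tends to hit exactly these cases.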