[Python-Dev] urllib unicode handling
Kristján Valur Jónsson
kristjan at ccpgames.com
Wed May 7 14:00:51 CEST 2008
> -----Original Message-----
> From: python-dev-bounces+kristjan=ccpgames.com at python.org
> [mailto:python-dev-bounces+kristjan=ccpgames.com at python.org] On Behalf
> Of Jeroen Ruigrok van der Werven
> Sent: Wednesday, May 07, 2008 05:20
> To: Tom Pinckney
> Cc: python-dev at python.org
> Subject: Re: [Python-Dev] urllib unicode handling
>
> -On [20080507 04:06], Tom Pinckney (thomaspinckney3 at gmail.com) wrote:
> >While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and
> >Wikipedia seem to have embraced it (as have pretty much all other major web
> >sites). As with HTML, there is what the standard says and what the actual
> >browsers have to accept in order to work in the real world.
>
FYI, here is how we have patched urllib for use in EVE:
--- C:\p4\sdk\stackless25\Lib\urllib.py 2008-03-21 14:47:23.000000000 -0000
+++ C:\p4\eve\KALI\common\stdlib\urllib.py 2007-11-06 11:18:01.000000000 -0000
@@ -1158,12 +1158,29 @@
         except KeyError:
             res[i] = '%' + item
         except UnicodeDecodeError:
             res[i] = unichr(int(item[:2], 16)) + item[2:]
     return "".join(res)

+unquote_inner = unquote
+def unquote(s):
+    """CCP attempt at making sensible choices in unicode quoting / unquoting """
+    s = unquote_inner(s)
+    try:
+        u = s.decode("utf-8")
+        try:
+            s2 = s.decode("ascii")
+        except UnicodeDecodeError:
+            s = u #yes, s was definitely utf8, which isn't pure ascii
+        else:
+            if u != s:
+                s = u
+    except UnicodeDecodeError:
+        pass #can't have been utf8
+    return s
+
 def unquote_plus(s):
     """unquote('%7e/abc+def') -> '~/abc def'"""
     s = s.replace('+', ' ')
     return unquote(s)

 always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@@ -1201,12 +1218,20 @@
         for i in range(256):
             c = chr(i)
             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
         _safemaps[cachekey] = safe_map
     res = map(safe_map.__getitem__, s)
     return ''.join(res)
+
+quote_inner = quote
+def quote(s, safe = '/'):
+    """CCP addition, to try to sensibly support / circumvent issues with unicode in urls"""
+    try:
+        return quote_inner(s, safe)
+    except KeyError:
+        return quote_inner(s.encode("utf-8"), safe)

 def quote_plus(s, safe = ''):
     """Quote the query fragment of a URL; replacing ' ' with '+'"""
     if ' ' in s:
         s = quote(s, safe + ' ')
         return s.replace(' ', '+')
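
For anyone reading this on Python 3, here is a rough sketch of the same "try UTF-8 first" heuristic as the patch above. It is not part of the patch: `unquote_guess` is a hypothetical name, and the Latin-1 fallback is an assumption standing in for the patch's behaviour of leaving a non-UTF-8 byte string untouched. Python 3's own `urllib.parse.quote` already encodes text as UTF-8 by default, so only the unquoting side needs a helper.

```python
from urllib.parse import quote, unquote_to_bytes


def unquote_guess(s):
    """Percent-decode s, preferring UTF-8 (hypothetical helper, not in urllib)."""
    raw = unquote_to_bytes(s)
    try:
        # Most major sites percent-encode UTF-8, so try that first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Can't have been UTF-8. Assumption: map each byte to one code
        # point (Latin-1) instead of returning the raw bytes.
        return raw.decode("latin-1")


# Usage: both encodings of "café" decode to the same text.
utf8_form = unquote_guess("caf%C3%A9")
latin1_form = unquote_guess("caf%E9")
quoted = quote("caf\u00e9")  # 'caf%C3%A9'
```

Like the patch, this is a guess rather than a guarantee: a Latin-1 byte sequence that happens to be valid UTF-8 will be decoded as UTF-8.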