[Python-Dev] urllib unicode handling

Kristján Valur Jónsson kristjan at ccpgames.com
Wed May 7 14:00:51 CEST 2008


> -----Original Message-----
> From: python-dev-bounces+kristjan=ccpgames.com at python.org
> [mailto:python-dev-bounces+kristjan=ccpgames.com at python.org] On Behalf
> Of Jeroen Ruigrok van der Werven
> Sent: Wednesday, May 07, 2008 05:20
> To: Tom Pinckney
> Cc: python-dev at python.org
> Subject: Re: [Python-Dev] urllib unicode handling
>
> -On [20080507 04:06], Tom Pinckney (thomaspinckney3 at gmail.com) wrote:
> >While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and
> >Wikipedia seem to have embraced it (as have pretty much all other major web
> >sites). As with HTML, there is what the standard says and what the actual
> >browsers have to accept in order to work in the real world.
>

FYI, here is how we have patched urllib for use in EVE:

--- C:\p4\sdk\stackless25\Lib\urllib.py 2008-03-21 14:47:23.000000000 -0000
+++ C:\p4\eve\KALI\common\stdlib\urllib.py      2007-11-06 11:18:01.000000000 -0000
@@ -1158,12 +1158,29 @@
         except KeyError:
             res[i] = '%' + item
         except UnicodeDecodeError:
             res[i] = unichr(int(item[:2], 16)) + item[2:]
     return "".join(res)

+unquote_inner = unquote
+def unquote(s):
+    """CCP attempt at making sensible choices in unicode quoting / unquoting """
+    s = unquote_inner(s)
+    try:
+        u = s.decode("utf-8")
+        try:
+            s2 = s.decode("ascii")
+        except UnicodeDecodeError:
+            s = u #yes, s was definitely utf8, which isn't pure ascii
+        else:
+            if u != s:
+                s = u
+    except UnicodeDecodeError:
+        pass  #can't have been utf8
+    return s
+
 def unquote_plus(s):
     """unquote('%7e/abc+def') -> '~/abc def'"""
     s = s.replace('+', ' ')
     return unquote(s)

 always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@@ -1201,12 +1218,20 @@
         for i in range(256):
             c = chr(i)
             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
         _safemaps[cachekey] = safe_map
     res = map(safe_map.__getitem__, s)
     return ''.join(res)
+
+quote_inner = quote
+def quote(s, safe = '/'):
+    """CCP addition, to try to sensibly support / circumvent issues with unicode in urls"""
+    try:
+        return quote_inner(s, safe)
+    except KeyError:
+        return quote_inner(s.encode("utf-8"), safe)

 def quote_plus(s, safe = ''):
     """Quote the query fragment of a URL; replacing ' ' with '+'"""
     if ' ' in s:
         s = quote(s, safe + ' ')
         return s.replace(' ', '+')
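
For readers on Python 3, the same idea can be sketched roughly as below. This is only an illustration, not the patch above: the names unquote_guess_utf8 and quote_utf8 are hypothetical, and the Latin-1 fallback is an assumption chosen to mirror the one-byte-per-character behaviour of the existing UnicodeDecodeError branch in unquote.

```python
from urllib.parse import quote, unquote_to_bytes


def unquote_guess_utf8(s):
    """Percent-decode s and return text, preferring UTF-8.

    Hypothetical helper, not part of urllib. If the decoded bytes are
    not valid UTF-8, fall back to Latin-1 (an assumption) so that each
    byte maps to exactly one character.
    """
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Can't have been UTF-8; keep one code point per byte.
        return raw.decode("latin-1")


def quote_utf8(s, safe="/"):
    """Percent-encode s after encoding it to UTF-8 bytes.

    Hypothetical helper; note that safe is passed to quote(), not to
    str.encode().
    """
    return quote(s.encode("utf-8"), safe=safe)


if __name__ == "__main__":
    print(unquote_guess_utf8("Bj%C3%B6rk"))  # UTF-8 escapes
    print(unquote_guess_utf8("Bj%F6rk"))     # Latin-1 escape, not valid UTF-8
    print(quote_utf8("Björk/a b"))
```

Trying UTF-8 first and falling back only on a decode error matches what the wrapper above does: a byte sequence that happens to be valid UTF-8 is always treated as UTF-8, so the guess can be wrong for short Latin-1 strings.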

