URLs and ampersands
Duncan Booth
duncan.booth at invalid.invalid
Wed Aug 6 03:41:12 EDT 2008
Matthew Woodcraft <mattheww at chiark.greenend.org.uk> wrote:
> Gabriel Genellina wrote:
>> Steven D'Aprano wrote:
>
>>> I have searched for, but been unable to find, standard library
>>> functions that escapes or unescapes URLs. Are there any such
>>> functions?
>
>> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
>
> I don't see a cgi.unescape in the standard library.
>
> I don't think xml.sax.saxutils.unescape will be suitable for Steven's
> purpose, because it doesn't process numeric character references (which
> are both legal and seen in the wild in /href/ attributes).
>
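A quick way to see the limitation Matthew describes is the sketch below. It is written for Python 3, and the href value is a made-up example rather than anything from Steven's data.

```python
from xml.sax.saxutils import unescape

# Hypothetical href value mixing a named entity with numeric references.
href = '/search?a=1&amp;b=2&#38;c=3&#x26;d=4'

# saxutils.unescape only knows &lt;, &gt; and &amp; (plus any extras you pass in),
# so the numeric character references come back unchanged.
decoded = unescape(href)
print(decoded)
```

Only the &amp; is decoded; the &#38; and &#x26; references are left as they were.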
Here's the code I use. It handles decimal and hex character references as well
as all HTML named entities.
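
import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module-level table is not modified, then add
# &apos;, which is defined for XML but missing from the HTML 4 table.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: entity name.
# Names may contain digits (e.g. &sup2;, &frac12;), so allow them after
# the first letter.
EntityPattern = re.compile(
    r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z][a-zA-Z0-9]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown entity name: leave the text as it was.
        return match.group(0)

    # Byte strings are decoded first so the result is always unicode.
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)
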
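
For anyone on Python 3, the same approach might look roughly like the sketch below. It is a port under stated assumptions, not the code I actually run: htmlentitydefs is html.entities there, unichr is chr, and the names decode_entities, entities and entity_pattern are only illustrative. It also allows digits in entity names and leaves out-of-range numeric references untouched, which the code above does not do.

```python
import re
from html.entities import name2codepoint

# Work on a copy so the shared stdlib table is left untouched.
entities = dict(name2codepoint)
entities['apos'] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: entity name.
entity_pattern = re.compile(
    r'&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')

def decode_entities(s, encoding='utf-8'):
    """Replace character and entity references in s with the characters they name."""
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            if hexadecimal:
                return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the original text.
            return match.group(0)
        if name in entities:
            return chr(entities[name])
        # Unknown entity name: keep the original text.
        return match.group(0)

    # Bytes are decoded first so the result is always str.
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return entity_pattern.sub(unescape, s)

print(decode_entities('/search?q=a&amp;b=1&#38;c=2&#x26;d=3&bogus;'))
# prints /search?q=a&b=1&c=2&d=3&bogus;
```

If you only need the decoding and not the regex, Python 3.4 and later also provide html.unescape in the standard library.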
--
Duncan Booth http://kupuguy.blogspot.com