xhtml encoding question

Ulrich Eckhardt ulrich.eckhardt at dominolaser.com
Wed Feb 1 03:39:08 EST 2012


On 31.01.2012 19:09, Tim Arnold wrote:
> high_chars = {
>     0x2014:'—', # 'EM DASH',
>     0x2013:'–', # 'EN DASH',
>     0x0160:'Š',# 'LATIN CAPITAL LETTER S WITH CARON',
>     0x201d:'”', # 'RIGHT DOUBLE QUOTATION MARK',
>     0x201c:'“', # 'LEFT DOUBLE QUOTATION MARK',
>     0x2019:"’", # 'RIGHT SINGLE QUOTATION MARK',
>     0x2018:"‘", # 'LEFT SINGLE QUOTATION MARK',
>     0x2122:'™', # 'TRADE MARK SIGN',
>     0x00A9:'©', # 'COPYRIGHT SYMBOL',
> }

You could use Unicode string literals as the keys directly instead of 
numeric code points, which makes the table a bit more self-documenting 
and saves you the later call to ord():

high_chars = {
    u'\u2014': '—',
    u'\u2013': '–',
    ...
}

> for c in string:
>     if ord(c) in high_chars:
>         c = high_chars.get(ord(c))
>     s += c
> return s

Instead of checking whether there is a replacement and then looking up 
the replacement a second time, pass the character itself as the default 
to get(), which is returned whenever the key is missing:

   for c in string:
       s += high_chars.get(c, c)

Alternatively, if you find it clearer, you could check whether the 
return value of get() is None to find out if there is a replacement:

   for c in string:
       r = high_chars.get(c)
       if r is None:
           s += c
       else:
           s += r
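
Putting both suggestions together, a minimal self-contained sketch might 
look like the following. The function name replace_high_chars and the 
entity strings used as replacement values are assumptions made for 
illustration and are not taken from the original code. It also builds the 
result with join() rather than repeated +=, which is a stylistic choice 
and not something the loop above requires:

```python
# Hypothetical replacement table: keys are the characters themselves,
# values are assumed entity strings chosen only for this example.
high_chars = {
    u'\u2014': u'&mdash;',  # EM DASH
    u'\u2013': u'&ndash;',  # EN DASH
}

def replace_high_chars(text):
    """Return text with every character found in high_chars replaced."""
    # get(c, c) yields the replacement, or c itself if there is none.
    return u''.join(high_chars.get(c, c) for c in text)

print(replace_high_chars(u'a\u2014b \u2013 c'))
```

Characters that are not in the table pass through unchanged, so plain 
ASCII input comes back as it went in.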


Uli



