xhtml encoding question
Peter Otten
__peter__ at web.de
Wed Feb 1 04:32:52 EST 2012
Ulrich Eckhardt wrote:
> Am 31.01.2012 19:09, schrieb Tim Arnold:
>> high_chars = {
>> 0x2014:'—', # 'EM DASH',
>> 0x2013:'–', # 'EN DASH',
>> 0x0160:'Š',# 'LATIN CAPITAL LETTER S WITH CARON',
>> 0x201d:'”', # 'RIGHT DOUBLE QUOTATION MARK',
>> 0x201c:'“', # 'LEFT DOUBLE QUOTATION MARK',
>> 0x2019:"’", # 'RIGHT SINGLE QUOTATION MARK',
>> 0x2018:"‘", # 'LEFT SINGLE QUOTATION MARK',
>> 0x2122:'™', # 'TRADE MARK SIGN',
>> 0x00A9:'©', # 'COPYRIGHT SYMBOL',
>> }
>
> You could use Unicode string literals directly instead of using the
> codepoint, making it a bit more self-documenting and saving you the
> later call to ord():
>
> high_chars = {
> u'\u2014': '—',
> u'\u2013': '–',
> ...
> }
>
>> for c in string:
>> if ord(c) in high_chars:
>> c = high_chars.get(ord(c))
>> s += c
>> return s
>
> Instead of checking if there is a replacement and then looking up the
> replacement again, just use the default:
>
> for c in string:
> s += high_chars.get(c, c)
>
> Alternatively, if you find that clearer, you could also check if the
> returnvalue of get() is None to find out if there is a replacement:
>
> for c in string:
> r = high_chars.get(c)
> if r is None:
> s += c
> else:
> s += r
It doesn't matter for the OP (see Stefan Behnel's post), but If you want to
replace characters in a unicode string the best way is probably the
translate() method:
>>> print u"\xa9\u2122"
©™
>>> u"\xa9\u2122".translate({0xa9: u"©", 0x2122: u"™"})
u'©™'
More information about the Python-list
mailing list