xhtml encoding question

Tim Arnold Tim.Arnold at sas.com
Tue Jan 31 13:09:53 EST 2012


I have to follow a specification for producing XHTML files.
The original files are in cp1252 encoding and I must re-encode them as UTF-8.
Also, I have to replace certain characters with HTML entities.
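
To make the requirement concrete, here is a small sketch (Python 3 syntax, standard library only) of what the two steps mean for a single character. The sample byte and the names `raw`, `text`, and `HIGH_CHARS` are made up for illustration; they are not taken from the real files.

```python
# Illustrative only: one cp1252 byte, decoded and then handled two ways.
raw = b"\x97"                     # EM DASH as stored in a cp1252 file
text = raw.decode("cp1252")       # one Unicode character, U+2014
as_utf8 = text.encode("utf-8")    # plain re-encoding: three UTF-8 bytes

# Hypothetical one-entry table mapping a code point to its entity.
HIGH_CHARS = {0x2014: "&mdash;"}
as_entity = text.translate(HIGH_CHARS).encode("utf-8")  # pure ASCII bytes
```

Characters in the table end up as ASCII entity references; everything else is simply re-encoded.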

I think I've got this right, but I'd like to hear if there's something 
I'm doing that is dangerous or wrong.

Please see the appended code, and thanks for any comments or suggestions.

I have two functions, translate (replaces the listed non-ASCII characters 
with entities) and reencode (um, reencodes):
---------------------------------
import codecs, StringIO
from lxml import etree
high_chars = {
    0x2014: '&mdash;',   # 'EM DASH',
    0x2013: '&ndash;',   # 'EN DASH',
    0x0160: '&Scaron;',  # 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d: '&rdquo;',   # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c: '&ldquo;',   # 'LEFT DOUBLE QUOTATION MARK',
    0x2019: '&rsquo;',   # 'RIGHT SINGLE QUOTATION MARK',
    0x2018: '&lsquo;',   # 'LEFT SINGLE QUOTATION MARK',
    0x2122: '&trade;',   # 'TRADE MARK SIGN',
    0x00A9: '&copy;',    # 'COPYRIGHT SIGN',
    }
def translate(string):
    # Swap each character listed in high_chars for its replacement;
    # all other characters pass through unchanged.
    return ''.join(high_chars.get(ord(c), c) for c in string)

def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)

if __name__ == '__main__':
    fname = 'mytest.htm'
    reencode(fname)
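
For comparison, here is a minimal sketch of the same job without lxml, written for Python 3 and the standard library only. It assumes the entity replacement should be the last step before writing, so that nothing parses the entities back into characters. The names `ENTITY_TABLE`, `to_entities`, and `reencode_plain` are made up for this sketch, and it does no HTML parsing or pretty-printing at all.

```python
import io

# Same code points as high_chars above, mapped to named entities.
ENTITY_TABLE = {
    0x2014: "&mdash;",   # EM DASH
    0x2013: "&ndash;",   # EN DASH
    0x0160: "&Scaron;",  # LATIN CAPITAL LETTER S WITH CARON
    0x201D: "&rdquo;",   # RIGHT DOUBLE QUOTATION MARK
    0x201C: "&ldquo;",   # LEFT DOUBLE QUOTATION MARK
    0x2019: "&rsquo;",   # RIGHT SINGLE QUOTATION MARK
    0x2018: "&lsquo;",   # LEFT SINGLE QUOTATION MARK
    0x2122: "&trade;",   # TRADE MARK SIGN
    0x00A9: "&copy;",    # COPYRIGHT SIGN
}


def to_entities(text):
    """Replace each mapped character with its entity; leave the rest alone."""
    return text.translate(ENTITY_TABLE)


def reencode_plain(filename, in_encoding="cp1252", out_encoding="utf-8"):
    """Decode the file, substitute entities, and rewrite it in place."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(filename, encoding=in_encoding, newline="") as f:
        text = f.read()
    with io.open(filename, "w", encoding=out_encoding, newline="") as f:
        f.write(to_entities(text))
```

If the files carry a charset declaration, this sketch leaves it untouched, so that would still need updating separately.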


