
I have to follow a specification for producing xhtml files. The original files are in cp1252 encoding and I must reencode them to utf-8. Also, I have to replace certain characters with html entities.

I think I've got this right, but I'd like to hear if there's something I'm doing that is dangerous or wrong. Please see the appended code, and thanks for any comments.

I have two functions, translate (replaces high characters with entities) and reencode (um, reencodes):

---------------------------------
    import codecs, StringIO
    from lxml import etree

    # Maps Unicode code points to the html entity that replaces them.
    high_chars = {
        0x2014: '&mdash;',   # 'EM DASH'
        0x2013: '&ndash;',   # 'EN DASH'
        0x0160: '&Scaron;',  # 'LATIN CAPITAL LETTER S WITH CARON'
        0x201d: '&rdquo;',   # 'RIGHT DOUBLE QUOTATION MARK'
        0x201c: '&ldquo;',   # 'LEFT DOUBLE QUOTATION MARK'
        0x2019: '&rsquo;',   # 'RIGHT SINGLE QUOTATION MARK'
        0x2018: '&lsquo;',   # 'LEFT SINGLE QUOTATION MARK'
        0x2122: '&trade;',   # 'TRADE MARK SIGN'
        0x00A9: '&copy;',    # 'COPYRIGHT SIGN'
    }

    def translate(string):
        s = ''
        for c in string:
            if ord(c) in high_chars:
                c = high_chars.get(ord(c))
            s += c
        return s

    def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
        with codecs.open(filename, encoding=in_encoding) as f:
            s = f.read()
        sio = StringIO.StringIO(translate(s))
        parser = etree.HTMLParser(encoding=in_encoding)
        tree = etree.parse(sio, parser)
        result = etree.tostring(tree.getroot(), method='html',
                                pretty_print=True,
                                encoding=out_encoding)
        with open(filename, 'wb') as f:
            f.write(result)

    if __name__ == '__main__':
        fname = 'mytest.htm'
        reencode(fname)
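For comparison, the decode / replace / encode steps described above could be sketched without the character-by-character loop. This is only a minimal sketch under some assumptions: it targets Python 3 (the posted code is Python 2), it works on bytes in memory rather than on a file, and it skips the lxml parsing step entirely. The names `HIGH_CHARS`, `translate_entities` and `reencode_bytes` are hypothetical and chosen for illustration.

```python
# Assumed mapping: the same nine code points as in the post, with the
# named html entities they are meant to become.
HIGH_CHARS = {
    0x2014: '&mdash;',
    0x2013: '&ndash;',
    0x0160: '&Scaron;',
    0x201d: '&rdquo;',
    0x201c: '&ldquo;',
    0x2019: '&rsquo;',
    0x2018: '&lsquo;',
    0x2122: '&trade;',
    0x00A9: '&copy;',
}


def translate_entities(text):
    """Replace each mapped character with its entity.

    str.translate accepts a dict keyed by code point, so no explicit
    loop or repeated string concatenation is needed.
    """
    return text.translate(HIGH_CHARS)


def reencode_bytes(raw, in_encoding='cp1252', out_encoding='utf-8'):
    """Decode raw bytes, substitute entities, and encode the result."""
    return translate_entities(raw.decode(in_encoding)).encode(out_encoding)


if __name__ == '__main__':
    # 0x93/0x94 are curly double quotes and 0x97 is an em dash in cp1252.
    sample = b'\x93Hi\x94 \x97 caf\xe9 \xa9'
    print(reencode_bytes(sample))
```

Characters that are not in the mapping (the e-acute in the sample, for instance) are left alone by the replacement step and simply come out as utf-8 bytes.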
participants (1)
-
Tim Arnold