xhtml encoding question

Wed Feb 1 13:15:09 EST 2012

On 2/1/2012 3:26 AM, Stefan Behnel wrote:
> Tim Arnold, 31.01.2012 19:09:
>> I have to follow a specification for producing xhtml files.
>> The original files are in cp1252 encoding and I must reencode them to utf-8.
>> Also, I have to replace certain characters with html entities.
>> ---------------------------------
>> import codecs, StringIO
>> from lxml import etree
>> high_chars = {
>>     0x2014:'&mdash;', # 'EM DASH',
>>     0x2013:'&ndash;', # 'EN DASH',
>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>     0x00A9:'&copy;',  # 'COPYRIGHT SIGN',
>>     }
>> def translate(string):
>>     s = ''
>>     for c in string:
>>         if ord(c) in high_chars:
>>             c = high_chars.get(ord(c))
>>         s += c
>>     return s
>
> I hope you are aware that this is about the slowest possible algorithm
> (well, the slowest one that doesn't do anything unnecessary). Since none of
> this is required when parsing or generating XHTML, I assume your spec tells
> you that you should do these replacements?
>
I wasn't aware of it, but I am now; the code is embarrassing in hindsight.
The spec I must follow forces me to do the translation.

I am actually working with HTML, not XHTML, which makes a huge
difference. Sorry for the confusion.

Ulrich's line of code for translate is elegant (using ord(c) as the key,
since high_chars is keyed by code point):
for c in string:
     s += high_chars.get(ord(c), c)
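
For what it's worth, the per-character loop can be avoided altogether.
The following is only a minimal sketch, assuming the text has already
been decoded to a unicode string. The names HIGH_CHARS, translate_fast
and translate_join are made up for illustration and are not from the
thread. The first function relies on the translate() method of unicode
strings, which accepts a dict keyed by code point; the second keeps the
explicit lookup but joins once instead of concatenating inside the loop.

```python
# Sketch only: same code points as high_chars, with entity names spelled out.
HIGH_CHARS = {
    0x2014: u'&mdash;',   # EM DASH
    0x2013: u'&ndash;',   # EN DASH
    0x0160: u'&Scaron;',  # LATIN CAPITAL LETTER S WITH CARON
    0x201D: u'&rdquo;',   # RIGHT DOUBLE QUOTATION MARK
    0x201C: u'&ldquo;',   # LEFT DOUBLE QUOTATION MARK
    0x2019: u'&rsquo;',   # RIGHT SINGLE QUOTATION MARK
    0x2018: u'&lsquo;',   # LEFT SINGLE QUOTATION MARK
    0x2122: u'&trade;',   # TRADE MARK SIGN
    0x00A9: u'&copy;',    # COPYRIGHT SIGN
}


def translate_fast(text):
    """Replace each mapped character with its entity in one call."""
    # The translate() method of a unicode string takes a dict keyed by
    # code point and may map a character to a multi-character string.
    return text.translate(HIGH_CHARS)


def translate_join(text):
    """Same result with an explicit lookup, joined once at the end."""
    return u''.join(HIGH_CHARS.get(ord(c), c) for c in text)
```

Both functions leave unmapped characters alone, so they can be run over
a whole document in one call.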

>
>> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>>     with codecs.open(filename,encoding=in_encoding) as f:
>>         s = f.read()
>>     sio = StringIO.StringIO(translate(s))
>>     parser = etree.HTMLParser(encoding=in_encoding)
>>     tree = etree.parse(sio, parser)
>
> Yes, you are doing something dangerous and wrong here. For one, you are
> decoding the data twice. Then, didn't you say XHTML? Why do you use the
> HTML parser to parse XML?
>
I see now that I'm decoding twice, thanks.

Also, I now see that when lxml writes the result back out, the entities I
got from my translate function are resolved back into characters, which
defeats the whole purpose.
>
>>     result = etree.tostring(tree.getroot(), method='html',
>>                             pretty_print=True,
>>                             encoding=out_encoding)
>>     with open(filename,'wb') as f:
>>         f.write(result)
>
> Use tree.write(f, ...)

From all the info I've received on this thread, plus some
additional reading, I think I need the following code.

Use the HTMLParser because the source files are actually HTML, and pass
the output of etree.tostring() to translate() as the very last step.

def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
     parser = etree.HTMLParser(encoding=in_encoding)
     tree = etree.parse(filename, parser)
     result = etree.tostring(tree.getroot(), method='html',
                             pretty_print=True,
                             encoding=out_encoding)
     # tostring() returns encoded bytes; translate() needs decoded text.
     text = translate(result.decode(out_encoding))
     with open(filename, 'wb') as f:
         f.write(text.encode(out_encoding))

I'm not simply using tree.write(f, ...) because the translation has to be
the last step; otherwise lxml resolves the entities back into characters.
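
One detail to watch in that last step: with encoding='utf-8',
etree.tostring() returns encoded bytes, while the keys of high_chars are
Unicode code points, so the replacement has to run on decoded text.
Below is a minimal sketch of that order (decode, translate, encode)
using only the standard library. The names ENTITIES and
entities_then_encode and the sample markup are made up for illustration;
the sample bytes merely stand in for what lxml would return.

```python
# Hypothetical subset of the mapping, keyed by code point.
ENTITIES = {
    0x2014: u'&mdash;',
    0x2122: u'&trade;',
    0x00A9: u'&copy;',
}


def entities_then_encode(serialized, encoding='utf-8'):
    """Decode serialized bytes, swap in entities, and encode again."""
    text = serialized.decode(encoding)   # bytes -> text
    text = text.translate(ENTITIES)      # lookups happen per code point
    return text.encode(encoding)         # text -> bytes, ready to write


# Stand-in for serializer output: UTF-8 bytes with non-ASCII characters.
serialized = u'<p>Caf\u00e9 \u00a9 Acme\u2122</p>'.encode('utf-8')
output = entities_then_encode(serialized)
```

The e-acute in the sample is encoded as the bytes C3 A9. A byte-by-byte
lookup would mistake that A9 for the copyright sign, which is why the
decode has to come first.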

Again, it would be simpler if this was xhtml, but I misspoke 
(mis-wrote?) when I said xhtml; this is for html.

> Assuming you really meant XHTML and not HTML, I'd just drop your entire
> code and do this instead:
>
>    tree = etree.parse(in_path)
>    tree.write(out_path, encoding='utf8', pretty_print=True)
>
> Note that I didn't provide an input encoding. XML is safe in that regard.
>
> Stefan
>

thanks everyone for the help.

--Tim Arnold