xhtml encoding question

Thu Feb 2 02:02:06 EST 2012

Tim Arnold, 01.02.2012 19:15:
> On 2/1/2012 3:26 AM, Stefan Behnel wrote:
>> Tim Arnold, 31.01.2012 19:09:
>>> I have to follow a specification for producing xhtml files.
>>> The original files are in cp1252 encoding and I must reencode them to
>>> utf-8.
>>> Also, I have to replace certain characters with html entities.
>>> ---------------------------------
>>> import codecs, StringIO
>>> from lxml import etree
>>> high_chars = {
>>>     0x2014:'&mdash;', # 'EM DASH',
>>>     0x2013:'&ndash;', # 'EN DASH',
>>>     0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>>>     0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>>>     0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>>>     0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>>>     0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>>>     0x2122:'&trade;', # 'TRADE MARK SIGN',
>>>     0x00A9:'&copy;',  # 'COPYRIGHT SIGN',
>>>     }
>>> def translate(string):
>>>     s = ''
>>>     for c in string:
>>>         if ord(c) in high_chars:
>>>             c = high_chars.get(ord(c))
>>>         s += c
>>>     return s
>>
>> I hope you are aware that this is about the slowest possible algorithm
>> (well, the slowest one that doesn't do anything unnecessary). Since none of
>> this is required when parsing or generating XHTML, I assume your spec tells
>> you that you should do these replacements?
>
> I wasn't aware of it, but I am now--code's embarrassing now.
> The spec I must follow forces me to do the translation.
> 
> I am actually working with html not xhtml; which makes a huge difference,

We all learn.

> Ulrich's line of code for translate is elegant.
> for c in string:
>     s += high_chars.get(c,c)

Still not efficient because it builds the string one character at a time
and needs to reallocate (and potentially copy) the string buffer quite
frequently in order to do that. You are lucky with CPython, because it has
an internal optimisation that mitigates this overhead on some platforms.
Other Python implementations don't have that, and even the optimisation in
CPython is platform specific (works well on Linux, for example).

Peter Otten presented a better way of doing it.
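
For illustration, here is a minimal sketch of one way to avoid the
per-character concatenation. It is not necessarily what Peter posted. It
assumes Python 3, where str.translate() accepts a dict that maps code points
to replacement strings (in Python 2 the same works on unicode objects). The
table is a subset of yours, and the entity names and the function
translate_join() are my own assumptions.

```python
# Map Unicode code points to the HTML entities the spec asks for.
# (Subset of the table from the original post; entity names are assumed.)
HIGH_CHARS = {
    0x2014: '&mdash;',
    0x2013: '&ndash;',
    0x2019: '&rsquo;',
    0x00A9: '&copy;',
}

def translate(text):
    """Replace mapped characters in one pass, without repeated += on a string."""
    return text.translate(HIGH_CHARS)

def translate_join(text):
    """Equivalent alternative: collect the pieces and join them once."""
    return ''.join(HIGH_CHARS.get(ord(c), c) for c in text)

print(translate('Tim\u2019s draft \u2014 \u00a9 2012'))
```

Both variants operate on text, not on encoded bytes, which matters for the
encoding issue discussed further down.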

> From all the info I've received on this thread, plus some additional
> reading, I think I need the following code.
> 
> Use the HTMLParser because the source files are actually HTML, and use
> output from etree.tostring() as input to translate() as the very last step.
> 
> def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
>     parser = etree.HTMLParser(encoding=in_encoding)
>     tree = etree.parse(filename, parser)
>     result = etree.tostring(tree.getroot(), method='html',
>                             pretty_print=True,
>                             encoding=out_encoding)
>     with open(filename, 'wb') as f:
>         f.write(translate(result))
> 
> not simply tree.write(f...) because I have to do the translation at the
> end, so I get the entities instead of the resolved entities from lxml.

Yes, that's better.

Still one thing (since you didn't show us your final translate() function):
you do the character escaping on a UTF-8 encoded string and made the
encoding configurable. That means that the characters you are looking for
must also be encoded with the same encoding in order to find matches.
However, if you ever choose a different target encoding that doesn't have
the nice properties of UTF-8's byte sequences, you may end up with
ambiguous byte sequences in the output that your translate() function
accidentally matches on, thus potentially corrupting your data.
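
To make that risk concrete, here is a small sketch (Python 3, my own example
rather than anything taken from your code) that uses UTF-16-LE as a
hypothetical target encoding. The text contains no em dash, yet the bytes of
an em dash appear across the boundary between two other characters.

```python
text = '\u1400 '                        # CANADIAN SYLLABICS HYPHEN + space, no em dash
needle = '\u2014'.encode('utf-16-le')   # bytes of EM DASH: 0x14 0x20
data = text.encode('utf-16-le')         # bytes: 0x00 0x14 0x20 0x00

# A byte-level search finds a "match" that straddles two characters.
false_match = needle in data

# UTF-8 is self-synchronising, so the same search cannot misfire there.
utf8_match = '\u2014'.encode('utf-8') in text.encode('utf-8')

print(false_match, utf8_match)          # prints: True False
```

A byte-level replace on the UTF-16-LE data would rewrite the middle two bytes
and destroy both characters.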

Assuming that you are using Python 2, you may even be accidentally doing
the replacement using Unicode character strings, which then only happens to
work on systems that use UTF-8 as their default encoding. Python 3 has
fixed this trap, but you have to take care to avoid it in Python 2.

I'd prefer serialising the documents into a unicode string
(encoding='unicode'), then post-processing that and finally encoding it to
the target encoding when writing it out. But you'll have to see how that
works out together with your escaping step, and also how it impacts the
HTML <meta> tag that states the document encoding.
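
A rough sketch of that ordering, assuming Python 3. The function name
write_translated() and the two-entry table are hypothetical, and the text is
passed in as a parameter so that the example runs without lxml. In your code
it would be the result of etree.tostring(..., encoding='unicode').

```python
import io

HIGH_CHARS = {0x2014: '&mdash;', 0x2019: '&rsquo;'}   # subset, for illustration

def write_translated(text, stream, out_encoding='utf-8'):
    """Escape on the text (str) level, then encode exactly once on output.

    `text` is assumed to be the already serialised document as a str.
    `stream` is assumed to be a file-like object opened in binary mode.
    """
    escaped = text.translate(HIGH_CHARS)
    stream.write(escaped.encode(out_encoding))

buf = io.BytesIO()
write_translated('<p>Tim\u2019s \u2014 test</p>', buf, out_encoding='utf-16-le')
```

Because the replacement happens before encoding, it does not depend on the
target encoding at all. This sketch does nothing about the <meta> tag, which
still has to agree with out_encoding.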

Stefan