
Hi, please keep the list involved. Thanks.

qhlonline wrote:
Since I am processing Chinese web pages, I often encounter pages that are not well-formed. When a page declares <meta charset="GB2312">, decoding the HTML string with a GB2312 decoder can fail, because the content is in fact encoded as GBK or GB18030. But the lxml parser processes it according to the meta declaration. I have read the libxml2 source: there is no encoding string matching "GB2312", "GBK" or "GB18030", only other encodings such as XML_CHAR_ENCODING_2022_JP, which may be a superset of GB2312. I just don't know where, in libxml2 or in lxml, the GB2312 encoding declared in the <meta> tag is mapped to one of the encodings supported by libxml2's xmlCharEncoding enum, and how.
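The mismatch described above is easy to reproduce in plain Python, with no lxml involved: a character such as U+9555 '镕' exists in GBK and GB18030 but not in GB2312, so a strict GB2312 decoder rejects bytes that a mislabelled page may well contain. A minimal sketch:

```python
# '镕' (as in 朱镕基) is in GBK/GB18030 but not in GB2312.
text = "朱镕基"
raw = text.encode("gb18030")  # same bytes GBK would produce for these characters

# Strict GB2312 decoding fails on the GBK-only character ...
try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False
print(gb2312_ok)  # False

# ... while GB18030, being a superset, round-trips it fine.
print(raw.decode("gb18030"))  # 朱镕基
```

This is exactly the failure mode of a page that declares charset=GB2312 but actually contains GBK or GB18030 bytes.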
Most encodings are handled by libiconv. libxml2 only handles the 'normal' (or most common) XML encodings in its own code.
I think that, since GB18030 is a superset of GB2312, if we changed the lxml source so that every page declaring <meta charset="GB2312"> were decoded with the GB18030 codec, there would be no error even when the HTML file is mislabelled.
No need to play with lxml's sources here. What should work is to read the HTML page into a byte string, decode it manually into a unicode string, and then let lxml parse that. That way, you also have full control over the decoding and can handle any decoding errors yourself.

Stefan
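The decode-first workflow suggested above can be sketched as follows. This is only an illustration, not lxml's own handling: the helper function name is made up, and it assumes GB18030 is an acceptable superset decoding for pages mislabelled as GB2312.

```python
from lxml import html

def parse_mislabelled_chinese_html(raw: bytes):
    # Decode the bytes ourselves with GB18030 (a superset of GBK and
    # GB2312), so the page's possibly wrong <meta> charset is never used.
    text = raw.decode("gb18030", errors="replace")
    # Hand lxml an already-decoded unicode string to parse.
    return html.fromstring(text)

# A page that claims GB2312 but contains a GBK-only character ('镕'):
page = ('<html><head><meta charset="GB2312"></head>'
        '<body><p>朱镕基</p></body></html>').encode("gb18030")
doc = parse_mislabelled_chinese_html(page)
print(doc.findtext(".//p"))  # 朱镕基
```

Using errors="replace" (or "ignore") additionally makes the decode step robust against stray invalid bytes, which is the kind of control over decoding errors mentioned above.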