
Hi, please keep the list involved. Thanks.

qhlonline wrote:
Since I am processing Chinese web pages, I often encounter pages that are not well-formed. When a page declares <meta charset="GB2312">, decoding the HTML string with a GB2312 decoder can fail, because the content is in fact encoded as GBK or GB18030. But the lxml parser processes it according to the meta declaration. I have read the libxml2 source: there is no encoding string matching "GB2312", "GBK" or "GB18030", only other encodings such as XML_CHAR_ENCODING_2022_JP, which may be a superset of GB2312. I just don't know where, in libxml2 or in lxml, the GB2312 encoding declared in the <meta> tag is mapped to one of the encodings supported by libxml2's xmlCharEncoding enum, and how.
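The mismatch described above is easy to reproduce in plain Python, with no lxml involved: a character such as U+9555 '镕' exists in GBK and GB18030 but not in GB2312, so a strict GB2312 decoder rejects bytes that a mislabelled page may well contain. A minimal sketch:

```python
# '镕' (as in 朱镕基) is in GBK/GB18030 but not in GB2312.
text = "朱镕基"
raw = text.encode("gb18030")  # same bytes GBK would produce for these characters

# Strict GB2312 decoding fails on the GBK-only character ...
try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False
print(gb2312_ok)  # False

# ... while GB18030, being a superset, round-trips it fine.
print(raw.decode("gb18030"))  # 朱镕基
```

This is exactly the failure mode of a page that declares charset=GB2312 but actually contains GBK or GB18030 bytes.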
Most encodings are handled by libiconv. libxml2 only handles the 'normal' (or most common) XML encodings in its own code.
I think that, since GB18030 is a superset of GB2312, if we changed the lxml source so that every page declaring <meta charset="GB2312"> were decoded with the GB18030 codec, there would be no error even when the HTML file is mislabelled.
No need to play with lxml's sources here. What should work is to read the HTML page into a byte string, decode it manually into a unicode string, and then let lxml parse that. That way, you also have full control over the decoding and can handle any decoding errors yourself.

Stefan
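The decode-first workflow suggested above can be sketched as follows. This is only an illustration, not lxml's own handling: the helper function name is made up, and it assumes GB18030 is an acceptable superset decoding for pages mislabelled as GB2312.

```python
from lxml import html

def parse_mislabelled_chinese_html(raw: bytes):
    # Decode the bytes ourselves with GB18030 (a superset of GBK and
    # GB2312), so the page's possibly wrong <meta> charset is never used.
    text = raw.decode("gb18030", errors="replace")
    # Hand lxml an already-decoded unicode string to parse.
    return html.fromstring(text)

# A page that claims GB2312 but contains a GBK-only character ('镕'):
page = ('<html><head><meta charset="GB2312"></head>'
        '<body><p>朱镕基</p></body></html>').encode("gb18030")
doc = parse_mislabelled_chinese_html(page)
print(doc.findtext(".//p"))  # 朱镕基
```

Using errors="replace" (or "ignore") additionally makes the decode step robust against stray invalid bytes, which is the kind of control over decoding errors mentioned above.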