Hi,
please keep the list involved. Thanks.
qhlonline wrote:
> I am processing Chinese web pages, and some of them are not well-formed.
> When a page declares <meta encoding="GB2312">, we cannot decode the HTML
> string with a GB2312 decoder; the content is in fact encoded as GBK or
> GB18030. But the lxml parser processes it according to the meta
> declaration. I have read the libxml2 source: there is no encoding string
> matching "GB2312", "GBK" or "GB18030", only other encodings such as
> ENCODING_2022_JP, which may be a superset of GB2312. I just don't know
> where in libxml2 or in lxml the <meta>-declared GB2312 encoding is
> converted to some other encoding that is apparently supported by
> libxml2's xmlCharEncoding enum, and how?
Most encodings are handled by libiconv. libxml2 only handles the 'normal'
(or most common) XML encodings in its own code.
> I think that since GB18030 is a superset of GB2312, if we change the lxml
> source so that everything declared as <meta charset='GB2312'> is decoded
> with the GB18030 codec, there will be no error even when the HTML file is
> not well-formed.
No need to play with lxml's sources here. What should work is to read the
HTML page into a byte string, decode it manually into a unicode string, and
then let lxml parse that. That way, you also have full control over the
decoding and can handle any decoding errors yourself.
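For illustration, here is a minimal sketch of that approach (Python 3; the
URL and the choice of GB18030 as the fallback codec are assumptions, not
part of lxml itself):

    # Sketch: decode the page bytes manually, then hand a unicode string to lxml.
    import urllib.request
    from lxml import html

    url = "http://example.com/some-chinese-page.html"  # hypothetical URL
    raw_bytes = urllib.request.urlopen(url).read()

    # GB18030 is a superset of GB2312/GBK, so it also decodes pages whose
    # real encoding is broader than the declared one; errors="replace"
    # keeps stray bytes from aborting the decode.
    text = raw_bytes.decode("gb18030", errors="replace")

    # Since the input is already a decoded string, libxml2 no longer applies
    # the <meta> charset declaration to the raw bytes, so it cannot mis-decode.
    tree = html.fromstring(text)
    print(tree.findtext(".//title"))

This keeps all decoding decisions (which codec, how to handle bad bytes) in
your own code instead of relying on the document's own declaration.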
Stefan