Some questions about decode/encode

John Machin sjmachin at lexicon.net
Sun Jan 27 06:04:02 EST 2008


On Jan 27, 9:18 pm, glacier <rong.x... at gmail.com> wrote:
> On Jan 24, 4:44 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>
> > On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > > My second question is: has anyone tested a very long mbcs decode?
> > > I tried to decode a long (20+ MB) XML file yesterday, and the result
> > > was very strange and caused SAX to fail to parse the decoded string.
>
> > That's because SAX wants bytes, not a decoded string.  Don't decode it
> > yourself.
>
> > > However, I use another text editor to convert the file to utf-8 and
> > > SAX will parse the content successfully.
>
> > Because now you feed SAX with bytes instead of a unicode string.
>
> > Ciao,
> >         Marc 'BlackJack' Rintsch
>
> Yep. I feed SAX with the unicode string since SAX didn't support my
> encoding system (GBK).

Let's go back to the beginning. What is "SAX"? Show us exactly what
command or code you used.

How did you let this SAX know that the file was encoded in GBK? An
argument to SAX? An encoding declaration in the first few lines of the
file? Some other method? ... precise answer please. Or did you expect
that this SAX would guess correctly what the encoding was without
being told?

What does "didn't support my encoding system" mean? Have you actually
tried pushing raw undecoded GBK at SAX using a suitable documented
method of telling SAX that the file is in fact encoded in GBK? If so,
what was the error message that you got?
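
For anyone following along, here is a minimal sketch of what "pushing
raw undecoded bytes at SAX" can look like. It assumes Python 3 and the
standard-library xml.sax module (the thread itself predates Python 3).
The names TextCollector and parse_raw are made up for illustration.
The sample uses UTF-8 because every XML parser has to accept it;
whether a particular parser also accepts a GBK label in the
declaration is exactly what needs to be tested and reported.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser hands over text that it has already decoded.
        self.chunks.append(content)


def parse_raw(data):
    """Parse undecoded bytes; the parser reads the encoding declaration."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


# The declaration on the first line is how the parser learns the encoding.
raw = '<?xml version="1.0" encoding="utf-8"?><doc>中文</doc>'.encode("utf-8")
text = parse_raw(raw)
```

If the same call is made with GBK-encoded bytes and encoding="GBK" in
the declaration, any exception it raises is the error message being
asked for above.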

How do you know that it's GBK, anyway? Have you considered these
possible scenarios:
(1) It's GBK but you are telling SAX that it's GB2312
(2) It's GB18030 but you are telling SAX it's GBK
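
One rough way to check those scenarios is to try a strict decode with
each candidate codec and see which ones raise. The sketch below assumes
Python 3; the helper name codecs_that_decode and the sample strings are
made up for illustration and are not taken from the file in question.

```python
# Candidate codecs, from the smallest repertoire to the largest.
CANDIDATES = ("gb2312", "gbk", "gb18030")


def codecs_that_decode(data):
    """Return the candidate codecs that decode `data` without error."""
    ok = []
    for name in CANDIDATES:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Illustrative samples: plain simplified text, a character that is in
# GBK but not in GB2312, and a character outside GBK altogether.
plain = "中文".encode("gb2312")
gbk_only = "镕".encode("gbk")
beyond_gbk = "\U0001F600".encode("gb18030")

for sample in (plain, gbk_only, beyond_gbk):
    print(codecs_that_decode(sample))
```

A successful strict decode does not prove that a codec is the right
one; it only means the data has not ruled it out. A failure, on the
other hand, does rule the codec out, and the exception reports the
byte offset where decoding broke.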

HTH,
John


