Some questions about decode/encode
Ben Finney
bignose+hates-spam at benfinney.id.au
Thu Jan 24 00:41:02 EST 2008
Ben Finney <bignose+hates-spam at benfinney.id.au> writes:
> glacier <rong.xian at gmail.com> writes:
>
> > I use Chinese characters as an example here.
> >
> > >>> s1 = '你好吗'
> > >>> repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>> b1 = s1.decode('GBK')
> >
> > My first question is: what strategy does 'decode' use to tell
> > where to separate the words? I mean, since s1 is a
> > multi-byte-character string, how did it decide whether to split
> > the string every 2 bytes or every 1 byte?
>
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.
To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.
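
As a rough illustration of the "algorithmic" part, here is a minimal
sketch using the standard library's incremental decoder. It assumes
the byte string from the quoted session and a Python version that
accepts the b'' literal; the names raw, pieces and mixed are only for
this example. Feeding the decoder one byte at a time suggests how a
GBK codec finds character boundaries: a lead byte on its own produces
nothing, and the byte after it completes the character, while an ASCII
byte is already a whole character.

```python
import codecs

# The GBK-encoded bytes of s1 from the quoted session.
raw = b'\xc4\xe3\xba\xc3\xc2\xf0'

# Decoding the whole byte string yields three characters from six bytes.
text = raw.decode('gbk')
print(len(raw), len(text))

# Feed the decoder one byte at a time to watch the boundaries appear.
decoder = codecs.getincrementaldecoder('gbk')()
pieces = [decoder.decode(raw[i:i + 1]) for i in range(len(raw))]

# Each lead byte yields an empty result; the next byte completes a character.
print([len(p) for p in pieces])

# An ASCII byte is a complete character by itself, so widths can be mixed
# within one string: here 4 bytes decode to 3 characters.
mixed = b'a' + raw[:2] + b'b'
print(len(mixed), len(mixed.decode('gbk')))
```

So the codec never "splits every 2 bytes"; it decides byte by byte,
from the value of each byte, whether a character is complete.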
--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney