Some questions about decode/encode

Ben Finney bignose+hates-spam at benfinney.id.au
Thu Jan 24 00:41:02 EST 2008


Ben Finney <bignose+hates-spam at benfinney.id.au> writes:

> glacier <rong.xian at gmail.com> writes:
> 
> > I use Chinese characters as an example here.
> > 
> > >>> s1 = '你好吗'
> > >>> repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>> b1 = s1.decode('GBK')
> > 
> > My first question is: what strategy does 'decode' use to decide
> > how to separate the characters? I mean, since s1 is a multi-byte
> > character string, how does it determine whether to split the
> > string every 2 bytes or every 1 byte?
> 
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.
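
As a rough illustration of what "algorithmic" means here, the sketch
below splits GBK bytes into per-character chunks by looking only at the
first byte of each character. It assumes Python 3 (where bytes and text
are separate types), and `split_gbk` is a hypothetical helper of my own,
not part of any library. It is deliberately simplified: it does no
validation and ignores the four-byte sequences of the later GB18030
encoding.

```python
def split_gbk(data):
    """Split GBK-encoded bytes into per-character chunks (simplified sketch)."""
    chunks = []
    i = 0
    while i < len(data):
        # A byte below 0x80 is a single-byte (ASCII) character; anything
        # else is treated as the lead byte of a two-byte sequence.
        # Assumption: the input is well-formed GBK, so no validation is done.
        width = 1 if data[i] < 0x80 else 2
        chunks.append(data[i:i + width])
        i += width
    return chunks


raw = b"\xc4\xe3\xba\xc3\xc2\xf0"   # the six bytes from the quoted example
mixed = b"ok" + raw                 # ASCII and two-byte characters together
chunks = split_gbk(mixed)
chars = [chunk.decode("gbk") for chunk in chunks]
print(chunks)
print(chars)
```

The real codec does the equivalent of this internally, plus the actual
lookup of each byte sequence and error handling for malformed input.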

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.
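
A minimal sketch of that point, again assuming Python 3: decoding and
re-encoding with the same codec gets back the identical bytes, and the
same characters become entirely different bytes under another codec.
Nothing in either direction knows about words.

```python
raw = b"\xc4\xe3\xba\xc3\xc2\xf0"    # GBK bytes from the quoted example

text = raw.decode("gbk")             # bytes -> characters
round_trip = text.encode("gbk")      # characters -> bytes, same codec
utf8 = text.encode("utf-8")          # same characters, different codec

print(len(raw), len(text))           # byte count versus character count
print(round_trip == raw)             # the mapping is exact in both directions
print(len(utf8))                     # UTF-8 uses a different number of bytes
```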

-- 
 \                 "He who laughs last, thinks slowest."  -- Anonymous |
  `\                                                                   |
_o__)                                                                  |
Ben Finney


