unicode, bytes redux
sjmachin at lexicon.net
Mon Sep 25 10:17:47 CEST 2006
> (beating a dead horse)
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
Where it's been is irrelevant. Where it's going to is what matters.
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)
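The proposed `.bytes()` method can be sketched as a plain helper function (hypothetical; the name `byte_length` and its signature are not part of any unicode type), using modern Python 3 syntax for the byte literal:

```python
# Hypothetical stand-in for the proposed u.bytes() method: count the
# bytes a unicode string occupies in a given encoding by encoding it.
def byte_length(u, encoding="utf-8"):
    """Return the number of bytes u occupies in the given encoding."""
    return len(u.encode(encoding))

# U+270C VICTORY HAND, three bytes in UTF-8: 11100010 10011100 10001100
victory = b"\xE2\x9C\x8C".decode("utf-8")
print(byte_length(victory))  # 3
```

Note that this needs no remembered encoding: the byte count is a property of the (string, target encoding) pair, which is exactly the objection raised below.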
Suppose the unicode object was decoded using some encoding other than
the one that's going to be used to store the info in the database:
| >>> sg = '\xc9\xb5\xb9\xcf'
| >>> len(sg)
| 4
| >>> u = sg.decode('gb2312')
u.bytes() => 4
| >>> len(u.encode('utf8'))
| 6
And by the way, what about the memory overhead of storing the name of
the encoding on every unicode object (in the above case, 6 bytes for
'gb2312', plus object overhead)?
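The session above can be reproduced as a short Python 3 sketch (note that in Python 3 the source string is a `bytes` literal). It illustrates the objection: the byte count depends on the *target* encoding, so a remembered source encoding tells you nothing useful about storage size elsewhere.

```python
# Two Chinese characters encoded as 4 bytes of GB2312.
sg = b"\xc9\xb5\xb9\xcf"
u = sg.decode("gb2312")

print(len(sg))                 # 4 bytes in GB2312
print(len(u))                  # 2 code points
print(len(u.encode("utf-8")))  # 6 bytes in UTF-8
```

A hypothetical `u.bytes()` returning 4 (from the remembered 'gb2312') would be simply wrong for a database column stored in UTF-8.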
What would u"abcdef".bytes() produce? An exception?
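Even for a string built from an ASCII-only literal, with no decode step in its history, the answer still depends entirely on the encoding you pick:

```python
# No source encoding exists for a literal, yet the byte count is
# perfectly well defined once you name a target encoding.
s = "abcdef"
print(len(s.encode("utf-8")))      # 6
print(len(s.encode("utf-16-le")))  # 12
```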