unicode, bytes redux
John Machin
sjmachin at lexicon.net
Mon Sep 25 04:17:47 EDT 2006
willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
Where it's been is irrelevant. Where it's going to is what matters.
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
>
> u = buf.decode('UTF-8')
>
> # ... later ...
>
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)
Suppose the unicode object was decoded using some encoding other than
the one that's going to be used to store the info in the database:
| >>> sg = '\xc9\xb5\xb9\xcf'
| >>> len(sg)
| 4
| >>> u = sg.decode('gb2312')
later:
u.bytes() => 4
but
| >>> len(u.encode('utf8'))
| 6
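In modern Python 3 spelling (byte strings take a b prefix, and str is already Unicode), the session above can be reproduced as a runnable sketch; it makes the mismatch explicit:

```python
# Python 3 version of the gb2312 example above.
sg = b'\xc9\xb5\xb9\xcf'       # 4 bytes of GB2312-encoded text
u = sg.decode('gb2312')        # decodes to 2 characters
assert len(sg) == 4
assert len(u) == 2

# Re-encoding the same text in a different codec gives a
# different byte count, so "the" byte count of a Unicode
# string is meaningless without naming the target encoding:
assert len(u.encode('gb2312')) == 4
assert len(u.encode('utf-8')) == 6
```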
and, by the way, what about the memory overhead of storing the name of
the encoding (in the above case the 6 bytes of 'gb2312', plus object overhead)?
What would u"abcdef".bytes() produce? An exception?
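The byte count willie is after needs no stored state at all: once you know which encoding the destination uses, it is one method call away. A minimal sketch (the helper name `byte_len` is mine, not a proposed API):

```python
def byte_len(u, encoding):
    """Number of bytes the text u occupies in the given encoding."""
    return len(u.encode(encoding))

# U+270C is 3 bytes in UTF-8, as in willie's example.
assert byte_len('\u270c', 'utf-8') == 3

# And u"abcdef".bytes() need not be an exception: the answer
# simply depends on the encoding you ask about.
assert byte_len('abcdef', 'utf-8') == 6
assert byte_len('abcdef', 'utf-16-le') == 12
```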
HTH,
John