Most pythonic way to truncate unicode?
steve at REMOVE-THIS-cybersource.com.au
Fri May 29 02:49:37 CEST 2009
On Thu, 28 May 2009 08:50:00 -0700, Andrew Fong wrote:
> I need to ...
> 1) Truncate long unicode (UTF-8) strings based on their length in BYTES.
Out of curiosity, why do you need to do this?
> For example, u'\u4000\u4001\u4002 abc' has a length of 7 but takes up 13
No, that's wrong. The number of bytes depends on the encoding; it's not a
property of the unicode string itself.
>>> s = u'\u4000\u4001\u4002 abc'
>>> len(s) # characters
7
>>> len(s.encode('utf-8')) # bytes
13
>>> len(s.encode('utf-16')) # bytes
16
>>> len(s.encode('U32')) # bytes
32
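One detail worth keeping in mind when comparing these counts: as far as I know, the plain 'utf-16' and 'utf-32' codecs prepend a byte order mark (BOM) when encoding, and len() counts it. The sketch below compares them with the explicit little-endian variants, which write no BOM. The helper name byte_len is made up for illustration.

```python
# Sketch: byte counts for the same string under several encodings.
# byte_len is a hypothetical helper, not a standard library function.
s = u'\u4000\u4001\u4002 abc'

def byte_len(text, encoding):
    """Return the number of bytes text occupies once encoded."""
    return len(text.encode(encoding))

encodings = ('utf-8', 'utf-16', 'utf-16-le', 'utf-32', 'utf-32-le')
sizes = dict((enc, byte_len(s, enc)) for enc in encodings)

# The BOM accounts for the gap between each plain codec and its -le variant:
# 2 bytes for utf-16, 4 bytes for utf-32.
bom_utf16 = sizes['utf-16'] - sizes['utf-16-le']
bom_utf32 = sizes['utf-32'] - sizes['utf-32-le']
```

So the same seven characters occupy anywhere from 13 to 32 bytes depending on the encoding chosen.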
> Since u'\u4000' takes up 3 bytes
But it doesn't. The *encoded* unicode character *may* take up three
bytes, or four, or possibly more, depending on what encoding you use.
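If the real goal is to fit a byte budget in one particular encoding, one possible approach is to encode, slice the bytes, and decode again while ignoring an incomplete trailing character. This is only a sketch: truncate_to_bytes is a hypothetical helper, UTF-8 is assumed as the default, and it has been thought through for UTF-8 only.

```python
# Sketch: truncate a unicode string so its encoded form fits in max_bytes.
# truncate_to_bytes is a hypothetical name, not part of the standard library.
def truncate_to_bytes(text, max_bytes, encoding='utf-8'):
    """Return the longest prefix of text whose encoding fits in max_bytes."""
    encoded = text.encode(encoding)[:max_bytes]
    # Slicing may cut a multi-byte character in half; 'ignore' drops the
    # incomplete trailing bytes instead of raising UnicodeDecodeError.
    return encoded.decode(encoding, 'ignore')

s = u'\u4000\u4001\u4002 abc'
short = truncate_to_bytes(s, 7)    # 7 bytes holds two 3-byte characters
whole = truncate_to_bytes(s, 13)   # the full string is 13 bytes in UTF-8
```

Note that this works on code points, so it can still separate a base character from a combining mark that follows it.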