Most pythonic way to truncate unicode?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Thu May 28 20:49:37 EDT 2009
On Thu, 28 May 2009 08:50:00 -0700, Andrew Fong wrote:
> I need to ...
>
> 1) Truncate long unicode (UTF-8) strings based on their length in BYTES.
Out of curiosity, why do you need to do this?
> For example, u'\u4000\u4001\u4002 abc' has a length of 7 but takes up 13
> bytes.
No, that's wrong. The number of bytes depends on the encoding; it's not a
property of the unicode string itself.
>>> s = u'\u4000\u4001\u4002 abc'
>>> len(s) # characters
7
>>> len(s.encode('utf-8')) # bytes
13
>>> len(s.encode('utf-16')) # bytes
16
>>> len(s.encode('U32')) # bytes
32
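The utf-16 and U32 totals are larger than 2 and 4 bytes per character
because the generic 'utf-16' and 'utf-32' codecs prepend a byte-order mark
(BOM) when encoding. A small sketch of the breakdown, written for Python 3
and using the explicit little-endian codecs (which add no BOM) purely for
comparison; the name `sizes` is just for illustration:

```python
s = u'\u4000\u4001\u4002 abc'

# Compare each generic codec (which prepends a BOM) with its explicit
# little-endian variant (which does not). The difference is the BOM size.
sizes = {}
for generic, explicit in [('utf-16', 'utf-16-le'), ('utf-32', 'utf-32-le')]:
    with_bom = len(s.encode(generic))
    without_bom = len(s.encode(explicit))
    sizes[generic] = (with_bom, without_bom)
    print(generic, with_bom, without_bom)
```

So the 7 characters account for 14 and 28 bytes respectively, and the
remaining 2 and 4 bytes are the BOM.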
> Since u'\u4000' takes up 3 bytes
But it doesn't. The *encoded* unicode character *may* take up two bytes,
or three, or four, or possibly more, depending on what encoding you use
and whether that encoding adds a byte-order mark.
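Since the byte count depends on the encoding, any truncation by bytes has
to commit to one encoding first. Here is a minimal sketch, assuming the
target is UTF-8 and that dropping a partially cut character is acceptable.
The function name `truncate_utf8` and its parameters are hypothetical, not
something from the original question:

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return text
    # Slicing the bytes may cut a multi-byte character in half. Decoding
    # with errors='ignore' drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode('utf-8', 'ignore')


s = u'\u4000\u4001\u4002 abc'
print(repr(truncate_utf8(s, 8)))
```

With a limit of 8 bytes, only the first two characters (6 bytes) survive,
because the third would need bytes 7 through 9. Note that this works on
code points, so it can still separate a base character from a combining
mark that follows it.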
--
Steven