Most pythonic way to truncate unicode?

Andrew Fong FongAndrew at gmail.com
Thu May 28 11:50:00 EDT 2009


I need to ...

1) Truncate long unicode strings based on their length in BYTES when
encoded as UTF-8. For example, u'\u4000\u4001\u4002 abc' has a length
of 7 but takes up 13 bytes. Since u'\u4000' takes up 3 bytes, I want
truncate(u'\u4000\u4001\u4002 abc', 3) == u'\u4000' -- as compared to
u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.

2) I don't want to accidentally chop any unicode characters in half.
If the byte truncate length would normally cut a unicode character in
two, then I just want to drop the whole character, not leave an
orphaned byte. So truncate(u'\u4000\u4001\u4002 abc', 4) == u'\u4000'
... as opposed to getting UnicodeDecodeError.

I'm using Python 2.6, so I have access to things like bytearray. Is
there a built-in way to do something like this already? Or do I
just have to iterate over the unicode string?
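
For illustration, here is a minimal sketch of one way the behavior
described above could be obtained without iterating, assuming the
target encoding is always UTF-8: encode, slice the bytes, then decode
with the 'ignore' error handler so that an incomplete trailing sequence
is dropped instead of raising UnicodeDecodeError. The name truncate
comes from the examples above; the parameter names text and max_bytes
are assumptions made for this sketch.

```python
def truncate(text, max_bytes):
    """Return the longest prefix of text that fits in max_bytes of UTF-8.

    Hypothetical helper for illustration; assumes text is a unicode
    string and the byte budget refers to its UTF-8 encoding.
    """
    if max_bytes <= 0:
        # Guard: a negative slice bound would count from the end instead.
        return u''
    encoded = text.encode('utf-8')
    # The byte slice may end in the middle of a multi-byte character.
    # Because the bytes came from a fresh encode, the only possible
    # invalid part is that incomplete tail, which 'ignore' discards.
    return encoded[:max_bytes].decode('utf-8', 'ignore')


sample = u'\u4000\u4001\u4002 abc'
assert truncate(sample, 3) == u'\u4000'
assert truncate(sample, 4) == u'\u4000'   # partial second character dropped
assert truncate(sample, 13) == sample     # whole string fits
```

Note that this only protects code point boundaries. A combining mark
is a separate code point, so it could still be cut off from its base
character.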

-- Andrew


