Most pythonic way to truncate unicode?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri May 29 00:27:58 EDT 2009


On Fri, 29 May 2009 04:09:53 +0000, John Machin wrote:

> John Machin <sjmachin <at> lexicon.net> writes:
> 
>> Andrew Fong <FongAndrew <at> gmail.com> writes:
>> 
>> > Are there any built-in ways to do something like this already? Or do I
>> > just have to iterate over the unicode string?
>> 
>> Converting each character to utf8 and checking the total number of
>> bytes so far?
>> Ooooh, sloooowwwwww!
>> 
>> 
> Somewhat faster:

What's wrong with Peter Otten's solution?

>>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
u'\xe4\xf6'

At most, you should have one error, at the very end: slicing the encoded 
bytes can only cut the last character in half. If you ignore it, you get 
the longest prefix of the unicode string whose UTF-8 encoding is <= 5 
*bytes* long.
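
Wrapped up as a function, the same idea might look like the sketch below. 
The name truncate_utf8 and its parameters are made up for illustration; 
they are not from Peter's post or from any library.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of `text` whose UTF-8 encoding
    is at most `max_bytes` bytes long.

    Hypothetical helper: the name and signature are illustrative only.
    """
    encoded = text.encode("utf8")
    # Slicing the bytes may cut the final character in half. The
    # "ignore" handler drops that incomplete trailing sequence
    # instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf8", "ignore")


# Each of these three characters takes 2 bytes in UTF-8, so a 5-byte
# limit keeps the first two and drops the half-cut third.
result = truncate_utf8(u"\xe4\xf6\xfc", 5)
```

Here result is u'\xe4\xf6', the same as the interactive example above.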

(If you encode using a different codec, you will likely get a different 
number of bytes.)
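
To see how much the byte count depends on the codec, here is a quick 
comparison. The helper name byte_lengths and the choice of codecs are 
just for illustration.

```python
def byte_lengths(text, codecs):
    """Map each codec name to the number of bytes `text` occupies
    when encoded with it. Hypothetical helper for illustration."""
    return dict((codec, len(text.encode(codec))) for codec in codecs)


# Three characters: a-umlaut, o-umlaut, u-umlaut.
sizes = byte_lengths(u"\xe4\xf6\xfc", ["utf8", "utf-32-le", "latin-1"])
```

For this string, sizes comes out as 6 bytes for utf8, 12 for utf-32-le 
and 3 for latin-1, so a 5-byte limit means something different in each.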


-- 
Steven


