Most pythonic way to truncate unicode?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Fri May 29 00:27:58 EDT 2009
On Fri, 29 May 2009 04:09:53 +0000, John Machin wrote:
> John Machin <sjmachin <at> lexicon.net> writes:
>
>> Andrew Fong <FongAndrew <at> gmail.com> writes:
>
>> > Are there any built-in ways to do something like this already? Or do I
>> > just have to iterate over the unicode string?
>>
>> Converting each character to utf8 and checking the total number of
>> bytes so far?
>> Ooooh, sloooowwwwww!
>>
>>
> Somewhat faster:
What's wrong with Peter Otten's solution?
>>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
u'\xe4\xf6'
Slicing the encoded bytes can split at most one character, the last one, so
there is at most one decoding error, at the very end. If you ignore it, you
get the longest prefix of the string that fits in 5 *bytes* when encoded as
UTF-8.
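
For anyone who wants this as a reusable helper, here is a minimal sketch of the same one-liner wrapped in a function. The name truncate_utf8 and its max_bytes parameter are my own, not anything from the standard library, and the sketch assumes Python 3 (where the u prefix is optional) and a non-negative byte limit.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if max_bytes < 0:
        # A negative slice bound would cut from the wrong end, so reject it.
        raise ValueError("max_bytes must be non-negative")
    # Slicing the bytes may cut the final character in half; "ignore"
    # drops that incomplete trailing sequence instead of raising.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", "ignore")


truncated = truncate_utf8(u"äöü", 5)
print(truncated)                        # äö
print(len(truncated.encode("utf-8")))   # 4
```

The result uses only 4 of the 5 permitted bytes because the fifth byte was the first half of "ü" and is discarded. Note that this cuts at code point boundaries, so a base letter can still be separated from a combining mark that follows it.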
(If you encode using a different codec, you will likely get a different
number of bytes.)
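
To make the point about codecs concrete, here is a small sketch that counts the encoded bytes of the same string under a few codecs. The helper name encoded_length is mine, and the choice of codecs is only for illustration; Python 3 is assumed.

```python
def encoded_length(text, codec):
    """Return how many bytes text occupies when encoded with codec.

    Hypothetical helper, for illustration only.
    """
    return len(text.encode(codec))


text = u"äöü"
for codec in ("latin-1", "utf-8", "utf-32-le"):
    print(codec, encoded_length(text, codec))
# latin-1 3
# utf-8 6
# utf-32-le 12
```

So a byte limit only means something once you have fixed the codec the text will be stored or sent in.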
--
Steven