Most pythonic way to truncate unicode?

John Machin sjmachin at lexicon.net
Fri May 29 01:31:40 EDT 2009


Steven D'Aprano <steve <at> REMOVE-THIS-cybersource.com.au> writes:

> 
> On Fri, 29 May 2009 04:09:53 +0000, John Machin wrote:
> 
> > John Machin <sjmachin <at> lexicon.net> writes:
> > 
> >> Andrew Fong <FongAndrew <at> gmail.com> writes:
> > 
> >> > Are there any built-in ways to do something like this already? Or
> >> > do I just have to iterate over the unicode string?
> >> 
> >> Converting each character to utf8 and checking the total number of
> >> bytes so far?
> >> Ooooh, sloooowwwwww!
> >> 
> > Somewhat faster:
> 
> What's wrong with Peter Otten's solution?
> 
> >>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")

Given the minimal info supplied by the OP, nothing. However, if the OP were to
answer your "why" question and supply some more detail, such as how long "long"
is, what percentage of the average string gets thrown away, whether he already
has the UTF-8 version anyway, and what he plans to do with the unicode string
after truncation (convert it back to UTF-8??), it may turn out that a forwards
search over the unicode string or a backwards search over the UTF-8 bytes is
preferable.
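For what it's worth, here is a rough pure-Python sketch of both approaches.
Illustrative only: the function names are mine, and it is written for Python 3,
where str is unicode (on 2.x you'd need u'' literals and ord() on the byte).

def truncate_forward(s, max_bytes):
    # Forward search over the unicode string: accumulate the encoded
    # length character by character and stop just before the byte
    # budget would be exceeded.
    total = 0
    for i, ch in enumerate(s):
        total += len(ch.encode("utf-8"))
        if total > max_bytes:
            return s[:i]
    return s

def truncate_backward(s, max_bytes):
    # Backward search over the UTF-8 bytes: cut at max_bytes, then step
    # back over any continuation bytes (0b10xxxxxx) so no multi-byte
    # character is split.
    data = s.encode("utf-8")
    if len(data) <= max_bytes:
        return s
    cut = max_bytes
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")

>>> truncate_forward("äöü", 5)
'äö'
>>> truncate_backward("äöü", 5)
'äö'

The backward search gives the same result as Peter's encode/slice/decode
one-liner; it just locates the cut point explicitly instead of relying on the
"ignore" error handler.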

If Pyrex or Cython is an option and runtime is a major consideration, then a
compiled version of either search is likely to be preferable.

Cheers,
John



