[I18n-sig] Autoguessing charset for Unicode strings?
Tim Peters
tim.one@home.com
Tue, 19 Jun 2001 20:32:19 -0400
[Machin, John]
> maybe not so expensive, depending on (a) what's in C and what's in
> Python and (b) function call overhead and (c) what proportion of text
> needs which character set ...
>
> loop once through your Unicode;
> if there were any chars with ordinal > 255, then use UTF-8
> elif there were any > 127, then use iso-8859-1
> else use ASCII
I don't know whether that algorithm makes sense, but it's efficient enough
in Python:
    biggest = max(map(ord, some_unicode_string))
    if biggest > 255:
        yadda
    elif biggest > 127:
        yadda
    else:
        yadda
So the bulk of the work goes almost entirely at C speed.
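
For illustration only, here is the same idea wrapped up as a function. This is a rough sketch, not anything from the thread: the name guess_charset is made up, the returned codec names are simply the ones from John's outline, and it assumes a Python where str is already Unicode (Python 3). The empty-string check is an addition, because max() raises ValueError on an empty sequence.

```python
def guess_charset(text):
    """Return the narrowest of three charsets able to encode text.

    Hypothetical helper: the name and the returned codec names are
    assumptions made for this sketch.
    """
    # max() raises ValueError on an empty sequence, so handle "" first.
    if not text:
        return "ascii"
    # The per-character loop happens inside max/map/ord, not in
    # Python-level bytecode.
    biggest = max(map(ord, text))
    if biggest > 255:
        return "utf-8"
    elif biggest > 127:
        return "iso-8859-1"
    else:
        return "ascii"
```

With that, something like text.encode(guess_charset(text)) should succeed for any of the three outcomes, since each branch picks a codec that covers every ordinal up to biggest.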