Validate string as UTF-8?
Diez B. Roggisch
deets at nospam.web.de
Sun Nov 6 15:10:55 EST 2005
Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.
>
> I don't see a fast way to do it in Python, though:
>
> unicode(s,'utf-8').encode('utf-8')
>
> seems to notice at least some of the time (the unicode() part works but
> the encode() part bombs). I don't consider a RE based solution to be
> fast. GLib provides a routine to do this, and I am using GTK so it's
> included in there somewhere, but I don't see a way to call GLib
> routines. I don't want to write another extension module.
I somehow doubt that the encode bombs. Can you provide some more
details? Perhaps a few example strings that allegedly don't work?
Besides that, it's unnecessary - the unicode(s, "utf-8") call should be
sufficient. If there are any undecodable byte sequences in there, the
decode step raises a UnicodeDecodeError, so that should find them.
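A minimal sketch of that decode-only check, offered as an illustration rather than a definitive implementation. The helper name is_valid_utf8 is hypothetical and not from this thread, and the sketch assumes the input is a byte string. It uses the decode() method, which is the equivalent spelling of unicode(s, "utf-8") and also exists on bytes objects in later Python versions:

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8.

    Hypothetical helper for illustration; the name is an assumption.
    """
    try:
        # Strict error handling is the default, so any malformed
        # sequence raises UnicodeDecodeError instead of being skipped.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The decoded result is simply discarded here, so no re-encoding step is involved. If the position of the first bad byte matters, the caught exception carries it in its start attribute.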
Regards,
Diez