Validate string as UTF-8?

Tony Nelson *firstname*nlsnews at georgea*lastname*.com
Sun Nov 6 21:47:39 CET 2005

In article <mailman.176.1131307306.18701.python-list at>,
 "Fredrik Lundh" <fredrik at> wrote:

> Tony Nelson wrote:
> > I'd like to have a fast way to validate large amounts of string data as
> > being UTF-8.
> define "validate".

I mean that all the data conforms to the UTF-8 encoding format.  I can 
live with someone having made data that impersonates UTF-8 but isn't 
really Unicode.

> > I don't see a fast way to do it in Python, though:
> >
> >     unicode(s,'utf-8').encode('utf-8')
> if "validate" means "make sure the byte stream doesn't use invalid
> sequences", a plain
>     unicode(s, "utf-8")
> should be sufficient.

You are correct.  I misunderstood what was happening in my code.  I 
apologise for wasting bandwidth and your time (and I wasted my own time 
as well).

Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough 
for my purpose, adding about 25% to the time to load a file.
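
A minimal sketch of that check, written for Python 3, where the 
equivalent of unicode(s, 'utf-8') is a strict bytes.decode('utf-8').  
The helper name is_valid_utf8 is hypothetical and not part of any 
library; the sketch assumes the data is already in memory as a bytes 
object.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        # Strict error handling is the default, so any invalid byte
        # sequence raises UnicodeDecodeError instead of being replaced.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Only the decode step is needed; the decoded result is thrown away, so 
there is no reason to encode it back to bytes afterwards.
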
TonyN.:'                        *firstname*nlsnews at georgea*lastname*.com
      '                                  <>
