Validate string as UTF-8?
Tony Nelson
*firstname*nlsnews at georgea*lastname*.com
Sun Nov 6 15:47:39 EST 2005
In article <mailman.176.1131307306.18701.python-list at python.org>,
"Fredrik Lundh" <fredrik at pythonware.com> wrote:
> Tony Nelson wrote:
>
> > I'd like to have a fast way to validate large amounts of string data as
> > being UTF-8.
>
> define "validate".
Check that all of the data conforms to the UTF-8 encoding format. I can
live with data that someone has crafted to look like UTF-8 but that isn't
really Unicode.
> > I don't see a fast way to do it in Python, though:
> >
> > unicode(s,'utf-8').encode('utf-8')
>
> if "validate" means "make sure the byte stream doesn't use invalid
> sequences", a plain
>
> unicode(s, "utf-8")
>
> should be sufficient.
You are correct. I misunderstood what was happening in my code. I
apologise for wasting bandwidth and your time (and I wasted my own time
as well).
Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough
for my purpose, adding about 25% to the time to load a file.
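
For the archives, here is a minimal sketch of that check wrapped in a
helper. The name is_valid_utf8 is hypothetical, not something from my
code. It uses s.decode('utf-8'), which does the same strict decoding as
unicode(s, 'utf-8') on a byte string and also works on Python 3 bytes
objects.

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes as strict UTF-8.

    Hypothetical helper: it relies only on the built-in codec, which
    raises UnicodeDecodeError on the first invalid byte sequence.
    """
    try:
        # Strict decoding; the decoded text is thrown away.
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


if __name__ == '__main__':
    print(is_valid_utf8(b'plain ASCII is valid UTF-8'))
    print(is_valid_utf8(b'\xc3\xa9'))  # two-byte sequence for U+00E9
    print(is_valid_utf8(b'\xff\xfe'))  # 0xFF never appears in UTF-8
```

The decoded string is discarded, so the cost is one decode pass over the
data. If the position of the bad byte matters, catch the exception
yourself and look at its start attribute instead of returning False.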
________________________________________________________________________
TonyN.:' *firstname*nlsnews at georgea*lastname*.com
' <http://www.georgeanelson.com/>