String is ASCII or UTF-8?

Tue Mar 9 12:07:44 EST 2010

On 09/03/2010 16:54, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.

You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF-8-encoded text, since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char: one byte for each ASCII character and two to
four bytes for anything else.
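
To make the variable-width point concrete, here is a rough sketch. The
helper name utf8_lengths is made up for illustration, and the code is
written so that it should run unchanged on old and new Pythons alike:

```python
def utf8_lengths(chars):
    """Return (character, number of UTF-8 bytes) pairs."""
    return [(ch, len(ch.encode("utf-8"))) for ch in chars]

# 'A', e-acute and the euro sign, given as unicode escapes.
samples = [u"A", u"\u00e9", u"\u20ac"]
for ch, n in utf8_lengths(samples):
    print("U+%04X -> %d byte(s)" % (ord(ch), n))

# Pure-ASCII text produces the same bytes under either encoding.
assert u"plain".encode("ascii") == u"plain".encode("utf-8")
```

This should report 1, 2 and 3 bytes respectively, so the byte count
alone cannot tell you which encoding a string was written in.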

Obviously, you can test whether all the bytes are less than 128, which
suggests that the text is legal ASCII. But then it's also legal UTF-8.
Or you can just attempt to decode and catch the exception:

try:
    # text is the byte string (a Python 2 str) being checked
    unicode(text, "ascii")
except UnicodeDecodeError:
    print "Not ASCII"
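
If you want to go one step further and separate "pure ASCII" from
"valid UTF-8 with non-ASCII characters" from "neither", something along
these lines might do. It is only a sketch: classify_bytes is a name I
have invented, and it assumes you pass it a byte string, not a unicode
object.

```python
def classify_bytes(data):
    """Return "ascii", "utf-8" or "unknown" for a byte string."""
    # Try the stricter codec first: anything that decodes as ASCII
    # would also decode as UTF-8.
    for codec in ("ascii", "utf-8"):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            pass
        else:
            return codec
    return "unknown"

samples = [
    u"plain text".encode("utf-8"),   # ASCII-only bytes
    u"caf\u00e9".encode("utf-8"),    # e-acute as two UTF-8 bytes
    u"caf\u00e9".encode("latin-1"),  # e-acute as one byte, not valid UTF-8
]
for data in samples:
    print("%r -> %s" % (data, classify_bytes(data)))
```

Bear in mind that a "utf-8" answer only means the bytes happen to form
valid UTF-8, not that whoever produced them intended that encoding. It
is still a heuristic.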

TJG