String is ASCII or UTF-8?
Alf P. Steinbach
alfps at start.no
Tue Mar 9 12:02:29 EST 2010
* C. Benson Manica:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.
Generally, if you need 100% certainty then you can't tell the encoding from a
sequence of byte values.
However, if you know that it's EITHER ascii or utf-8 then the presence of any
value above 127 (or, for signed byte values, any negative value) tells you
that it can't be ascii and hence must be utf-8. And since utf-8 is an extension of
ascii, nothing is lost by assuming ascii in the other case. So, problem solved.
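A minimal sketch of that check follows. The function names are made up for
illustration, and it only makes sense under the assumption stated above, namely
that the data is known to be either ascii or utf-8. It uses bytearray, which
needs Python 2.6 or later; on 2.4 you would loop over the string and test
ord(ch) > 127 instead.

```python
def looks_like_ascii(data):
    """Return True if no byte in `data` has a value above 127."""
    # bytearray yields integer byte values on both Python 2.6+ and Python 3.
    for byte in bytearray(data):
        if byte > 127:
            return False
    return True


def guess_encoding(data):
    """Guess 'ascii' or 'utf-8' for a byte string (hypothetical helper)."""
    # Assumption: the data is EITHER ascii or utf-8, nothing else.
    if looks_like_ascii(data):
        return "ascii"
    return "utf-8"
```

Note that this does not validate the bytes as well-formed utf-8; it only rules
out ascii.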
If the string represents the contents of a file then you may also look for the
UTF-8 representation of the Unicode BOM (Byte Order Mark) at the beginning. If
found, it almost surely indicates utf-8, and more expensive searching can
be avoided. It's just three bytes to check.
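The BOM check could look something like this. The function name is again just
an illustration; codecs.BOM_UTF8 is the standard library's constant for the
three bytes EF BB BF.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 encoded BOM."""
    # Compare only the first three bytes against EF BB BF.
    return data[:3] == codecs.BOM_UTF8
```

A missing BOM tells you nothing either way, since most utf-8 files don't carry
one.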
Cheers & hth.,
- Alf