String is ASCII or UTF-8?
Stef Mientki
stef.mientki at gmail.com
Tue Mar 9 12:12:48 EST 2010
On 09-03-2010 18:02, Alf P. Steinbach wrote:
> * C. Benson Manica:
>> Hours of Googling have not helped me resolve a seemingly simple
>> question - Given a string s, how can I tell whether it's ascii (and
>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>> This is python 2.4.3, so I don't have getsizeof available to me.
>
> Generally, if you need 100% certainty then you can't tell the encoding
> from a sequence of byte values.
>
> However, if you know that it's EITHER ascii or utf-8 then the presence
> of any value above 127 (or, for signed byte values, any negative
> values), tells you that it can't be ascii,
AFAIK it's completely impossible.
UTF-8 uses 1 to 4 bytes per character.
I can create ASCII strings containing byte values between 127 and 255.
cheers,
Stef
> hence, must be utf-8. And since utf-8 is an extension of ascii nothing
> is lost by assuming ascii in the other case. So, problem solved.
>
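For illustration, the either/or check described above might be sketched as follows. This is a minimal sketch, not code from the thread: the function name classify_bytes is hypothetical, and it assumes a Python 3 bytes object (on Python 2.4 one would apply ord() to each character of a str instead). It also adds a strict decode step, because bytes above 127 that do not form valid UTF-8 fit neither label. Passing the decode does not prove the data was meant as UTF-8; it only shows the bytes are consistent with it.

```python
def classify_bytes(data):
    """Label `data` as 'ascii', 'utf-8', or 'unknown'.

    Hypothetical helper; assumes `data` is a Python 3 bytes object.
    """
    # Iterating over bytes yields integers; pure ASCII stays below 128.
    if all(b < 128 for b in data):
        return "ascii"
    # High bytes are present, so check whether they form valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Neither ASCII nor well-formed UTF-8 (e.g. Latin-1 text).
        return "unknown"
    return "utf-8"
```

With this sketch, plain English text is reported as 'ascii', text with accented characters encoded as UTF-8 is reported as 'utf-8', and Latin-1 encoded accented text typically falls through to 'unknown'.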
> If the string represents the contents of a file then you may also look
> for a UTF-8 representation of the Unicode BOM (Byte Order Mark) at the
> beginning. If found, it almost surely indicates utf-8, and more
> expensive searching can be avoided. It's just three bytes to check.
>
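The three-byte BOM check mentioned above could look roughly like this. Again a sketch under assumptions: the function name has_utf8_bom is hypothetical, and the input is assumed to be a bytes object holding the start of the file. The constant codecs.BOM_UTF8 comes from the standard library and holds the bytes EF BB BF.

```python
import codecs


def has_utf8_bom(data):
    """Return True if `data` starts with the UTF-8 encoded BOM.

    Hypothetical helper; assumes `data` is a bytes object.
    """
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    return data.startswith(codecs.BOM_UTF8)
```

Many UTF-8 files are written without a BOM, so a False result here says nothing either way; only a True result is a useful hint.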
>
> Cheers & hth.,
>
> - Alf