String is ASCII or UTF-8?

Stef Mientki stef.mientki at gmail.com
Tue Mar 9 16:36:08 EST 2010


On 09-03-2010 18:36, Robert Kern wrote:
> On 2010-03-09 11:12 AM, Stef Mientki wrote:
>> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>>> * C. Benson Manica:
>>>> Hours of Googling have not helped me resolve a seemingly simple
>>>> question - Given a string s, how can I tell whether it's ascii (and
>>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>>
>>> Generally, if you need 100% certainty then you can't tell the encoding
>>> from a sequence of byte values.
>>>
>>> However, if you know that it's EITHER ascii or utf-8 then the presence
>>> of any value above 127 (or, for signed byte values, any negative
>>> values), tells you that it can't be ascii,
>> AFAIK it's completely impossible.
>> UTF-8 characters have 1 to 4 bytes / character.
>> I can create ASCII strings containing byte values between 127 and 255.
>
> No, you can't. ASCII strings only have characters in the range 0..127. 
> You could create Latin-1 (or any number of the 8-bit encodings out 
> there) strings with characters 0..255, yes, but not ASCII.
>
Probably, and according to Wikipedia you're right.
I think I have to get rid of my old books:
Borland Turbo Pascal 4 (1987) has an ASCII table of 256 characters,
while the small print says 7-bit  ;-)
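
For what it's worth, the checks discussed above can be sketched roughly
like this. It is only a minimal sketch, assuming a modern Python (2.6+ or
3.x) rather than the 2.4.3 from the original question, where all() and
bytearray are not available. The helper names is_ascii, is_valid_utf8 and
utf8_len are made up for illustration, they are not part of any library.

```python
def is_ascii(data):
    """Return True if every byte value in data is in the ASCII range 0..127."""
    # bytearray() yields integer byte values on both Python 2.6+ and 3.x.
    return all(b < 128 for b in bytearray(data))


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8.

    This only shows the bytes are *consistent* with UTF-8; it cannot
    prove that UTF-8 was the encoding the writer intended.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def utf8_len(ch):
    """Return how many bytes one (hypothetical) text character takes in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    print(is_ascii(b"hello"))            # pure ASCII bytes
    print(is_ascii(b"caf\xc3\xa9"))      # UTF-8 bytes for "cafe" with e-acute
    print(is_valid_utf8(b"caf\xc3\xa9")) # well-formed UTF-8
    print(is_valid_utf8(b"caf\xe9"))     # Latin-1 byte, not well-formed UTF-8
    print(utf8_len(u"a"), utf8_len(u"\u00e9"), utf8_len(u"\u20ac"))
```

Note that every pure-ASCII byte string is also valid UTF-8, so is_ascii
only separates "definitely not ASCII" from "could be either", which is
the point Alf made above.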

cheers,
Stef