<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html; charset=ISO-8859-1"

 http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

On 09-03-2010 18:36, Robert Kern wrote:

<blockquote cite="mid:hn60vg$5pi$1@dough.gmane.org" type="cite">On

2010-03-09 11:12 AM, Stef Mientki wrote:

  <br>

  <blockquote type="cite">On 09-03-2010 18:02, Alf P. Steinbach wrote:

    <br>

    <blockquote type="cite">* C. Benson Manica:

      <br>

      <blockquote type="cite">Hours of Googling have not helped me

resolve a seemingly simple

        <br>

question - Given a string s, how can I tell whether it's ascii (and

        <br>

thus 1 byte per character) or UTF-8 (and two bytes per character)?

        <br>

This is Python 2.4.3, so I don't have getsizeof available to me.

        <br>

      </blockquote>

      <br>

Generally, if you need 100% certainty then you can't tell the encoding

      <br>

from a sequence of byte values.

      <br>

      <br>

However, if you know that it's EITHER ascii or utf-8 then the presence

      <br>

of any value above 127 (or, for signed byte values, any negative

      <br>

values) tells you that it can't be ascii,

      <br>

    </blockquote>

AFAIK it's completely impossible.

    <br>

UTF-8 characters have 1 to 4 bytes / character.

    <br>

I can create ASCII strings containing byte values between 127 and 255.

    <br>

  </blockquote>

  <br>

No, you can't. ASCII strings only have characters in the range 0..127.

You could create Latin-1 (or any number of the 8-bit encodings out

there) strings with characters 0..255, yes, but not ASCII.

  <br>

  <br>

</blockquote>

<font size="+1">Probably, and according to Wikipedia you're right.<br>

I think I have to get rid of my old books:<br>

Borland Turbo Pascal 4 (1987) has an ASCII table of 256 characters,<br>

while the small print says 7-bit  ;-)<br>

<br>

cheers,<br>

Stef<br>

</font>
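
<p>A minimal sketch of the check discussed above, assuming the input is a
byte string (a plain <code>str</code> in Python 2, <code>bytes</code> in
Python 3). The helper name <code>classify</code> is hypothetical, not a
standard function. It only reports which decoding succeeds first; as noted
earlier in the thread, that is not proof of the intended encoding, and pure
ASCII data is also valid UTF-8.</p>

<pre>
```python
def classify(data):
    """Return 'ascii', 'utf-8', or 'other' for a byte string.

    Assumption: data is a byte string, not an already-decoded
    unicode object.
    """
    # Try the stricter codec first: every ASCII string is also valid
    # UTF-8, so the order of the two names matters.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # A byte outside 0..127 (for ascii) or a malformed
            # sequence (for utf-8) rules this codec out.
            continue
        return name
    # Neither codec accepted the bytes, e.g. Latin-1 text.
    return "other"
```
</pre>

<p>For example, the two-byte UTF-8 sequence for &eacute; (0xC3 0xA9) should
come back as "utf-8", while a lone Latin-1 byte 0xE9 should fall through to
"other".</p>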

</body>

</html>