String is ASCII or UTF-8?

Roel Schroeven rschroev_nospam_ml at fastmail.fm
Tue Mar 9 13:13:07 EST 2010


On 2010-03-09 18:31, C. Benson Manica wrote:
> On Mar 9, 12:24 pm, "Richard Brodie" <R.Bro... at rl.ac.uk> wrote:
>> "C. Benson Manica" <cbman... at gmail.com> wrote in message news:98375575-1071-46af-8ebc-f3c817b47e1d at q23g2000yqd.googlegroups.com...
>>
>>> The strings come from the same place, i.e. they're exclusively
>>> normal ASCII characters.
>>
>> In this case then converting them to/from UTF-8 is a no-op, so
>> it makes no difference at all.
> 
> Except to the database library, which seems perfectly happy to send an
> 8-character UTF-8 string to the database as 16 raw characters...

In that case I think you mean UTF-16 or UCS-2 instead of UTF-8. UTF-16
uses 2 or 4 bytes per character, UCS-2 always uses 2 bytes per
character. UTF-8 uses 1 to 4 bytes per character, and exactly 1 byte for
each ASCII character.
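
To make the sizes concrete, here is a small sketch in Python 3 syntax. The
helper name encoded_lengths is made up for illustration, and "utf-16-le" is
chosen only to get a fixed byte order without a byte order mark; it says
nothing about what your database library actually does.

```python
def encoded_lengths(text):
    """Return (UTF-8 length, UTF-16 length) in bytes for a str.

    Hypothetical helper for illustration only. UTF-16 is encoded as
    little-endian so that no byte order mark is prepended.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(encoded_lengths("database"))    # 8 ASCII characters -> (8, 16)
print(encoded_lengths("\u00e9"))      # e with acute accent -> (2, 2)
print(encoded_lengths("\U0001F600"))  # outside the BMP -> (4, 4)
```

An 8-character ASCII string that goes out as 16 bytes therefore fits
UTF-16 or UCS-2, not UTF-8.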

If your texts are in a Western language, one byte of each two-byte pair
will be zero for most characters; you could check for that (but note that
the zero byte comes second in little-endian byte order and first in
big-endian byte order).
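
A rough sketch of such a check, again in Python 3 syntax. The function name
guess_utf16_byte_order and the "more than half" threshold are my own
assumptions; it is only a heuristic for mostly-Western text and does not
look for a byte order mark.

```python
def guess_utf16_byte_order(data):
    """Guess 'le', 'be', or None from where the zero bytes sit.

    Hypothetical heuristic: treats data (a bytes object) as a sequence
    of 2-byte units and counts zero bytes in each position.
    """
    if len(data) < 2 or len(data) % 2:
        return None  # not a whole number of 2-byte units
    units = len(data) // 2
    zeros_first = data[0::2].count(0)   # zero bytes in first position
    zeros_second = data[1::2].count(0)  # zero bytes in second position
    if zeros_second > units / 2 and zeros_second > zeros_first:
        return "le"  # low byte first, zero high byte second
    if zeros_first > units / 2 and zeros_first > zeros_second:
        return "be"  # zero high byte first, low byte second
    return None      # no clear pattern, probably not UTF-16


print(guess_utf16_byte_order("database".encode("utf-16-le")))  # le
print(guess_utf16_byte_order("database".encode("utf-16-be")))  # be
print(guess_utf16_byte_order("database".encode("utf-8")))      # None
```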

HTH,
Roel

-- 
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
  -- Isaac Asimov

Roel Schroeven


