Best ways of managing text encodings in source/regexes?

J. Clifford Dyer jcd at sdf.lonestar.org
Tue Nov 27 09:02:12 EST 2007


On Tue, Nov 27, 2007 at 01:30:15AM +0200, tinker barbet wrote regarding Re: Best ways of managing text encodings in source/regexes?:
> 
> Hi
> 
> Thanks for your responses, as I said on the reply I posted I thought
> later it was a bit long so I'm grateful you held out!
> 
> I should have said (but see comment about length) that I'd tried
> joining a unicode and a byte string in the interpreter and got that
> working, but wondered whether it was safe (I'd seen the odd warning
> about mixing byte strings and unicode).  Anyway what I was concerned
> about was what python does with source files rather than what happens
> from the interpreter, since I don't know if it's possible to change
> the encoding of the terminal without messing with site.py (and what
> else).
> 
> Aren't both ASCII and ISO-8859-1 subsets of UTF-8?  Can't you then use
> chars from either of those charsets in a file saved as UTF-8 by one's
> editor, with a # -*- coding: utf-8 -*- pseudo-declaration for python,
> without problems?  You seem to disagree.
> 
I do disagree.  Unicode (the set of code points) is a superset of ISO-8859-1, but UTF-8 is a specific encoding of those code points, and it changes many of the byte values.  UTF-8 was designed specifically not to change the values of ASCII characters: 0x61 (lower case a) is encoded with the bits 0110 0001 in ASCII, and also 0110 0001 in UTF-8.  However, ñ, "latin small letter n with tilde", is Unicode/ISO-8859-1 character 0xf1.  In ISO-8859-1, this is represented by the bits 1111 0001.
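
To make that concrete at the interpreter (same character, two different encodings):

py>>> u'a'.encode('ascii')
'a'
py>>> u'a'.encode('utf-8')
'a'
py>>> u'\u00f1'.encode('latin-1')
'\xf1'
py>>> u'\u00f1'.encode('utf-8')
'\xc3\xb1'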

UTF-8 gets a little tricky here.  In order to represent code points beyond the 7-bit ASCII range, it has to insert control bits at the beginning of each byte, so this character actually requires two bytes instead of one.  To show that it is using two bytes for the character, the first byte begins with 110 (1110 is used when three bytes are needed).  Each continuation byte begins with 10 to show that it is not the beginning of a character.  The code-point value is then packed into the remaining free bits, as far to the right as possible.  So in this case, the control bits are

110x xxxx 10xx xxxx.  

The character value, 0xf1, or:
1111 0001

gets inserted as follows:

110x xx{11}  10{11 0001}

and the remaining free x-es get replaced by zeroes.

1100 0011  1011 0001.

Note that the Python interpreter agrees:

py>>> x = u'\u00f1'
py>>> x.encode('utf-8')
'\xc3\xb1'

(Conversion from binary to hex is left as an exercise for the reader)
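
(Or, if you'd rather make Python do the exercise, here is one way to pack the bits by hand, using nothing but the standard integer operators:)

py>>> cp = 0xf1                      # code point for n-with-tilde
py>>> byte1 = 0xc0 | (cp >> 6)       # 110xxxxx: everything above the low six bits
py>>> byte2 = 0x80 | (cp & 0x3f)     # 10xxxxxx: the low six bits
py>>> '%02x %02x' % (byte1, byte2)
'c3 b1'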

So while ASCII is a subset of UTF-8, ISO-8859-1 is definitely not.  As others have said many times when this issue periodically comes up: UTF-8 is not Unicode.  Hopefully this helps explain exactly why.
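
You can also watch the mismatch directly: hand the same Latin-1 byte to both codecs, and only one of them accepts it (traceback trimmed):

py>>> '\xf1'.decode('latin-1')
u'\xf1'
py>>> '\xf1'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...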

Note that with other encodings, like UTF-16, even ASCII is not a subset.
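
For instance, using the endian-explicit variant of the codec so the output doesn't depend on your machine's byte order (the plain 'utf-16' codec would also prepend a byte-order mark), even a plain 'a' picks up a zero byte:

py>>> u'a'.encode('utf-16-be')
'\x00a'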

See the Wikipedia article on UTF-8 for a more complete explanation and external references to the official documentation (http://en.wikipedia.org/wiki/UTF-8).



> The reason all this arose was that I was using ISO-8859-1/Latin-1 with
> all the right declarations, but then I needed to match a few chars
> outside of that range.  So I didn't need to use u"" before, but now I
> do in some regexes, and I was wondering if this meant that /all/ my
> regexes had to be constructed from u"" strings or whether I could just
> do the selected ones, either using literals (and saving the file as
> UTF-8) or unicode escape sequences (and leaving the file as ASCII -- I
> already use hex escape sequences without problems but that doesn't
> work past the end of ISO-8859-1).
> 

Do you know about Unicode escape sequences?

py>>> u'\xf1' == u'\u00f1'
True
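
And since regexes were the original problem: escape sequences work inside u'' pattern literals too, so only the patterns that need characters past Latin-1 have to become Unicode, and the source file can stay ASCII.  A small sketch (the character class here is just an arbitrary example):

py>>> import re
py>>> pat = re.compile(u'[\u00f1\u0144]')  # n-with-tilde or n-with-acute, as escapes
py>>> pat.search(u'pi\u00f1ata').group()
u'\xf1'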

> Thanks again for your feedback.
> 
> Best wishes
> Tim
> 

No problem.  It took me a while to wrap my head around it, too.

Cheers,
Cliff


