UTF-8 question from Dive into Python 3

Tim Harig usernet at ilthio.net
Wed Jan 19 14:18:49 EST 2011


On 2011-01-19, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
> Tim Harig <usernet at ilthio.net> wrote:
>> Converting to a fixed byte
>> representation (UTF-32/UCS-4) or separating all of the bytes for each
>> UTF-8 into 6 byte containers both make it possible to simply index the
>> letters by a constant size.  You will note that Python does the
>> former.
>
> Indeed, Python chose the wise option. Actually, I'd be curious of any
> real-world software which successfully chose your proposed approach.

The point is basically the same.  I created an example because it
was simpler to follow for demonstration purposes than an actual UTF-8
conversion to any official multibyte format.  You obviously have no
other purpose than to be contrary, so we ended up following tangents.
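
For reference, a minimal sketch of what the conversion to a fixed-width
format buys, assuming Python 3.  The helper name code_point_at and the
sample string are made up for illustration; this is not how CPython
stores strings internally, only a demonstration of constant-size
indexing over UTF-32 code units.

```python
import struct

def code_point_at(utf8_bytes, index):
    """Return the code point at position `index` of UTF-8 encoded data."""
    # Convert to a fixed width of 4 bytes per code point.  "utf-32-be"
    # pins the byte order explicitly and writes no byte order mark.
    utf32 = utf8_bytes.decode("utf-8").encode("utf-32-be")
    # Every code point now starts at a multiple of 4, so indexing is a
    # multiplication instead of a scan over variable-length sequences.
    (code_point,) = struct.unpack_from(">I", utf32, index * 4)
    return code_point

# Hypothetical sample: 7 code points that occupy 10 bytes in UTF-8.
sample = "naïve €".encode("utf-8")
print(hex(code_point_at(sample, 6)))
```

The same lookup directly on the UTF-8 bytes would have to walk the data
from the start, because the code points before the target take between
one and three bytes each in this sample.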

As soon as you start to convert to a multibyte format, the endian issues
occur.  For UTF-8 on big endian hardware, this is anti-climactic because
all of the bits are already stored in proper order.  Little endian systems
will probably convert to a native endian format.  If you choose
to ignore that, that is your prerogative.  Have a nice day.
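
A small sketch of the byte-order point, assuming Python 3 and using the
euro sign purely as an illustrative sample.  It shows that the UTF-8
bytes come out the same regardless of platform, while the 4-byte UTF-32
form exists in two byte orders, one of which matches the native integer
layout of the machine running the code.

```python
import sys

text = "€"  # U+20AC, chosen only as an example

# UTF-8 is defined as a sequence of single bytes, so there is one order.
utf8 = text.encode("utf-8")

# UTF-32 uses 4-byte units, so the same code point has two layouts.
utf32_be = text.encode("utf-32-be")
utf32_le = text.encode("utf-32-le")

# Pick the layout that matches this machine and read it as a native int.
native = utf32_le if sys.byteorder == "little" else utf32_be
code_point = int.from_bytes(native, sys.byteorder)

print(utf8.hex(), utf32_be.hex(), utf32_le.hex(), hex(code_point))
```

Encoding with plain "utf-32" instead would prepend a byte order mark and
use the native order, which is why the explicit -be and -le codec names
are used here.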


