UTF-8 question from Dive into Python 3

Wed Jan 19 06:34:53 EST 2011

On 2011-01-19, Tim Roberts <timr at probo.com> wrote:
> Tim Harig <usernet at ilthio.net> wrote:
>>On 2011-01-17, carlo <sysengp2p at gmail.com> wrote:
>>
>>> 2- If that were true, can you point me to some documentation about the
>>> math that, as Mark says, demonstrates this?
>>
>>It is true because UTF-8 is essentially an 8 bit encoding: once it
>>exhausts the addressable space of the current byte, it moves to the
>>next one.  Since the bytes are accessed and assessed sequentially, they
>>must be in big-endian order.
>
> You were doing excellently up to that last phrase.  Endianness only applies
> when you treat a series of bytes as a larger entity.  That doesn't apply to
> UTF-8.  None of the bytes is more "significant" than any other, so by
> definition it is neither big-endian nor little-endian.

It depends on how you process it, and it doesn't generally make much
difference in Python.  Accessing UTF-8 data from C can be much trickier
if you use a multibyte type to store the data.  In that case, if you
happen to be on a little-endian architecture, it may be necessary to
remember that the data is not in the order that your processor expects
it to be for numeric operations and comparisons.  That is why the FAQ I
linked to says that yes, you can consider UTF-8 to always be in
big-endian order.  Essentially all byte based data is big-endian.
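
To make the multibyte case concrete, here is a rough sketch in Python
rather than C.  The helper utf8_as_int is my own name for illustration,
not anything from the standard library; it packs the UTF-8 bytes of one
character into an integer, roughly the way a C program would by loading
them into a wider type.  The encoded byte sequence is the same on every
machine; only the integer interpretation depends on the byte order you
pick.

```python
def utf8_as_int(ch, byteorder):
    """Pack the UTF-8 bytes of one character into a single integer.

    Hypothetical helper: it mimics reading the encoded bytes into a
    multibyte integer type with the given byte order.
    """
    return int.from_bytes(ch.encode("utf-8"), byteorder)


euro = "\u20ac"                      # EURO SIGN, three bytes in UTF-8
encoded = euro.encode("utf-8")
print(encoded.hex())                 # e282ac, regardless of platform

as_big = utf8_as_int(euro, "big")
as_little = utf8_as_int(euro, "little")
print(hex(as_big))                   # 0xe282ac
print(hex(as_little))                # 0xac82e2

# Comparisons: U+00FF encodes as C3 BF and U+0100 as C4 80.
# Read big-endian, the integers sort in code point order;
# read little-endian, the order flips.
low, high = "\u00ff", "\u0100"
print(utf8_as_int(low, "big") < utf8_as_int(high, "big"))        # True
print(utf8_as_int(low, "little") < utf8_as_int(high, "little"))  # False
```

If you only ever walk the data one byte at a time, none of this comes
up, which is the usual situation in Python.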