UTF-8 question from Dive into Python 3

Wed Jan 19 10:11:51 EST 2011

On Jan 19, 9:00 am, Tim Harig <user... at ilthio.net> wrote:
>
> So, you can always assume a big-endian and things will work out correctly
> while you cannot always make the same assumption as little endian
> without potential issues.  The same holds true for any byte stream data.

You need to spend some serious time programming a serial port or other
byte/bit-stream oriented interface, and then you'll realize the folly
of your statement.

> That is why I say that byte streams are essentially big endian.  It is
> all a matter of how you look at it.

It is nothing of the sort.  Some byte streams are, in fact, little-
endian: when the bytes are combined into larger objects, the least-
significant byte of the object comes first.  A lot of industrial and
embedded protocols put the LSB first in the sequence; CAN comes to
mind as an example.

The only way to know is for the standard describing the stream to tell
you what to do.
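
To make that concrete, here is a minimal Python 3 sketch.  The two-byte
payload and the helper name decode_u16 are made up for illustration
and do not come from any real protocol.  The same two bytes decode to
different integers depending on which byte order the standard
specifies:

```python
# Hypothetical two-byte field exactly as it arrives on the wire.
payload = bytes([0x12, 0x34])

def decode_u16(data: bytes, byteorder: str) -> int:
    """Combine the first two bytes into an unsigned 16-bit integer."""
    return int.from_bytes(data[:2], byteorder)

big = decode_u16(payload, "big")        # first byte is most significant
little = decode_u16(payload, "little")  # first byte is least significant

print(hex(big), hex(little))  # 0x1234 0x3412
```

Nothing in the bytes themselves says which of the two results is the
intended one.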

>
> I prefer to look at all data as endian even if it doesn't create
> endian issues because it forces me to consider any endian issues that
> might arise.  If none do, I haven't really lost anything.  If you
> simply assume that any byte sequence cannot have endian issues you
> ignore the possibility that such issues might not arise.

No, you must assume nothing unless you're told how to combine the
bytes within a sequence into a larger element.  Plus, not all byte
streams support such operations!  Some byte streams really are just a
sequence of bytes and the bytes within the stream cannot be
meaningfully combined into larger data types. If I give you a series
of 8-bit (so 1 byte) samples from an analog-to-digital converter, tell
me how to combine them into a 16-, 32-, or 64-bit integer.  You cannot
do it without altering the meaning of the samples; it is a completely
nonsensical operation.
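
A short Python 3 sketch of the point, using made-up sample values
rather than data from a real converter: each byte is already a
complete reading, and pairing bytes into 16-bit integers produces
numbers that no sample ever had, whichever byte order you pick.

```python
import struct

# Hypothetical 8-bit ADC readings, one sample per byte.
samples = bytes([10, 200, 10, 200])

# Each byte is a complete value on its own.
values = list(samples)

# Pairing bytes into 16-bit integers invents values that were never sampled.
as_big = struct.unpack(">2H", samples)     # pairs read as big-endian
as_little = struct.unpack("<2H", samples)  # pairs read as little-endian

print(values)     # [10, 200, 10, 200]
print(as_big)     # (2760, 2760)
print(as_little)  # (51210, 51210)
```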

Adam