UTF-8 question from Dive into Python 3

Antoine Pitrou solipsis at pitrou.net
Mon Jan 17 17:34:57 EST 2011


On Mon, 17 Jan 2011 14:19:13 -0800 (PST)
carlo <sysengp2p at gmail.com> wrote:
> Is it true UTF-8 does not have any "big-endian/little-endian" issue
> because of its encoding method?

Yes.

> And if it is true, why does Mark (and everyone else) write about
> UTF-8 with and without BOM some chapters later? What would be the
> purpose of the BOM then?

"BOM" in this case is a misnomer. For UTF-8, it is only used as a
marker (a magic number, if you like) to signal that a given text file
is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and,
actually, it does not change with endianness.
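
A small Python 3 sketch of the point above, using only the constants in
the standard codecs module. UTF-16 has two possible signatures, one per
byte order, while UTF-8 has exactly one: U+FEFF encoded as three bytes.
The name utf8_signature is just an illustrative variable.

```python
import codecs

# UTF-16 has one signature per byte order, so the BOM tells them apart.
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'

# UTF-8 has a single signature, whatever the machine's endianness.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'

# That signature is simply U+FEFF run through the normal UTF-8 encoder.
utf8_signature = "\ufeff".encode("utf-8")
print(utf8_signature == codecs.BOM_UTF8)  # True
```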

(note that it is not required to put a UTF-8 "BOM" at the beginning of
text files; it is just a hint that some tools use when
generating/reading UTF-8)
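
To make the "optional hint" concrete, here is a minimal sketch with
Python's "utf-8-sig" codec, which writes the signature when encoding and
skips it (if present) when decoding. The sample string and variable
names are only for illustration.

```python
text = "na\u00efve"  # any text with a non-ASCII character

plain = text.encode("utf-8")        # no signature
signed = text.encode("utf-8-sig")   # 3-byte signature + the same bytes

# The payload is identical; only the leading marker differs.
print(signed == b"\xef\xbb\xbf" + plain)  # True

# "utf-8-sig" decodes both forms to the same string.
decoded_plain = plain.decode("utf-8-sig")
decoded_signed = signed.decode("utf-8-sig")

# Plain "utf-8" does not strip the marker: it survives as U+FEFF.
leftover = signed.decode("utf-8")
print(leftover.startswith("\ufeff"))  # True
```

This is why a tool that reads with plain "utf-8" may show a stray
invisible character at the start of a file written with the signature.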

> 2- If that were true, can you point me to some documentation about the
> math that, as Mark says, demonstrates this?

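Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
encoding: its code units are single bytes, so there is no multi-byte
unit whose bytes could be stored in a different order. There is no math
involved; it just works by construction.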
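
A short Python 3 sketch of what "byte-oriented" means in practice,
using the euro sign (U+20AC) as an arbitrary example character. UTF-16
stores it as one 16-bit unit, which can be laid out in two byte orders;
UTF-8 defines a single sequence of bytes.

```python
ch = "\u20ac"  # EURO SIGN, chosen only as an example

# UTF-16: one 16-bit code unit, 0x20AC, with two possible byte layouts.
utf16_le = ch.encode("utf-16-le")
utf16_be = ch.encode("utf-16-be")
print(utf16_le.hex())  # ac20
print(utf16_be.hex())  # 20ac

# Both layouts carry the same 16-bit value once the order is known.
print(int.from_bytes(utf16_le, "little") == 0x20AC)  # True
print(int.from_bytes(utf16_be, "big") == 0x20AC)     # True

# UTF-8: the code units are bytes, so the sequence itself is the encoding.
utf8 = ch.encode("utf-8")
print(utf8.hex())  # e282ac
```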

Regards

Antoine.




