[Tutor] name shortening in a csv module output
Steven D'Aprano
steve at pearwood.info
Fri Apr 24 03:55:59 CEST 2015
On Thu, Apr 23, 2015 at 05:40:34PM -0400, Dave Angel wrote:
> On 04/23/2015 05:08 PM, Mark Lawrence wrote:
> >
> >Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :)
> >
>
> As I recall, it stands for "Byte Order Mark". Applicable only to
> multi-byte storage formats (eg. UTF-16), it lets the reader decide
> which of the formats were used.
>
> For example, a file that reads
>
> fe ff 41 00 42 00
>
> might be a big-endian version of UTF-16
>
> while
> ff fe 00 41 00 42
>
> might be the little-endian version of the same data.
Almost :-)
You have the string "AB", two characters. In ASCII, Latin-1, Mac-Roman,
UTF-8, and many other encodings, that is represented by two code points.
I'm going to use "U+ hex digits" as the symbol for code points, to
distinguish them from raw bytes which won't use the U+ prefix.
string "AB" gives code points U+41 U+42
They get written out to a single byte each, and so we get
41 42
as the sequence of bytes (still written in hex).
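To check this at the interactive prompt, here is a small sketch. The variable names are just for illustration, and it assumes Python 3.8 or later for the separator argument to bytes.hex():

```python
text = "AB"

# Code points, written in the "U+ hex digits" style used above.
code_points = [f"U+{ord(ch):02X}" for ch in text]
print(code_points)

# Each of these encodings writes one byte per character for this string.
single_byte = {
    enc: text.encode(enc).hex(" ")
    for enc in ("ascii", "latin-1", "mac_roman", "utf-8")
}
for enc, hex_bytes in single_byte.items():
    print(enc, hex_bytes)   # every line ends in: 41 42
```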
In UTF-16, those two characters are represented by the same two code
points, *but* the "code unit" is two bytes rather than one:
U+0041 U+0042
with leading zeroes included. Each code unit gets written out as a
two-byte quantity:
On little-endian systems like Intel hardware: 4100 4200
On big-endian systems like Motorola hardware: 0041 0042
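You don't need both kinds of hardware to see this. A rough sketch using Python's explicit "utf-16-le" and "utf-16-be" codecs, which pick the byte order regardless of the machine and do not add a BOM (variable names are mine):

```python
text = "AB"

little = text.encode("utf-16-le")   # least significant byte first
big = text.encode("utf-16-be")      # most significant byte first

# Group the hex output two bytes (one code unit) at a time.
print(little.hex(" ", 2))   # 4100 4200
print(big.hex(" ", 2))      # 0041 0042
```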
Insert the BOM, which is always code point U+FEFF:
On little-endian systems: FFFE 4100 4200
On big-endian systems: FEFF 0041 0042
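The same thing in Python, as a sketch: the codecs module provides the two BOM byte sequences as constants, and the plain "utf-16" codec reads the BOM on decoding to work out the byte order. The variable names are only for illustration.

```python
import codecs

text = "AB"

# Put the BOM in front of the encoded text, once for each byte order.
with_bom_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
with_bom_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(with_bom_le.hex(" ", 2))   # fffe 4100 4200
print(with_bom_be.hex(" ", 2))   # feff 0041 0042

# The BOM is just code point U+FEFF encoded like any other character.
same_as_encoding_feff = "\ufeffAB".encode("utf-16-le") == with_bom_le

# The "utf-16" codec uses the BOM to pick the byte order, then drops it.
decoded_le = with_bom_le.decode("utf-16")
decoded_be = with_bom_be.decode("utf-16")
print(decoded_le, decoded_be)    # AB AB
```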
If you take that file and read it back as Latin-1, you get:
little-endian: ÿþA\0B\0
big-endian: þÿ\0A\0B
Notice the \0 nulls? Your editor might complain that the file is a
binary file, and refuse to open it, unless you tell the editor it is
UTF-16.
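Here is a quick sketch of that misreading, starting from the raw bytes shown above and decoding them with the wrong codec on purpose (variable names are mine):

```python
# The two UTF-16 byte sequences, BOM included.
le_bytes = bytes.fromhex("fffe41004200")
be_bytes = bytes.fromhex("feff00410042")

# Latin-1 maps every byte to one character, so decoding never fails,
# but the BOM turns into two junk characters and every other byte is NUL.
as_latin1_le = le_bytes.decode("latin-1")
as_latin1_be = be_bytes.decode("latin-1")

print(repr(as_latin1_le))   # 'ÿþA\x00B\x00'
print(repr(as_latin1_be))   # 'þÿ\x00A\x00B'

# Decoding with the right codec recovers the text.
print(le_bytes.decode("utf-16"))   # AB
```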
> The same concept was used many years ago in two places I know of.
> Binary files representing faxes had "II" or "MM" at the beginning.
Yes, TIFF files use a similar scheme. They start with a two-byte
signature, "II" for little-endian (Intel) or "MM" for big-endian
(Motorola), followed by the number 42 stored in that byte order.
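As a sketch of the "II"/"MM" check from the quoted message, assuming the standard TIFF header layout (a two-byte order mark followed by the 16-bit number 42). The function name tiff_byte_order is made up for this example, it is not part of any library:

```python
import struct

def tiff_byte_order(header: bytes) -> str:
    """Return 'little' or 'big' for the first bytes of a TIFF file.

    Hypothetical helper: raises ValueError if the header does not
    look like a TIFF header.
    """
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    mark = header[:2]
    if mark == b"II":
        order, fmt = "little", "<H"
    elif mark == b"MM":
        order, fmt = "big", ">H"
    else:
        raise ValueError("no II or MM byte order mark")
    # The next two bytes hold 42, stored in the byte order just announced.
    (magic,) = struct.unpack(fmt, header[2:4])
    if magic != 42:
        raise ValueError("byte order mark not followed by 42")
    return order

print(tiff_byte_order(b"II*\x00"))   # little
print(tiff_byte_order(b"MM\x00*"))   # big
```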
--
Steve