Is this a bug? BOM decoded with UTF8

"Martin v. Löwis" martin at v.loewis.de
Thu Feb 10 15:24:58 EST 2005


pekka niiranen wrote:
> I have two files "my.utf8" and "my.utf16" which
> both contain BOM and two "a" characters.
> 
> Contents of "my.utf8" in HEX:
>     EFBBBF6161
> 
> Contents of "my.utf16" in HEX:
>     FEFF6161

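This is not true: in UTF-16, the byte string FEFF6161 does not
denote two "a" characters. After the BOM (FE FF, which signals
big-endian), the bytes 61 61 form a single 16-bit code unit,
the character U+6161. Two "a" characters would be FEFF00610061.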
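
To check this yourself, here is a small sketch in current Python 3
syntax (not what was available when this was written). The variable
names are only illustrative, and the byte strings are the ones from
the question plus a corrected one:

```python
# The bytes from "my.utf16" as given in the question.
as_posted = bytes.fromhex("FEFF6161")

# The "utf-16" codec reads the BOM (FE FF = big-endian), consumes it,
# and then decodes the remaining bytes two at a time.
single = as_posted.decode("utf-16")
print(len(single), hex(ord(single)))   # one character, code point 0x6161

# What a UTF-16 file with a BOM and two "a" characters would contain.
two_a = bytes.fromhex("FEFF00610061").decode("utf-16")
print(two_a)
```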

> Is there a trick to read UTF8 encoded file with BOM not decoded?

It's very easy: decode the data, then drop the first character
of the result if it is the BOM (U+FEFF).

The UTF-8 codec will never do this on its own.
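
A minimal sketch of that approach, again in Python 3 syntax. The
function name decode_utf8_drop_bom is made up for illustration; it
is not part of any library:

```python
def decode_utf8_drop_bom(data: bytes) -> str:
    """Decode UTF-8 bytes and drop a leading BOM (U+FEFF), if present."""
    text = data.decode("utf-8")      # the plain "utf-8" codec keeps the BOM
    if text.startswith("\ufeff"):
        text = text[1:]              # drop only the first character
    return text


# The bytes from "my.utf8" as given in the question.
raw = bytes.fromhex("EFBBBF6161")
print(repr(raw.decode("utf-8")))         # BOM still there as '\ufeff'
print(repr(decode_utf8_drop_bom(raw)))   # BOM removed
```

Python releases later than this message also include a separate
"utf-8-sig" codec that skips a leading BOM when decoding; the plain
"utf-8" codec still leaves it in place.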

Regards,
Martin


