Read file that starts with '\xff\xfe'
Piet van Oostrum
piet at cs.uu.nl
Wed Sep 10 12:24:30 CEST 2003
>>>>> Bob Gailer <bgailer at alum.rpi.edu> (BG) wrote:
BG> On Win 2K the Task Scheduler writes a log file that appears to be encoded.
BG> The first line is:
BG> My goal is to read this file and process it using Python string
BG> I am disappointed in the codecs module documentation. I had hoped to find
BG> the answer there, but can't.
BG> I presume this is an encoding, and that '\xff\xfe' defines the encoding.
BG> How does one map '\xff\xfe' to an "encoding".
It's Unicode, specifically Little Endian UTF-16, which is the standard encoding
on Win2K. The '\xff\xfe' is the Byte Order Mark (BOM), which identifies the
data as Little Endian.
>>> import codecs
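One way to map a leading byte sequence to an encoding name is to compare it against the BOM constants that the codecs module provides. The sketch below uses current Python 3 syntax (bytes literals) rather than the Python 2 of this thread, and the helper name sniff_bom and its lookup table are illustrative only, not part of any library.

```python
import codecs

# Hypothetical lookup table. BOMs are checked longest-first, because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for a recognised BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Calling sniff_bom on the first few bytes of the log file would be expected to report "utf-16-le" with a two-byte BOM; data without a BOM gives (None, 0) and needs some other way of choosing the encoding.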
But there is a trailing 0 byte missing (the data should have an even number of
bytes, as each character here occupies two bytes). This happens because you
assumed that a line ends with '\n', whereas in UTF-16LE it ends with '\n\x00'.
This also means you cannot read such lines with byte-oriented methods like
readline().
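To illustrate the point about line endings, here is a small sketch in Python 3 syntax. It rebuilds the first line quoted in this thread, cuts it right after the '\n' byte the way a byte-oriented readline() would, and checks that the cut-off chunk has an odd length and does not decode. The helper name try_decode is made up for this example.

```python
# Sample data: BOM followed by '"Task Scheduler Service"\r\n' in UTF-16 LE.
raw = b'\xff\xfe' + '"Task Scheduler Service"\r\n'.encode('utf-16-le')

# A byte-oriented readline() stops right after the first b'\n' byte,
# leaving the second byte of that character (b'\x00') behind.
cut = raw[:raw.index(b'\n') + 1]


def try_decode(data):
    """Decode BOM-marked UTF-16; return None if the bytes are truncated."""
    try:
        return data.decode('utf-16')
    except UnicodeDecodeError:
        return None


print(len(raw) % 2, len(cut) % 2)   # even length vs. odd length
print(repr(try_decode(raw)))        # the complete line decodes
print(repr(try_decode(cut)))        # the truncated line does not
```

The leftover '\x00' would also end up at the start of the next chunk, so every following line would be misaligned by one byte as well.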
>>> st='\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n\x00'
>>> st.decode('utf-16')
u'"Task Scheduler Service"\r\n'
>>> st.decode('utf-16').encode('ascii')
'"Task Scheduler Service"\r\n'
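For a whole log file, one approach in current Python is to let the I/O layer do the decoding, so that line splitting happens on characters rather than on bytes. This is a sketch in Python 3 syntax under that assumption; the function name read_utf16_lines is hypothetical, and it relies on the file starting with a BOM as described above.

```python
import io


def read_utf16_lines(path):
    """Return the file's lines as text, decoded from BOM-marked UTF-16."""
    # The 'utf-16' codec reads the BOM to pick the byte order and drops it.
    # newline='' keeps the original '\r\n' endings instead of translating them.
    with io.open(path, 'r', encoding='utf-16', newline='') as f:
        return f.readlines()
```

Each returned line is an ordinary text string, so the usual string methods apply to it; calling .rstrip('\r\n') on a line removes the Windows line ending.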
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl