windows utf8 & lxml
Steve D'Aprano
steve+python at pearwood.info
Tue Dec 27 05:46:35 EST 2016
On Tue, 20 Dec 2016 10:53 pm, Sayth Renshaw wrote:
> content.read().encode('utf-8'), parser=utf8_parser)
>
> However doing it in such a fashion returns this error:
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
> invalid start byte
That tells you that the XML file you have is not actually UTF-8.
You have a file that begins with a byte 0xFF. That is invalid UTF-8. No
valid UTF-8 string contains the byte 0xFF.
https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
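You can reproduce the error without lxml at all. Here is a minimal sketch, assuming the file starts with the bytes FF FE (the `data` value is made up for illustration, not taken from your file):

```python
# Hypothetical first four bytes of the problem file: FF FE, then "<"
# encoded as two bytes. Only the leading 0xFF matters here.
data = b"\xff\xfe<\x00"

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xFF is never legal in UTF-8, so the codec gives up at position 0.
    message = str(err)
    print(message)
```

If that prints the same complaint you are seeing, the problem is in the bytes themselves and not in how lxml is being called.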
So you need to consider:
- Are you sure that the input file is intended to be UTF-8? How was it
created?
- Is the second byte 0xFE? If so, that suggests that you actually have
little-endian UTF-16 with a byte-order mark (FF FE).
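
One way to check the second point is to look at the first few bytes yourself. This is only a rough sketch: `sniff_bom` is a name I have made up, the sample bytes are invented, and a file with no byte-order mark at all will simply give None.

```python
import codecs

def sniff_bom(raw):
    """Return a codec name suggested by a leading byte-order mark, or None."""
    # UTF-32 must be tested before UTF-16, because the little-endian
    # UTF-32 mark (FF FE 00 00) also starts with FF FE.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None

# Invented sample: a little-endian UTF-16 document with a BOM (FF FE).
sample = codecs.BOM_UTF16_LE + "<?xml version='1.0'?><a/>".encode("utf-16-le")

encoding = sniff_bom(sample)
print(sample[:2], encoding)
# The "utf-16" codec reads the BOM to pick the byte order, then drops it.
text = sample.decode(encoding)
```

For your real file, you would read the first four bytes in binary mode, something like open(filename, 'rb').read(4), and pass those to the function.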
--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.