Processing text data with different encodings

Steven D'Aprano steve at
Tue Jun 28 06:29:28 EDT 2016

On Tue, 28 Jun 2016 06:35 pm, Michael Welle wrote:

> my original data is email. The mail header says it's utf-8, but you will
> find three or four different encodings in one email. I think at the
> sending side they just glue different text fragments from different
> sources together without thinking about the encoding.

Is this spam? In my experience, the only email that is that badly
constructed is spam. I can't imagine how it could be email from a person,
coming from a mail client like Thunderbird or Outlook.
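
If the data really does mix encodings regardless of what the header claims, one rough way to cope is to decode it piece by piece and fall back through a list of likely encodings. The following is only a sketch under stated assumptions: the helper names decode_fragment and decode_mixed are made up for illustration, the fragments are assumed to change only at line boundaries, and cp1252 is just a guess at what the other encodings might be. Guessing encodings like this is a heuristic and can silently produce the wrong characters.

```python
def decode_fragment(raw, encodings=("utf-8", "cp1252")):
    """Decode one bytes fragment, trying each candidate encoding in order.

    The candidate list is an assumption; adjust it to whatever encodings
    actually turn up in the data.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort
    # never raises (but it may yield mojibake).
    return raw.decode("latin-1")


def decode_mixed(body):
    """Decode a bytes body line by line, assuming the encoding may
    differ from one line to the next."""
    return "".join(
        decode_fragment(line) for line in body.splitlines(keepends=True)
    )


# One line encoded as UTF-8, the next as cp1252, glued together.
body = "café\n".encode("utf-8") + "naïve\n".encode("cp1252")
text = decode_mixed(body)
```

Strict UTF-8 is tried first because text in a legacy 8-bit encoding is rarely valid UTF-8 by accident, so a successful UTF-8 decode is a fairly good sign that the guess is right.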

