[I18n-sig] Encoding auto-detection
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Sat, 2 Jun 2001 08:59:35 +0200
> It is *very* common for email to be sent making use of both 8-bit and
> 7-bit encodings with no content-type or content-transfer-encoding.
I think this claim is difficult to support with facts. Of the messages I
receive, most do have a MIME header, giving a charset in their
content.
> Indeed, when I was working on the Device Mosaic browser (the
> descendant of NCSA Mosaic that was targeted for embedded devices)
> if we found a document claiming to be Latin-1 we ignored it and
> sniffed the encoding.
That might be a useful thing to do, but I guess the routine you've
been using was way more complex than what MAL suggested for the
standard library. I doubt you can reliably detect Big 5 by looking at
the first 10 or so bytes of an HTML document.
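
To illustrate the point about short prefixes, here is a minimal Python
sketch. The sample document and the candidate list are made up for the
example, and the helper `decodes` is hypothetical. It shows only that
the first 10 bytes of a typical HTML page are plain ASCII markup, so
they decode cleanly under every candidate and say nothing about Big 5.

```python
# Hypothetical sample: ASCII markup first, Big 5 text only later on.
body = "中文".encode("big5")
doc = b"<html><head><title>" + body + b"</title></head></html>"
prefix = doc[:10]  # b"<html><hea"

candidates = ["ascii", "latin-1", "big5", "utf-8"]


def decodes(data, encoding):
    """Return True if data decodes without error under encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Every candidate accepts the prefix, so it cannot tell them apart.
prefix_ok = [enc for enc in candidates if decodes(prefix, enc)]

# The whole document rules some candidates out, but Latin-1 accepts
# any byte sequence, so a successful decode is still not proof.
full_ok = [enc for enc in candidates if decodes(doc, enc)]

print(prefix_ok)
print(full_ok)
```

Even with the whole document, more than one candidate survives, which
is why a real sniffer needs statistics rather than a trial decode.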
In fact, I'd suggest that HTML encoding detection is yet again
different from general-purpose encoding detection, since you'll have
to take the declared encoding (if any) into account.
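
A rough sketch of what "taking the declared encoding into account"
could look like, again in Python. The names `declared_encoding` and
`pick_encoding`, the 1024-byte scan limit, and the fallback list are
all assumptions made for this example; it is not the Device Mosaic
routine or anything proposed for the standard library. It treats the
declaration as the first candidate and only falls back when the bytes
do not decode under it.

```python
import re

# Loose pattern for charset=... in a <meta> tag; an assumption for
# this sketch, not a full HTML parser.
META_CHARSET = re.compile(
    rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def declared_encoding(data, scan_limit=1024):
    """Return the charset named near the start of data, or None."""
    match = META_CHARSET.search(data[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(data, fallbacks=("utf-8", "latin-1")):
    """Try the declared charset first, then the fallbacks in order."""
    declared = declared_encoding(data)
    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown declaration: move on to the next guess.
            continue
    return None


big5_text = "中文".encode("big5")
honest = b'<meta http-equiv="Content-Type" content="text/html; charset=big5">' + big5_text
lying = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' + big5_text

print(pick_encoding(honest))
print(pick_encoding(lying))
```

Note that the mislabelled document ends up as Latin-1 only because
Latin-1 accepts any bytes, so the fallback result is a guess and not
a detection.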
> Higher level protocols cannot be believed.
And neither can autodetection.
Regards,
Martin