Re[2]: [I18n-sig] Encoding auto-detection

Cyrus Shaoul
Sat, 02 Jun 2001 08:33:35 -0400

I have to agree with Tom. Where there is room for human error, there
will be lots of errors. I have personally seen many CGI scripts that
were sent data in unexpected encodings by buggy browsers. Some of these
browsers are still in use (e.g. IE 3.0), and I bet some future browser
will contain a similar bug.

Just my .02,


> This is a utopian idea that completely falls apart in the real world.
> It is *very* common for email to be sent making use of both 8-bit and
> 7-bit encodings with no content-type or content-transfer-encoding.
> Without some form of encoding/character set detection you have no idea
> what the mail message is encoded with. The fact that the mail RFCs
> dictate something is irrelevant.
> Similarly you can almost never trust the character encoding specified
> for web pages. I have seen a lot of pages that claim to be using
> CP1252 or ISO-8859-1 that are actually encoded with Shift-JIS or
> EUC-CN or Big 5. Indeed, when I was working on the Device Mosaic
> browser (the descendant of NCSA Mosaic that was targeted at
> embedded devices) if we found a document claiming to be Latin-1 we
> ignored it and sniffed the encoding.
> It is also common to find pages in Japan, China, and Korea that don't
> specify a character set or encoding at all... the authors make
> assumptions about the people viewing the pages, which may be false. I
> have also seen Japanese pages that contain Shift-JIS *and* EUC-JP
> encoded characters in the *same* document.
> Higher level protocols cannot be believed.
>     -tree
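
For illustration only, here is a minimal sketch of the kind of sniffing
described in the quoted message, written in Python with its standard
codecs. The function name sniff_decode and the candidate list are
hypothetical, not taken from Device Mosaic or any real browser. It
assumes that trial decoding is good enough: a declared Latin-1 or CP1252
label is ignored, and the first strict decoder that accepts the bytes
wins. That is a heuristic, since Shift-JIS and EUC-JP byte ranges
overlap and short inputs can decode "successfully" under the wrong
codec.

```python
# Hypothetical candidate order: strictest / least ambiguous codecs first.
CANDIDATES = ["ascii", "utf-8", "euc_jp", "shift_jis", "big5", "gb2312"]

# Labels that accept nearly any byte string, so a claim of one of these
# tells us very little about what the bytes really are.
UNTRUSTED_LABELS = {"iso-8859-1", "latin-1", "latin1", "cp1252", "windows-1252"}


def sniff_decode(data, declared=None):
    """Return (text, encoding_used) for a byte string.

    The declared label is tried first only if it is not one of the
    catch-all Western labels; otherwise it is ignored and we sniff.
    """
    trial_order = list(CANDIDATES)
    if declared and declared.lower() not in UNTRUSTED_LABELS:
        trial_order.insert(0, declared)

    for encoding in trial_order:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a label Python does not know: try the next one.
            continue

    # Last resort: Latin-1 maps every byte to a character, so this
    # never fails, but the result may be mojibake.
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    mislabeled = "日本語".encode("euc_jp")
    text, used = sniff_decode(mislabeled, declared="ISO-8859-1")
    print(used, text)
```

A document containing Shift-JIS and EUC-JP in the same file, as mentioned
above, would defeat this sketch, because it decodes the whole byte string
with a single codec.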