Unicode from Web to MySQL

Sat Dec 20 17:55:19 EST 2003

"Francis Avila" <francisgavila at yahoo.com> wrote in message news:vu9epa99laks26 at corp.supernews.com...

> In other words, there is no standard, 100% reliable method of getting the
> encoding of a web page.

There is a standard way. But you're right, it's not 100% reliable.

> In an ideal world, the http header would have it, and that's that.

That is actually not a good idea, because it would force http server
writers to parse the html header for the encoding. Instead they will get
away with a configuration parameter that declares all server files to be
in one encoding. But one day somebody will store a file in the wrong
encoding *for sure*. http header encoding is a bad idea.

> In the real world, you have to juggle various combinations
> of information, missing information, and disinformation from the http
> protocol header's info, the html file's meta info, and charset guessing
> algorithms (look for Enca).

It's not so bad. Web server and content editor writers are slowly
getting a clue. It used to be very bad, I remember it. I think the peak of
the problems was in 1998-99. But nowadays more than 99% of
web documents get the encoding right. So having a simple read_unicode()
method on the urlopener class would be very useful.
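
To make the idea concrete, here is a rough sketch of what such a helper
might look like. It is not an existing urllib API: read_unicode(),
decode_page() and the two charset_from_*() helpers are hypothetical names
made up for illustration. The lookup order (http header first, then the
html meta tag, then a utf-8 fallback) is also an assumption, and the
regular expression is a crude stand-in for a real html parser.

```python
import re
from email.message import Message
from urllib.request import urlopen

# Crude pattern for charset=... inside a <meta> tag (illustrative only).
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def charset_from_meta(body, limit=4096):
    """Look for a charset declaration in the first bytes of the page."""
    match = META_RE.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(body, content_type=None, fallback="utf-8"):
    """Decode raw page bytes, trying header, meta tag, then the fallback."""
    candidates = (charset_from_content_type(content_type),
                  charset_from_meta(body),
                  fallback)
    for name in candidates:
        if not name:
            continue
        try:
            return body.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or wrong charset, try the next candidate
    # latin-1 maps every byte to a character, so this never raises.
    return body.decode("latin-1")


def read_unicode(url):
    """Hypothetical helper: fetch a URL and return its body as str."""
    with urlopen(url) as resp:
        return decode_page(resp.read(), resp.headers.get("Content-Type"))
```

A charset guesser could be slotted into the candidate list before the
fallback for the remaining pages that declare nothing or declare the
wrong thing.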