character encoding conversion
christian.ergh at gmail.com
Sun Dec 12 20:29:59 CET 2004
Martin v. Löwis wrote:
> Dylan wrote:
>> Things I have tried include encode()/decode()
> This should work. If you somehow manage to guess the encoding,
> e.g. guess it as cp1252, then
> htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
> will give you a file that contains only ASCII characters, and
> character references for everything else.
> Now, how should you guess the encoding? Here is a strategy:
> 1. use the encoding that was sent through the HTTP header. Be
> absolutely certain to not ignore this encoding.
> 2. use the encoding in the XML declaration (if any).
> 3. use the encoding in the http-equiv meta element (if any)
> 4. use UTF-8
> 5. use Latin-1, and check that there are no characters in range(128, 160)
> 6. use cp1252
> 7. use Latin-1
> In the order from 1 to 6, check whether you manage to decode
> the input. Notice that in step 5, you will definitely get successful
> decoding; consider this a failure if you get any control
> characters (from range(128, 160)); then try Latin-1 again in step 7.
> When you find the first encoding that decodes correctly, encode
> it with ascii and xmlcharrefreplace, and you won't need to worry
> about the encoding, anymore.
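In Python 3 terms, the quoted decode/encode call might look like the sketch below. It assumes the page arrived as a bytes object and that cp1252 is the right guess; `raw` is a made-up sample, not data from the thread.

```python
# Hypothetical sample: the bytes a cp1252-encoded page might contain.
raw = "Grüße aus Köln".encode("cp1252")

# Decode with the guessed encoding, then re-encode as ASCII. Every
# non-ASCII character becomes a numeric character reference (&#NNN;).
text = raw.decode("cp1252")
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(ascii_bytes)  # b'Gr&#252;&#223;e aus K&#246;ln'
```

The result is plain ASCII, so it no longer matters which encoding the consumer assumes, as long as it understands character references.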
I have a similar problem, with characters like äöüAÖÜß and so on. I am
extracting some content out of webpages, and they deliver whatever,
sometimes not even giving any encoding information in the header. But
your solution sounds quite good, I just do not know
- whether it works with the characters I mentioned,
- what encoding I have in the end,
- and how exactly you do all this. All with somestring.decode()
or...? Can you please give an example for these 7 steps?
Thanks in advance for the help
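
A minimal sketch of the seven quoted steps, assuming Python 3 and a page already fetched as bytes. The function name `decode_html`, its `http_charset` parameter, and the two regular expressions are illustrative assumptions, not something from the thread; real pages may need a proper HTML parser to find the declared encoding.

```python
import re

# Assumed patterns for steps 2 and 3; deliberately simple.
XML_DECL = re.compile(rb"""<\?xml[^>]*encoding=["']([A-Za-z0-9._:-]+)""")
META = re.compile(rb"""<meta[^>]*charset=["']?([A-Za-z0-9._:-]+)""",
                  re.IGNORECASE)

def decode_html(raw, http_charset=None):
    """Return (text, encoding_used) for the bytes of a fetched page."""
    candidates = []
    if http_charset:                      # step 1: HTTP header
        candidates.append(http_charset)
    head = raw[:2048]
    for pattern in (XML_DECL, META):      # steps 2 and 3
        match = pattern.search(head)
        if match:
            candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")            # step 4

    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue                      # wrong or unknown encoding

    # Step 5: Latin-1 always decodes, so reject it if the result
    # contains control characters from range(128, 160).
    text = raw.decode("latin-1")
    if not any(128 <= ord(ch) < 160 for ch in text):
        return text, "latin-1"

    try:                                  # step 6: cp1252
        return raw.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        pass

    return text, "latin-1"                # step 7: Latin-1 regardless


# Usage: a Latin-1 page with no declared encoding.
text, enc = decode_html("äöüÄÖÜß".encode("latin-1"))
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(enc, ascii_bytes)
```

Whatever encoding the page started in, what you hold after decoding is a Unicode string; the final encode step turns it into pure ASCII with the umlauts and ß written as character references.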