Problem reading file with umlauts
Michiel Overtoom
motoom at xs4all.nl
Tue Jul 7 11:05:32 EDT 2009
Claus Hausberger wrote:
> I have a text file which is encoded in Latin1 (ISO-8859-1). I can't
> change that as I do not create those files myself. I have to read
> those files and convert the umlauts like ö to stuff like &ouml; as
> the text files should become html files.
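
For the conversion described here (Latin-1 in, HTML entities out), a minimal sketch could look like the following. It assumes Python 3, where html.entities is part of the standard library. The names to_html_entities and convert_file are made up for illustration.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp in codepoint2name:
            # Named entity, e.g. 246 -> "&ouml;".
            out.append("&%s;" % codepoint2name[cp])
        else:
            # No HTML 4 name known: fall back to a numeric reference.
            out.append("&#%d;" % cp)
    return "".join(out)


def convert_file(src, dst):
    # Hypothetical helper: decode Latin-1, write pure ASCII.
    # newline="" keeps the original line endings untouched.
    with open(src, encoding="latin-1", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="ascii", newline="") as f:
        f.write(to_html_entities(text))
```

Note that this leaves &, < and > alone. If the text can contain those, they would need escaping as well (html.escape does that) before the entity replacement.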
umlaut-in.txt:
----
This file is contains data in the unicode
character set and is encoded with utf-8.
Viele Röhre. Macht spaß! Tsüsch!
umlaut-in.txt hexdump:
----
000000: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E This file is con
000010: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68 tains data in th
000020: 65 20 75 6E 69 63 6F 64 65 0D 0A 63 68 61 72 61 e unicode..chara
000030: 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 20 cter set and is
000040: 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 66 encoded with utf
000050: 2D 38 2E 0D 0A 56 69 65 6C 65 20 52 C3 B6 68 72 -8...Viele R..hr
000060: 65 2E 20 4D 61 63 68 74 20 73 70 61 C3 9F 21 20 e. Macht spa..!
000070: 20 54 73 C3 BC 73 63 68 21 0D 0A 00 00 00 00 00 Ts..sch!.......
umlaut.py:
----
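# -*- coding: utf-8 -*-
import codecs

# Read the whole file, decoding the UTF-8 bytes to a unicode string.
text = codecs.open("umlaut-in.txt", encoding="utf-8").read()

# Replace each special character with an ASCII transliteration.
text = text.replace(u"ö", u"oe")
text = text.replace(u"ß", u"ss")
text = text.replace(u"ü", u"ue")

# The result is pure ASCII, so a plain text-mode file can take it.
# codecs.open() left the CRLF line ends intact, and on Windows text
# mode expands each LF again, hence the 0D 0D 0A in the dump below.
of = open("umlaut-out.txt", "w")
of.write(text)
of.close()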
umlaut-out.txt:
----
This file is contains data in the unicode
character set and is encoded with utf-8.
Viele Roehre. Macht spass! Tsuesch!
umlaut-out.txt hexdump:
----
000000: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E This file is con
000010: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68 tains data in th
000020: 65 20 75 6E 69 63 6F 64 65 0D 0D 0A 63 68 61 72 e unicode...char
000030: 61 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 acter set and is
000040: 20 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 encoded with ut
000050: 66 2D 38 2E 0D 0D 0A 56 69 65 6C 65 20 52 6F 65 f-8....Viele Roe
000060: 68 72 65 2E 20 4D 61 63 68 74 20 73 70 61 73 73 hre. Macht spass
000070: 21 20 20 54 73 75 65 73 63 68 21 0D 0D 0A 00 00 ! Tsuesch!.....
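
The 0D 0D 0A sequences in the output dump appear because the text still contains CRLF when it goes through a text-mode file on Windows. A sketch that keeps the line endings as they were, assuming Python 3, where open() accepts encoding and newline arguments. The names REPLACEMENTS, transliterate and transliterate_file are made up for illustration.

```python
# Characters to replace and their ASCII transliterations.
REPLACEMENTS = {"ö": "oe", "ß": "ss", "ü": "ue"}


def transliterate(text):
    """Apply every replacement in REPLACEMENTS to text."""
    for char, ascii_form in REPLACEMENTS.items():
        text = text.replace(char, ascii_form)
    return text


def transliterate_file(src, dst):
    # newline="" turns off newline translation in both directions,
    # so CRLF in the input stays CRLF in the output.
    with open(src, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(transliterate(text))
```

Called as transliterate_file("umlaut-in.txt", "umlaut-out.txt"), this should write single 0D 0A line ends on any platform.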
--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html