Problem reading file with umlauts

Claus Hausberger CHausberger at gmx.de
Thu Jul 9 03:41:20 EDT 2009


Thanks a lot. I will try that on the weekend.

Claus

> Claus Hausberger wrote:
> > Thanks a lot. Now I am one step further but I get another strange error:
> > 
> > Traceback (most recent call last):
> >   File "./read.py", line 12, in <module>
> >     of.write(text)
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
> > 
> > According to Google, u'\ufeff' has something to do with byte order.
> > 
> > I use a Linux system; maybe this helps to find the error.
> > 
> 'text' contains Unicode, but you're writing it to a file that's not
> opened for Unicode. Either open the output file for Unicode:
> 
>      of = codecs.open("umlaut-out.txt", "w", encoding="latin1")
> 
> or encode the text before writing:
> 
>      text = text.encode("latin1")
> 
> (I'm assuming you want the output file to be in Latin1.)
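A note on the u'\ufeff' in the traceback above: that character is the byte order mark (BOM), and Latin1 has no code for it, so it most likely has to be dropped before the text can be written as Latin1. Below is a minimal sketch in Python 3 syntax (the thread itself uses Python 2, where codecs.open accepts the same "utf-8-sig" codec name). It assumes the input really is UTF-8 with a leading BOM; the temporary directory and the sample sentence are made up for the illustration.

```python
import codecs
import os
import tempfile

# Hypothetical scratch files, used only for this illustration.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "umlaut-in.txt")
dst = os.path.join(workdir, "umlaut-out.txt")

# Create a UTF-8 input file that starts with a byte order mark (EF BB BF).
with open(src, "wb") as f:
    f.write(codecs.BOM_UTF8 + "Viele R\u00f6hre.\n".encode("utf-8"))

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
# newline="" keeps line endings exactly as they are in the file.
with open(src, encoding="utf-8-sig", newline="") as f:
    text = f.read()

# Write the text back out as Latin1, naming the encoding explicitly.
with open(dst, "w", encoding="latin-1", newline="") as f:
    f.write(text)

# Read the raw bytes to confirm the output is single-byte Latin1.
with open(dst, "rb") as f:
    raw = f.read()
```

If the input has no BOM, "utf-8-sig" behaves like plain "utf-8" when reading, so it should be safe to use in both cases.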
> 
> > 
> >> Claus Hausberger wrote:
> >>
> >>> I have a text file which is encoded in Latin1 (ISO-8859-1). I can't
> >>> change that as I do not create those files myself. I have to read
> >>> those files and convert the umlauts like ö to stuff like &ouml; as
> >>> the text files should become html files.
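The conversion asked for here (umlauts to HTML entities) can be sketched with the standard library's codepoint2name table. This is written for Python 3, where the table lives in html.entities; in Python 2 it is in the htmlentitydefs module. The function name to_html_entities is made up for this example, and the sketch does not escape &, < or >, which real HTML output would also need.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace every non-ASCII character with an HTML entity.

    Characters with a named entity become &name;, all other
    non-ASCII characters become a numeric reference &#NNNN;.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                            # plain ASCII, keep as-is
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])   # e.g. &ouml;
        else:
            out.append("&#%d;" % cp)                  # numeric fallback
    return "".join(out)


sample = "Viele R\u00f6hre. Macht spa\u00df!"
converted = to_html_entities(sample)
print(converted)
```

The result is pure ASCII, so it can be written to a file in any ASCII-compatible encoding without a UnicodeEncodeError.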
> >> umlaut-in.txt:
> >> ----
> >> This file is contains data in the unicode
> >> character set and is encoded with utf-8.
> >> Viele Röhre. Macht spaß!  Tsüsch!
> >>
> >>
> >> umlaut-in.txt hexdump:
> >> ----
> >> 000000: 54 68 69 73 20 66 69 6C  65 20 69 73 20 63 6F 6E This file is con
> >> 000010: 74 61 69 6E 73 20 64 61  74 61 20 69 6E 20 74 68 tains data in th
> >> 000020: 65 20 75 6E 69 63 6F 64  65 0D 0A 63 68 61 72 61 e unicode..chara
> >> 000030: 63 74 65 72 20 73 65 74  20 61 6E 64 20 69 73 20 cter set and is 
> >> 000040: 65 6E 63 6F 64 65 64 20  77 69 74 68 20 75 74 66 encoded with utf
> >> 000050: 2D 38 2E 0D 0A 56 69 65  6C 65 20 52 C3 B6 68 72 -8...Viele R..hr
> >> 000060: 65 2E 20 4D 61 63 68 74  20 73 70 61 C3 9F 21 20 e. Macht spa..! 
> >> 000070: 20 54 73 C3 BC 73 63 68  21 0D 0A 00 00 00 00 00  Ts..sch!.......
> >>
> >>
> >> umlaut.py:
> >> ----
> >> # -*- coding: utf-8 -*-
> >> import codecs
> >> text=codecs.open("umlaut-in.txt",encoding="utf-8").read()
> >> text=text.replace(u"ö",u"oe")
> >> text=text.replace(u"ß",u"ss")
> >> text=text.replace(u"ü",u"ue")
> >> of=open("umlaut-out.txt","w")
> >> of.write(text)
> >> of.close()
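The chain of replace() calls in umlaut.py can also be written as one lookup table. A minimal sketch in Python 3 syntax, assuming the usual German transliteration; the name TRANSLIT and the extra entries beyond ö, ß and ü are additions for illustration and are not part of the script above.

```python
# Hypothetical transliteration table; only the entries for
# o-umlaut, sharp s and u-umlaut come from umlaut.py.
TRANSLIT = {
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00c4": "Ae",  # A-umlaut
    "\u00d6": "Oe",  # O-umlaut
    "\u00dc": "Ue",  # U-umlaut
    "\u00df": "ss",  # sharp s
}

# str.maketrans accepts a dict that maps single characters to strings.
table = str.maketrans(TRANSLIT)

line = "Viele R\u00f6hre. Macht spa\u00df!  Ts\u00fcsch!"
result = line.translate(table)
print(result)
```

One table keeps all the mappings in a single place and makes a single pass over the text.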
> >>
> >>
> >> umlaut-out.txt:
> >> ----
> >> This file is contains data in the unicode
> >> character set and is encoded with utf-8.
> >> Viele Roehre. Macht spass!  Tsuesch!
> >>
> >>
> >> umlaut-out.txt hexdump:
> >> ----
> >> 000000: 54 68 69 73 20 66 69 6C  65 20 69 73 20 63 6F 6E This file is con
> >> 000010: 74 61 69 6E 73 20 64 61  74 61 20 69 6E 20 74 68 tains data in th
> >> 000020: 65 20 75 6E 69 63 6F 64  65 0D 0D 0A 63 68 61 72 e unicode...char
> >> 000030: 61 63 74 65 72 20 73 65  74 20 61 6E 64 20 69 73 acter set and is
> >> 000040: 20 65 6E 63 6F 64 65 64  20 77 69 74 68 20 75 74  encoded with ut
> >> 000050: 66 2D 38 2E 0D 0D 0A 56  69 65 6C 65 20 52 6F 65 f-8....Viele Roe
> >> 000060: 68 72 65 2E 20 4D 61 63  68 74 20 73 70 61 73 73 hre. Macht spass
> >> 000070: 21 20 20 54 73 75 65 73  63 68 21 0D 0D 0A 00 00 !  Tsuesch!.....
> >>
> >>
> >>
> >>
> >>
> > 
> 
