Help opening and reading text files in ISO-8859-1

Ken Seehof kseehof at neuralintegrator.com
Sat Jan 12 03:04:41 EST 2002


Actually your problems don't really have anything to do with
character set issues.

re.split() returns a list, not a string.  List objects are displayed
with [] brackets.  Also, strings inside containers such as lists are
displayed in their repr() form.  The repr() of a string contains only
ASCII characters, so non-ASCII printable characters such as umlauts
(and control characters such as tabs and newlines) are shown in that
'\374' style octal escape notation.
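
To make the point about lists and repr() concrete, here is a minimal sketch.  It is written in Python 3 syntax, which is an assumption (the session in this message is Python 2), and the sample line and variable names are only illustrative.  In Python 3, repr() keeps the umlauts but still escapes tabs and newlines, so the effect is the same in kind.

```python
import re

# A sample line in the same shape as the word list
# (tabs before the chapter number, newline at the end).
line = "das Büro, Büros\t\t7\n"

parts = re.split(",", line)      # a list of strings, not a string

# Converting the list to text shows brackets, quotes and escaped
# control characters, because each element is displayed with repr().
as_list = str(parts)
print(as_list)

# Printing the elements themselves gives plain text without brackets.
cleaned = [part.strip() for part in parts]
as_text = ", ".join(cleaned)
print(as_text)
```

The brackets and escapes belong to the list's display form only; the strings stored in the list are unchanged.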

The buffering argument in open() does not do what you think it
does.  It sets the size of the buffer used for file I/O, not the
speed at which output appears on the screen.  Don't bother using it:
it won't have any noticeable effect in your example and you don't
need to learn about it now.  Instead, if you really just want the
output to appear more slowly, you could add a time.sleep(0.2) after
the print statement (remember to import time).  This would make it
print about 5 lines per second.  Alternatively, you could add this
to pause after every 10 lines:

lines_printed = 0
...
lines_printed = lines_printed + 1
if (lines_printed % 10) == 0:
   raw_input()   # wait for user to hit <enter>
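
The fragment above can be fleshed out into a small helper.  This is a sketch only, in Python 3 syntax (input() instead of raw_input()); the function name print_paged and its parameters are made up for illustration.

```python
def print_paged(lines, page_size=10, pause=input, out=print):
    """Print lines, pausing after every page_size lines.

    pause is called with no arguments to wait for the user; it
    defaults to input(), i.e. wait for <enter>.
    Returns the number of pauses that were made.
    """
    pauses = 0
    for lines_printed, line in enumerate(lines, start=1):
        out(line.rstrip("\n"))
        if lines_printed % page_size == 0:
            pause()
            pauses += 1
    return pauses


if __name__ == "__main__":
    sample = ["line %d\n" % n for n in range(1, 26)]
    # A do-nothing pause keeps this demo from blocking on the keyboard.
    print_paged(sample, page_size=10, pause=lambda: None)
```

For the slow-scroll variant, something like print_paged(lines, page_size=1, pause=lambda: time.sleep(0.2)) should behave like the time.sleep() suggestion.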


Try these examples, then experiment and read manuals until
each of these responses seems logical to you:

>>> s = "Büro"
>>> s
'B\374ro'
>>> print s
Büro
>>> print repr(s)
'B\374ro'
>>> x = [s]
>>> print x
['B\374ro']
>>> import re
>>> re.split('[,]', 'das Büro, Büros    7')
['das B\374ro', ' B\374ros    7']
>>> print 'das B\374ro'
das Büro

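By the way, the regular expression you are looking for is:
re.compile(r"([\w\s]*\w)(?:,\s*(\w*))?\s*(\d+)")
group 1 = "das Büro"
group 2 = "Büros"
group 3 = "7"
(For \w to match umlauts in an 8-bit string, you will probably also
need the re.LOCALE flag and a suitable locale setting.)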
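
As a rough check of the pattern, the sketch below runs it over two of the sample lines.  It assumes Python 3, where \w matches letters such as ü in ordinary (Unicode) strings without extra flags; the names ENTRY and parse_entry are hypothetical.

```python
import re

# The pattern from this message, written as a raw string.
ENTRY = re.compile(r"([\w\s]*\w)(?:,\s*(\w*))?\s*(\d+)")


def parse_entry(line):
    """Return (singular, plural, chapter) for one line, or None if no match.

    plural is None when the line has no comma-separated plural form.
    """
    match = ENTRY.match(line)
    if match is None:
        return None
    singular, plural, chapter = match.groups()
    return singular, plural, int(chapter)


for line in ["das Büro, Büros    7", "das Camping    9"]:
    print(parse_entry(line))
```

Entries without a plural, such as "das Camping", come back with None in the second position, so the caller has to allow for that case.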

If you need more help with this, I am available for a contract.
kseehof at neuralintegrator.com

> Hi-
>
>     Basically, I have this text file with words in German, that looks like
> this:
>
> das Appartement, Appartements  5
> das Auge, Augen    6
> das Bad, Bäder    5
> das Bein, Beine    6
> das Beispiel, Beispiele   6
> das Buch, Bücher    4
> das Büro, Büros    7
> das Café, Cafés    4
> das Camping    9
> das Dach, Dächer    5
>
>     (the numbers are chapter numbers)
>     When I open it with a little Python script (which I pasted below), I
> get this weird output:
>
> ['das Appartement', ' Appartements\011\0115\012']
> ['das Auge', ' Augen\011\011\011\0116\012']
> ['das Bad', ' B\344der\011\011\011\0115\012']
> ['das Bein', ' Beine\011\011\011\0116\012']
> ['das Beispiel', ' Beispiele\011\011\0116\012']
> ['das Buch', ' B\374cher\011\011\011\0114\012']
> ['das B\374ro', ' B\374ros\011\011\011\0117\012']
> ['das Caf\351', ' Caf\351s\011\011\011\0114\012']
>
>     In particular, two things bother me the most:
>     1) Where are my umlauts ("Büro", not "B\374ro"; same with "Café",
> etc.)
>     2) Does the output >have< to be with those horrible brackets?
>     3) There's no buffering, despite the fact that I set buffering = 10
> (see code below);
> the output just scrolllls too fast to read.
>
>     So you see, it's not the output I wished for.
>
>     The code I used was:
>
> import re
> filename = raw_input ('Enter file name: ')
> file = open (filename, 'r', 10)
> allLines = file.readlines()
> file.close()
> for eachLine in allLines:
>     string = re.split ('[,]' , eachLine)
>     print string
>
> wortmatch = '(\w$)(\w$)'
>
>     I've tried
>         print string.encode('iso-8859-1')
>     but it won't work.
>
>
>     Can you help me? I feel this is a fairly common problem for __all__
> those people whose alphabet is not covered in the ASCII charset. And the
> documentation is not good regarding this issue.
>
>     TIA,
>     Regs
>     HL





