[Tutor] Python for Windows: module re, re.LOCALE different for Idle and Python shell?

Steckel, Ralf Ralf.Steckel at AtosOrigin.com
Thu Jul 29 13:22:11 CEST 2004


Hi Steve,

Thanks for your suggestion (it prompted me to improve my script by
opening the file via codecs), but it doesn't fix the problem.

In my first script I used f = open() and lines = f.readlines() to read the
input. After converting the lines to unicode and printing the German Umlaute
as characters, Idle prints the Umlaute correctly, but the Python shell prints
some DOS special characters instead.
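
The display difference can be reproduced without a console. This is only a
sketch: it assumes the DOS window uses code page cp850 (an assumption, the
actual code page is not stated here) while the file is iso-8859-1. It shows
that the same byte stands for different characters in the two encodings.

```python
# Assumption: the console code page is cp850; the file is iso-8859-1.
raw = b'\xf6'                               # byte stored in the file for o-umlaut
as_latin1 = raw.decode('iso-8859-1')        # the character the file means
as_cp850 = raw.decode('cp850')              # the character a cp850 console shows
console_bytes = as_latin1.encode('cp850')   # the byte a cp850 console needs

print(repr(as_latin1))
print(repr(as_cp850))
print(repr(console_bytes))
```

If the assumption holds, raw bytes sent to the console are shown as other
symbols, while a unicode string that is encoded for the console is shown
correctly.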

Even with codecs.open and encoding = 'iso-8859-1', my basic problem remains:
in the Python shell, re doesn't recognize German Umlaute as valid word
characters.
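
One thing that might be worth trying, as a sketch and not a confirmed fix:
since the lines are unicode strings after decoding, the pattern can be
compiled with re.UNICODE instead of re.LOCALE, so that \w follows the Unicode
character database and not the process locale. The name word_re and the
hard-coded sample line are only for illustration.

```python
import re

# re.UNICODE makes \w follow Unicode character properties,
# so the match should not depend on the locale of the process.
word_re = re.compile(r'\w+', re.UNICODE)

# The first line of umlaute.txt after decoding from iso-8859-1.
line = u'\xf6\xe4\xfc\xc4\xd6\xdc\xdf\r\n'

words = word_re.findall(line)
print(words)
```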

Greetings,

Ralf

PS:
Please see the sample below:

-file umlaute.txt:
öäüÄÖÜß
End.
-end file umlaute.txt

-script: umlaute.py:

import codecs
import re

r = re.compile('[\w]+', re.LOCALE )

f = codecs.open('umlaute.txt', 'r', 'iso-8859-1')

lines = f.readlines()

for line in lines:
    print 'line', line
    l = len(line)
    i = 0
    while i < l:
        print 'character:', line[i], ord(line[i])
        i = i + 1

    words = r.findall(line)
    print 'words:', words

f.close()

dummy = raw_input('<RETURN>')
-end script

-output from Idle:
>>> ================================ RESTART ================================
>>> 
line öäüÄÖÜß 
character: ö 246
character: ä 228
character: ü 252
character: Ä 196
character: Ö 214
character: Ü 220
character: ß 223
character: 
13
character: 
10
words: [u'\xf6\xe4\xfc\xc4\xd6\xdc\xdf']
line End. 
character: E 69
character: n 110
character: d 100
character: . 46
character: 
13
character: 
10
words: [u'End']
<RETURN>
>>> 
-end output from Idle

-output from python shell:
D:\Src\Python\wordcount>python umlaute.py
line öäüÄÖÜß

character: ö 246
character: ä 228
character: ü 252
character: Ä 196
character: Ö 214
character: Ü 220
character: ß 223
13aracter:
character:
10
words: []
line End.

character: E 69
character: n 110
character: d 100
character: . 46
13aracter:
character:
10
words: [u'End']
<RETURN>

D:\Src\Python\wordcount>
-end output from python shell
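
A possible explanation, not confirmed here, is that the two environments run
with different LC_CTYPE settings, because re.LOCALE makes \w depend on the
current locale. The sketch below only inspects the setting; the helper name
describe_ctype_locale is made up for illustration. Running it in both Idle
and the Python shell would show whether the locales differ.

```python
import locale

def describe_ctype_locale():
    """Return the LC_CTYPE setting before and after adopting the user's default."""
    before = locale.setlocale(locale.LC_CTYPE)          # query only, no change
    try:
        after = locale.setlocale(locale.LC_CTYPE, '')   # adopt the user's default
    except locale.Error:
        after = before                                  # default locale not usable
    return before, after

before, after = describe_ctype_locale()
print('LC_CTYPE at start: ' + before)
print('LC_CTYPE after setlocale: ' + after)
```

If the shell reports 'C' at start while Idle reports a German locale, calling
locale.setlocale(locale.LC_CTYPE, '') at the top of the script might give the
command line the same behaviour as Idle.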


> -----Original Message-----
> From: Steve [mailto:lonetwin at gmail.com]
> Sent: Thursday, July 29, 2004 11:57 AM
> To: Steckel, Ralf
> Subject: Re: [Tutor] Python for Windows: module re, re.LOCALE different
> for Idle and Python shell?
> 
> 
> Hi Ralf,
>         Just a wild guess here ....haven't actually tried this ..
> 
> On Thu, 29 Jul 2004 09:25:45 +0200, Steckel, Ralf
> <ralf.steckel at atosorigin.com> wrote:
> > i've written a python script to extract all words from a text file and to
> > print how often they are used. For doing that i use the re module with:
> > 
> > r=re.compile('[\w]+', re.LOCALE | re.IGNORECASE)
> <...snip...> 
> > My question is: how do i get for the command line the same environment as
> > for Idle?
> > 
> > I guess this is rather a Windows question than a Python one, because Windows
> > and DOS both support German 'Umlaute', but it seems they do it with
> > different character codes.
> 
>          How are you actually passing the contents of the file to the
> re expression? Probably you'd have to enforce your particular
> encoding before having re parse the string. Something like:
> 
> s = file('foo.txt').read()
> s = unicode(s, <your_encoding>)   # keep the decoded result
> re.search(r, s)
> 
> HTH
> Steve
> 

