[Tutor] more encoding confusion

Fri Aug 3 21:13:00 CEST 2007

Jon Crump wrote:
> I'm parsing a utf-8 encoded file with lines characterized by placenames 
> in all caps thus:
> 
> HEREFORD, Herefordshire.
> ..other lines..
> HÉRON (LE), Normandie.
> ..other lines..
> 
> I identify these lines for parsing using
> 
> for line in data:
>     if re.match(r'[A-Z]{2,}', line):
> 
> but of course this catches HEREFORD, but not HÉRON.
> 
> What sort of re test can I do to catch lines whose defining 
> characteristic is that they begin with two or more adjacent utf-8 
> encoded capital letters?

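First you have to decode the file's bytes to Unicode strings. [A-Z] 
covers only the ASCII capitals, and É is a single character outside 
that range once it has been decoded. Then collect every character that 
Unicode treats as uppercase and build a regex character class from 
them. For example, something like this: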

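import re
import sys

# Read the raw bytes, decode them as UTF-8, and split into unicode lines.
data = open('data.txt').read().decode('utf-8').splitlines()

# Every code point that unicode.isupper() accepts; the + 1 makes the
# range include sys.maxunicode itself.
uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode + 1)
                     if unichr(i).isupper())
upperRe = u'^[%s]{2,}' % uppers

for line in data:
    if re.match(upperRe, line):
        pass  # handle the placename line here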

With a tip of the hat to
http://tinyurl.com/yrl8cy
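
For anyone reading this on Python 3, where str is already Unicode and 
unichr/xrange are gone, the same approach might look roughly like the 
sketch below. The helper name placename_lines and the sample lines are 
made up for illustration, and re.escape is only a precaution, since no 
uppercase letter should be special inside a character class.

```python
import re
import sys

# Collect every code point that str.isupper() reports as uppercase.
uppers = ''.join(chr(i) for i in range(sys.maxunicode + 1)
                 if chr(i).isupper())

# Two or more of those characters at the start of the line.
# re.match already anchors at the start, so no '^' is needed.
upper_re = re.compile('[%s]{2,}' % re.escape(uppers))


def placename_lines(lines):
    """Yield the lines that begin with two or more uppercase letters.

    Hypothetical helper, not part of the original code.
    """
    for line in lines:
        if upper_re.match(line):
            yield line


# Sample input modelled on the lines in the question; \u00c9 is É.
sample = [
    'HEREFORD, Herefordshire.',
    '..other lines..',
    'H\u00c9RON (LE), Normandie.',
]
matches = list(placename_lines(sample))
```

With a real file you would pass open('data.txt', encoding='utf-8') 
instead of the sample list, so that the decoding happens as the lines 
are read.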

Kent