[Tutor] more encoding confusion

Jon Crump jjcrump at myuw.net
Fri Aug 3 20:09:40 CEST 2007


I'm parsing a utf-8 encoded file with lines characterized by placenames in 
all caps thus:

HEREFORD, Herefordshire.
..other lines..
HÉRON (LE), Normandie.
..other lines..

I identify these lines for parsing using

import re
for line in data:
    if re.match(r'[A-Z]{2,}', line):  # [A-Z] covers ASCII capitals only
        ...  # parse the placename line

Of course this catches HEREFORD but not HÉRON, since É lies outside the 
ASCII range A-Z.
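
To make the failure concrete, here is a minimal check. It assumes Python 3 
and that the lines have already been decoded from UTF-8 into str; the 
`samples` list and the name `ascii_caps` are made up for illustration, using 
the two placename lines quoted above.

```python
import re

# Two sample lines taken from the post; the second has a non-ASCII capital.
samples = ["HEREFORD, Herefordshire.", "HÉRON (LE), Normandie."]

# The original pattern: two or more ASCII capitals at the start of the line.
ascii_caps = re.compile(r'[A-Z]{2,}')

# True where the pattern matches at the start of the line, False otherwise.
matched = [bool(ascii_caps.match(s)) for s in samples]
print(matched)  # [True, False]
```

For HÉRON the pattern matches H, then stops at É, so the `{2,}` requirement 
is never met.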

What sort of re test can I do to catch lines whose defining characteristic 
is that they begin with two or more adjacent capital letters, including 
non-ASCII ones such as É that are stored as UTF-8 in the file?
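
One possible approach, sketched here under assumptions rather than offered 
as a definitive answer: decode the lines to Unicode first, then test the 
leading characters by their Unicode category instead of with `[A-Z]`. The 
sketch assumes Python 3 (where reading a file opened with encoding="utf-8" 
already yields str). The helper name `starts_with_capitals` and its 
`minimum` parameter are hypothetical. It uses `unicodedata` in place of a 
regular expression, because the standard `re` module has no 
"uppercase letter" character class.

```python
import unicodedata

def starts_with_capitals(line, minimum=2):
    """Return True if line begins with at least `minimum` uppercase letters.

    Hypothetical helper: "uppercase letter" here means Unicode category Lu.
    """
    # Compose sequences such as E + combining acute into a single É,
    # so that each accented capital counts as one character.
    head = unicodedata.normalize("NFC", line)[:minimum]
    return len(head) == minimum and all(
        unicodedata.category(ch) == "Lu" for ch in head
    )

# Sample lines from the post, already decoded to str.
data = [
    "HEREFORD, Herefordshire.",
    "..other lines..",
    "HÉRON (LE), Normandie.",
]
placename_lines = [line for line in data if starts_with_capitals(line)]
print(placename_lines)
```

With the three sample lines, this keeps both the HEREFORD and the HÉRON 
lines and skips the other one.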

