[Tutor] more encoding confusion
Kent Johnson
kent37 at tds.net
Fri Aug 3 21:13:00 CEST 2007
Jon Crump wrote:
> I'm parsing a utf-8 encoded file with lines characterized by placenames
> in all caps thus:
>
> HEREFORD, Herefordshire.
> ..other lines..
> HÉRON (LE), Normandie.
> ..other lines..
>
> I identify these lines for parsing using
>
> for line in data:
>     if re.match(r'[A-Z]{2,}', line):
>
> but of course this catches HEREFORD, but not HÉRON.
>
> What sort of re test can I do to catch lines whose defining
> characteristic is that they begin with two or more adjacent utf-8
> encoded capital letters?
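First you have to decode the file to a Unicode string. [A-Z] covers only
the 26 ASCII capitals, and in undecoded UTF-8 an É is two bytes rather
than one character.
Then collect every character that Unicode considers uppercase and use
that set as the character class of a regex. For example, something like
this: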
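import re
import sys

# Decode the bytes to Unicode first, then split into lines.
data = open('data.txt').read().decode('utf-8').splitlines()
# Collect every character that Unicode treats as uppercase.
uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode)
                  if unichr(i).isupper())
upperRe = u'^[%s]{2,}' % uppers
for line in data:
    if re.match(upperRe, line):
        pass  # process the placename line here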
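For anyone on Python 3, the same idea might look roughly like the sketch
below. It assumes Python 3, where str is already Unicode, so unichr and
xrange become chr and range. The names UPPERS, UPPER_RE and
placename_lines, and the sample lines, are made up for illustration and
are not part of the code above.

```python
import re
import sys

# Every code point that str.isupper() reports as uppercase (assumes Python 3).
UPPERS = ''.join(chr(i) for i in range(sys.maxunicode + 1)
                 if chr(i).isupper())

# Two or more uppercase characters at the start of a line.
UPPER_RE = re.compile('^[%s]{2,}' % UPPERS)


def placename_lines(lines):
    """Return the lines that begin with two or more uppercase characters.

    Hypothetical helper, named here only for the example.
    """
    return [line for line in lines if UPPER_RE.match(line)]


# Illustrative sample standing in for the decoded file contents.
sample = [
    'HEREFORD, Herefordshire.',
    'some other line',
    'HÉRON (LE), Normandie.',
    'Hereford again, but not in caps.',
]

for line in placename_lines(sample):
    print(line)
```

To read the real file in Python 3, open('data.txt', encoding='utf-8')
decodes the text as it reads, so no separate decode step is needed.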
With a tip of the hat to
http://tinyurl.com/yrl8cy
Kent