[Tutor] UTF-8 title() string method
Jon Crump
jjcrump at myuw.net
Thu Jul 5 19:29:54 CEST 2007
On Wed, 4 Jul 2007, Kent Johnson wrote:
> First, don't confuse unicode and utf-8.
Too late ;-) already pitifully confused.
> Second, convert the string to unicode and then title-case it, then convert
> back to utf-8 if you need to:
I'm having trouble figuring out where, in the context of my code, to
effect these translations. In parsing the text file, I depend on matching
a re:
    if re.match(r'[A-Z]{2,}', line):
to identify and process the place name data. If I translate the line to
unicode, the re fails.
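
One place these conversions could sit is at the edges of the loop: decode each line as it is read, do the matching and title-casing on the decoded text, and encode again only when writing out. What follows is a minimal sketch, not a tested fix for the program below. It is written in Python 3 spelling, where `str` plays the role of Python 2's `unicode` and `bytes` the role of the byte string. The sample line comes from the input file and the variable names are illustrative only.

```python
import re

# Bytes as they would arrive from a UTF-8 encoded file (sample line from the input).
raw_line = 'ALENÇON, Normandie.\n'.encode('utf-8')

# Decode once, at the point of reading.
line = raw_line.decode('utf-8')

# The ASCII-only pattern still matches on decoded text: it finds the leading 'ALEN'.
match = re.match(r'[A-Z]{2,}', line)

# Title-casing decoded text lowercases the C-cedilla along with the other letters.
titled = line.strip().title()

# Encode again only when bytes are needed for output.
encoded = titled.encode('utf-8')

print(match.group(0))
print(titled)
```

In this sketch the regex is only used to recognise a place-name line, so it does not matter that it stops at the first non-ASCII letter.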
The whole program isn't very long, so at the risk of embarrassing myself,
I'm including the whole ugly, kludgy thing below. I hope I'm not hereby
violating any conventions of the list. Kent will recognize the ranges()
function (which works a treat, by the way, and was very instructive,
thanks).
In addition to the title case problem, if anyone has pointers on how to
make this look a little less like a Frankenstein's monster (all improbably
bolted together), such tutelage would be gratefully received.
The end product is intended to be a simple xml file and, apart from the
title case problem, it works well enough. A sample of the text file input
is included at the bottom.
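
Since the titles end up inside XML attributes, it may be worth escaping them on the way out. The sketch below assumes the standard library's `xml.sax.saxutils.quoteattr` is acceptable for this. `event_element` is a hypothetical helper, and the second title is invented purely to show the escaping.

```python
from xml.sax.saxutils import quoteattr


def event_element(start, title):
    """Return a single <event/> element with the title safely quoted."""
    # quoteattr() supplies the surrounding quotes and escapes &, < and >.
    return '<event start="%s" title=%s />' % (start, quoteattr(title))


print(event_element('1199-11-03', 'Alençon, Normandie.'))
print(event_element('1200-01-11', 'Andely & environs'))
```

Note that the format string has no quotes around the title placeholder, because `quoteattr` adds them itself.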
#!/usr/bin/python
import re

input = open('sample.txt', 'r')
text = input.readlines()

months = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5,
          'June':6, 'July':7, 'August':8, 'September':9, 'October':10,
          'November':11, 'December':12}

def ranges(data):
    i = iter(data)
    first = last = i.next()
    try:
        while True:
            next = i.next()
            if next > last+1:
                yield (first, last)
                first = last = next
            else:
                last = next
    except StopIteration:
        yield (first, last)

def parse_month_string(monthstring, year, title):
    res=[]
    monthstring_regex = re.compile('^(\w+)\s+(\d.*)\.$')
    monthstring_elements = monthstring_regex.match(monthstring)
    month = monthstring_elements.group(1)
    days = ranges([int(x) for x in re.split(',',
                   monthstring_elements.group(2))])
    for start, end in days:
        if start == end:
            res.append('<event start="%s-%02d-%02d" title="%s" />' %
                       (year, months[month], start, title.strip()))
        else:
            res.append('<event start="%s-%02d-%02d" end="%s-%02d-%02d" '
                       'isDuration="true" title="%s" />' %
                       (year, months[month], start, year,
                        months[month], end, title.strip()))
    return res

def parse_year_string(yearstring, title):
    res=[]
    yearstring_regex = re.compile('(\d\d\d\d)\.\s+(\w+)\s+(\d.*)\.$')
    yearstring_elements = yearstring_regex.match(yearstring)
    year = yearstring_elements.group(1)
    month = yearstring_elements.group(2)
    days = ranges([int(x) for x in re.split(',',
                   yearstring_elements.group(3))])
    for start, end in days:
        if start == end:
            res.append('<event start="%s-%02d-%02d" title="%s" />' %
                       (year, months[month], start, title.strip()))
        else:
            res.append('<event start="%s-%02d-%02d" end="%s-%02d-%02d" '
                       'isDuration="true" title="%s" />' %
                       (year, months[month], start, year,
                        months[month], end, title.strip()))
    return res

def places(data):
    place=[data[0]]
    for line in data:
        if re.match(r'[A-Z]{2,}', line):
            if place:
                yield place
            place = []
            place.append(line.strip())
        elif re.match(r'(\d\d\d\d)\.\s+(\w+)\s+(\d.*)\.$', line):
            yearstring_regex = re.compile('(\d\d\d\d)\.\s+(\w+)\s+(\d.*)\.$')
            yearstring_elements = yearstring_regex.match(line)
            year = yearstring_elements.group(1)
            title = place[0]
            place.append(parse_year_string(line, title))
        elif re.match(r'^(\w+)\s+(\d.*)\.$', line):
            place.append(parse_month_string(line, year, title))
    yield place

for x in places(text):
    for y in x[1:]:
        for z in y:
            print z
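
On the request for tidying up: parse_month_string() and parse_year_string() differ only in how they obtain the year, month and day list, so the event formatting could live in one helper. This is a rough sketch under assumptions. It uses Python 3 spelling (`next(it)` rather than `it.next()`), and `format_events` and `MONTHS` are names invented here, not part of the program above.

```python
MONTHS = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5,
          'June': 6, 'July': 7, 'August': 8, 'September': 9, 'October': 10,
          'November': 11, 'December': 12}


def ranges(data):
    """Collapse sorted integers into (first, last) runs of consecutive values."""
    it = iter(data)
    try:
        first = last = next(it)
    except StopIteration:
        return  # empty input: yield nothing
    for value in it:
        if value > last + 1:
            yield (first, last)
            first = last = value
        else:
            last = value
    yield (first, last)


def format_events(year, month, days, title):
    """Build one <event/> string per run of consecutive days (hypothetical helper)."""
    events = []
    for start, end in ranges(days):
        if start == end:
            events.append('<event start="%s-%02d-%02d" title="%s" />'
                          % (year, MONTHS[month], start, title))
        else:
            events.append('<event start="%s-%02d-%02d" end="%s-%02d-%02d" '
                          'isDuration="true" title="%s" />'
                          % (year, MONTHS[month], start,
                             year, MONTHS[month], end, title))
    return events


for event in format_events('1202', 'August', [8, 9, 10, 12], 'Alençon, Normandie.'):
    print(event)
```

With a helper like this, the two parse functions would only need to extract their fields with the regex and hand them over.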
#############
Here begins a sample of the text file input:
ABERGAVENNY, Monmouthshire.
1211. March 12.
ALENÇON, Normandie.
1199. November 3.
1200. September 6, 7.
1201. July 18.
1202. February 20, 21.
August 8, 9, 10, 12.
September 29.
1202. October 3, 29.
December 7.
1203. January 15, 16, 17, 18, 19, 25.
August 11, 12, 13, 14, 15.
ALLERTON, Yorkshire.
1201. February 28.
1212. June 29.
September 1, 2, 6.
1213. February 6, 7.
September 16.
1216. January 6.
ANDELY (le Petit), Normandie.
1199. August 18, 19, 28, 29, 30, 31.
September 1.
October 21, 26, 27.
1200. January 11.
May 11, 12, 17.
May 18, 19, 20, 21, 22, 23, 24, 25, 26.
1201. June 9, 10, 11, 25, 26, 27.
October 23, 24, 25, 26, 28.
December 15.
1202. March 27, 28, 29.
April 4, 22, 23, 24, 25, 26, 28.
ANGERS, Anjou.
1200. June 18, 19, 20, 21.
1202. September 4, 15.
1206. September, 8, 9, 10, 11, 12, 13, 20, 21.
1214. June 17, 18.
ANGOULÊME, Angoumois.
1200. August 26.
1202. February 4, 5.
1214. March 13, 14, 15.
April 5, 6.
July 28, 29, 30.
August 17, 18.
ANVERS-LE-HOMONT, Maine.
1199. September 18.
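
One wrinkle visible in the sample above: a plain title() on the whole line would also capitalise the "le" in "(le Petit)". A possible way round it, sketched here in Python 3 and assuming the place name is everything before the first comma, is to title-case only the words that are entirely upper case. `title_place` is a hypothetical name.

```python
def title_place(line):
    """Title-case the all-caps words of the place name, leaving the rest alone."""
    # Split off the place name (before the first comma) from the region.
    name, comma, region = line.partition(',')
    # Only words written entirely in capitals are converted.
    words = [w.title() if w.isupper() else w for w in name.split(' ')]
    return ' '.join(words) + comma + region


for sample in ('ANDELY (le Petit), Normandie.',
               'ANGOULÊME, Angoumois.',
               'ANVERS-LE-HOMONT, Maine.'):
    print(title_place(sample))
```

Hyphenated names come out as "Anvers-Le-Homont" here, which may or may not be the capitalisation wanted.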