[Python-Dev] PEP 277 (unicode filenames): please review

Jack Jansen Jack.Jansen@oratrix.com
Wed, 14 Aug 2002 11:46:52 +0200


On Wednesday, August 14, 2002, at 08:33 , Martin v. Loewis wrote:
>> Do I misunderstand something, or is this a bug (limitation?) in the
>> unicode->latin-1 decoder?
>
> It's a limitation, in all codecs. Contributions of normalization code
> are welcome. Since this is hard work, this is unlikely to be fixed in
> Python 2.3 - unless somebody has a really good incentive for fixing
> it.

Why is this hard work? I would guess that a simple table lookup would
suffice; after all, there are only a finite number of Unicode characters
that can be split up, and each one can be split up in only a small
number of ways.

Wouldn't something like
i = 0
while i < len(input):
    c = input[i]
    if c in canbestartofcombiningsequence:
        # Try the longest candidate sequence first.
        for nlookahead in range(MAXCHARSTOCOMBINE, 1, -1):
            attempt = input[i:i+nlookahead]
            if attempt in combine:
                output.append(combine[attempt])
                i += nlookahead    # skip the lookahead in input
                break
        else:
            # No sequence starting at c can be combined.
            output.append(c)
            i += 1
    else:
        output.append(c)
        i += 1
do the trick, if the two dictionaries are initialized intelligently?
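
To make the idea concrete, here is a minimal runnable sketch of that table lookup in present-day Python. It is only an illustration under stated assumptions: the names combine, canbestart and MAXCHARSTOCOMBINE follow the pseudocode above, compose() is a hypothetical helper, and the tables are derived from unicodedata.normalize("NFD", ...) for the Latin-1 range only rather than written out by hand.

```python
import unicodedata

# Assumption: only characters in the Latin-1 upper range are of interest,
# mirroring the unicode->latin-1 case discussed in this thread.
# combine maps a decomposed sequence to its precomposed character.
combine = {}
for code in range(0xA0, 0x100):
    ch = chr(code)
    decomposed = unicodedata.normalize("NFD", ch)
    if decomposed != ch:
        combine[decomposed] = ch

# First characters of every sequence that might be combined.
canbestart = {seq[0] for seq in combine}
# Longest sequence in the table (2 for this particular table).
MAXCHARSTOCOMBINE = max(len(seq) for seq in combine)


def compose(text):
    """Replace known decomposed sequences by their precomposed character."""
    output = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in canbestart:
            # Try the longest candidate sequence first.
            for n in range(MAXCHARSTOCOMBINE, 1, -1):
                attempt = text[i:i + n]
                if attempt in combine:
                    output.append(combine[attempt])
                    i += n
                    break
            else:
                # Nothing starting at c can be combined.
                output.append(c)
                i += 1
        else:
            output.append(c)
            i += 1
    return "".join(output)


# "cafe" followed by COMBINING ACUTE ACCENT now fits into latin-1.
encoded = compose("cafe\u0301").encode("latin-1")
```

A plain lookup like this only covers sequences that appear in the table exactly as stored. It does not reorder multiple combining marks, which a full normalization would also have to handle.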
		
--
- Jack Jansen        <Jack.Jansen@oratrix.com>        http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -