[Python-Dev] PEP 277 (unicode filenames): please review
Jack Jansen
Jack.Jansen@oratrix.com
Wed, 14 Aug 2002 11:46:52 +0200
On Wednesday, August 14, 2002, at 08:33, Martin v. Loewis wrote:
>> Do I misunderstand something, or is this a bug (limitation?) in the
>> unicode->latin-1 decoder?
>
> It's a limitation, in all codecs. Contributions of normalization code
> are welcome. Since this is hard work, this is unlikely to be fixed in
> Python 2.3 - unless somebody has a really good incentive for fixing
> it.
Why is this hard work? I would guess that a simple table lookup would
suffice. After all, there are only a finite number of Unicode characters
that can be split up, and each one can be split up in only a small
number of ways.
Wouldn't something like
    for c in input:
        if not canbestartofcombiningsequence.has_key(c):
            output.append(c)
            continue
        nlookahead = MAXCHARSTOCOMBINE
        while nlookahead > 1:
            attempt = lookahead next nlookahead characters from input
            if combine.has_key(attempt):
                output.append(combine[attempt])
                skip the lookahead in input
                break
            nlookahead = nlookahead - 1
        else:
            output.append(c)
do the trick, if the two dictionaries are initialized intelligently?
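
For concreteness, here is a minimal runnable sketch of the same table-lookup idea in present-day Python. The names build_tables and compose are hypothetical, chosen only for this illustration. The sketch assumes that the Latin-1 range is all we care about and that the tables can be derived from unicodedata.decomposition(). It deliberately ignores the reordering of multiple combining marks, so it is not a full normalizer.

```python
import unicodedata


def build_tables(limit=0x100):
    """Build the two lookup tables for code points below `limit`.

    Hypothetical helper: maps each canonical decomposition sequence
    (e.g. 'e' + U+0301) back to its precomposed character.
    """
    combine = {}
    for cp in range(limit):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters without a decomposition, and compatibility
        # decompositions, which are tagged like '<compat> 0020 0308'.
        if not decomp or decomp.startswith("<"):
            continue
        seq = "".join(chr(int(h, 16)) for h in decomp.split())
        if len(seq) > 1:
            combine[seq] = ch
    can_start = {seq[0] for seq in combine}
    max_chars = max(len(seq) for seq in combine)
    return combine, can_start, max_chars


COMBINE, CAN_START, MAX_CHARS_TO_COMBINE = build_tables()


def compose(text):
    """Replace known base+combining sequences by precomposed characters."""
    output = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in CAN_START:
            # Try the longest candidate sequence first, then shorter ones.
            for n in range(MAX_CHARS_TO_COMBINE, 1, -1):
                attempt = text[i:i + n]
                if attempt in COMBINE:
                    output.append(COMBINE[attempt])
                    i += len(attempt)
                    break
            else:
                # No sequence starting here is in the table.
                output.append(c)
                i += 1
        else:
            output.append(c)
            i += 1
    return "".join(output)


if __name__ == "__main__":
    # 'e' followed by COMBINING ACUTE ACCENT now encodes to Latin-1.
    print(compose("cafe\u0301").encode("latin-1"))
```

With tables limited to Latin-1, every sequence is two characters long, so the inner loop runs once per candidate. A wider table would contain longer sequences, and a complete solution would also have to handle combining marks that arrive in a different order, which may be part of why this was called hard work. The standard library's unicodedata.normalize("NFC", text) covers the general case.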
--
- Jack Jansen <Jack.Jansen@oratrix.com>
http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma
Goldman -