[Python-Dev] PEP 277 (unicode filenames): please review
Martin v. Loewis
martin@v.loewis.de
14 Aug 2002 20:35:41 +0200
Jack Jansen <Jack.Jansen@oratrix.com> writes:
> Why is this hard work? I would guess that a simple table lookup would
> suffice, after all there are only a finite number of unicode
> characters that can be split up, and each one can be split up in only
> a small number of ways.
Canonical decomposition requires more than that: you not only need to
apply the canonical decomposition mapping, but also need to put the
resulting characters into canonical order (if more than one combining
character applies to a base character).
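
As an illustration of the two steps named above (this sketch is not part
of the original message), the following assumes modern Python 3 and its
standard unicodedata module. The helper names decompose_char,
canonical_order and nfd_sketch are hypothetical. Hangul syllables are
deliberately left alone, since unicodedata.decomposition() returns an
empty mapping for them.

```python
import unicodedata


def decompose_char(ch):
    """Recursively apply the canonical decomposition mapping to one character."""
    mapping = unicodedata.decomposition(ch)
    # An empty mapping means "no decomposition"; a leading "<tag>" marks a
    # compatibility mapping, which canonical decomposition must not apply.
    if not mapping or mapping.startswith("<"):
        return [ch]
    result = []
    for code in mapping.split():
        result.extend(decompose_char(chr(int(code, 16))))
    return result


def canonical_order(chars):
    """Stable-sort every run of combining marks by canonical combining class."""
    out = []
    run = []  # pending combining marks that follow the last starter
    for ch in chars:
        if unicodedata.combining(ch) == 0:
            # A starter (class 0) ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def nfd_sketch(text):
    """Decompose each character, then put the result into canonical order."""
    chars = []
    for ch in text:
        chars.extend(decompose_char(ch))
    return canonical_order(chars)
```

For example, "q" followed by U+0307 (dot above, class 230) and U+0323
(dot below, class 220) has no precomposed form to look up, yet the two
marks still have to be swapped. A pure table lookup never sees that.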
In addition, a naïve implementation will consume large amounts of
memory. Hangul decomposition is better done algorithmically, as we
are talking about 11172 precomposed characters for Hangul alone.
> Wouldn't something like
>     for c in input:
>         if not canbestartofcombiningsequence.has_key(c):
>             output.append(c)
>         nlookahead = MAXCHARSTOCOMBINE
>         while nlookahead > 1:
>             attempt = lookahead next nlookahead bytes from input
>             if combine.has_key(attempt):
>                 output.append(combine[attempt])
>                 skip the lookahead in input
>                 break
>         else:
>             output.append(c)
> do the trick, if the two dictionaries are initialized intelligently?
No, that doesn't do canonical ordering. There is a lot more to
normalization; the hard work is really in understanding what has to be
done.
Regards,
Martin