[Python-Dev] Python and the Unicode Character Database

Alexander Belopolsky alexander.belopolsky at gmail.com
Tue Nov 30 19:59:29 CET 2010


On Tue, Nov 30, 2010 at 1:29 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
..
>> I am not sure this belongs to the locale module, however.  It seems to
>> me, something like 'unicodealgo' for unicode algorithms would be more
>> appropriate.
>
> It could simply be in unicodedata if you split the implementation into a
> core C part and some Python bits.
>

Splitting unicodedata may not be a bad idea.  The UCD contains many
more files than unicodedata covers. [1]  Hardcoding them all into the
unicodedata module is hard to justify, but some are quite useful.
For example, PropertyValueAliases.txt helps those who, like myself,
cannot remember what category names such as Pd or Zl stand for.
SpecialCasing.txt is required for proper casing, but is not currently
included in Python.  I would not want to change str.upper or str.title
because of this, but providing the raw data to someone who wants to
implement proper case mappings may not be a bad idea.  Blocks.txt is
certainly useful for any language-dependent processing.
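
To make the idea concrete, here is a rough sketch of what exposing such
raw data from Python code might look like.  Everything in it is an
assumption for illustration: the names GC_ALIASES, BLOCKS_SAMPLE,
parse_blocks, block_of and category_name are hypothetical and not part
of unicodedata, the alias table is a small hand-copied subset of the
"gc" rows in PropertyValueAliases.txt, and the block list is a short
excerpt in the Blocks.txt format rather than the full file.

```python
import bisect
import unicodedata

# Hand-copied subset of the "gc" (General_Category) rows in
# PropertyValueAliases.txt; the real file lists every category.
GC_ALIASES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
    "Pd": "Dash_Punctuation",
    "Zs": "Space_Separator",
    "Zl": "Line_Separator",
}

# A few lines in the Blocks.txt format: "start..end; Block Name".
# Illustrative excerpt only, not the full file.
BLOCKS_SAMPLE = """\
# Blocks.txt excerpt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
"""


def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        # Drop comments and blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks


def block_of(char, blocks):
    """Return the name of the block containing char, or None."""
    cp = ord(char)
    starts = [b[0] for b in blocks]
    # Find the last block whose start is <= cp, then check its end.
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return None


def category_name(char):
    """Long category name for char, or the abbreviation if unknown."""
    abbr = unicodedata.category(char)
    return GC_ALIASES.get(abbr, abbr)


BLOCKS = parse_blocks(BLOCKS_SAMPLE)
```

With these helpers, category_name("-") gives the long name for Pd and
block_of("a", BLOCKS) gives the block name for U+0061.  A real version
would read the complete files from the UCD directory instead of
embedding excerpts.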

On the other hand, I think we should keep Unicode data and Unicode
algorithms separate.  And the latter may not even belong in the Python
stdlib.

[1] http://unicode.org/Public/UNIDATA/

