individually updating unicodedata db?

Vlastimil Brom vlastimil.brom at
Tue Mar 23 15:18:17 CET 2010

2010/3/23 Gabriel Genellina <gagsl-py2 at>:
> En Mon, 22 Mar 2010 21:19:04 -0300, Vlastimil Brom
> <vlastimil.brom at> escribió:
>> I guess, I am stuck here, as I use the precompiled version supplied in
>> the windows installer and can't compile python from source to obtain
>> the needed unicodedata.pyd.
> You can recompile Python from source, on Windows, using the free Microsoft®
> Visual C++® 2008 Express Edition.
> Fetch the required dependencies using Tools\buildbot\external.bat, and then
> execute PCbuild\env.bat and build.bat. See readme.txt in that directory for
> details. It should build cleanly.
> --
> Gabriel Genellina
> --

Thanks for the hints; I probably screwed up some steps along the way,
but the result seems to be working for the most part. I'll try to
summarise it just for the record (also hoping to get further
feedback).

I used the official source tarball for Python 2.6.5 from:

In the unpacked sources, I edited the file:

import re # added
# UNIDATA_VERSION = "5.1.0" # changed to:

Furthermore the following text files were copied to the same directory



Furthermore, some files in the subdirectories are needed:

After running, the above headers are recreated from
the new unicode database and can be copied to the original locations
in the source


(while keeping the backups)

Trying to run
...\Python-2.6.5-src\Tools\buildbot\external.bat and other bat files,
I got quite a few path mismatches resulting in file ... not found

However, I was able to just open the solution file in Visual C++ 2008 Express:
set the build configuration to "release" and try to build the sources.

There were some errors in particular modules (which might be due to my
mistakes or omissions, as this probably shouldn't happen normally), but
the wanted
was generated and can be used in the original python installation:

the newly added characters, cf.:
seem to be available

 ⅐ (dec.: 8528)  (hex.: 0x2150) # ⅐ VULGAR FRACTION ONE SEVENTH (Number, Other)
 𐬀 (dec.: 68352)  (hex.: 0x10b00) # 𐬀 AVESTAN LETTER A (Letter, Other)

but some are not present; I noticed this for
the new CJK block - CJK Unified Ideographs Extension C (U+2A700..U+2B73F).
Probably this new range isn't taken into account for some reason.
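
The check described above can be sketched roughly as follows. This is only a minimal illustration, assuming a current Python 3 interpreter (where chr() covers the full code point range); the helper name probe is made up for this example and is not part of unicodedata.

```python
import unicodedata


def probe(code_points):
    """Map each code point to its character name, or None if the
    database in use has no name for it (hypothetical helper)."""
    return {cp: unicodedata.name(chr(cp), None) for cp in code_points}


# Version of the Unicode database the interpreter was built with.
print("Unicode database version:", unicodedata.unidata_version)

# Two of the newly added characters plus the first code point of
# CJK Unified Ideographs Extension C.
for cp, name in probe([0x2150, 0x10B00, 0x2A700]).items():
    print("U+%04X" % cp, name or "<no name in this database>")
```

A None result for a code point inside a block that the database should cover points at the kind of gap noticed for Extension C.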

All in all, I am happy to have the current version of the unicode
database available; I somehow expected this to be more complicated,
but on the other hand I can't believe this is the standard way of
preparing the built versions (with all the copying, checking and
replacing the files); it might be that the actual
distribution is built using some different tools (the trivial missing
import in would be found immediately, I guess).

I also wanted to ask whether the missing characters might be a result
of my error in updating the unicode database, or could it be a problem
with the itself?
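
One way to narrow this down would be to look at whether the range is present in the database text file at all. The sketch below assumes the standard UnicodeData.txt layout, in which large ranges are given only as a pair of "<..., First>" and "<..., Last>" entries; the function name find_ranges and the sample lines are illustrative, not taken from the actual files I copied.

```python
def find_ranges(lines):
    """Return {label: (first, last)} for every '<label, First>' /
    '<label, Last>' pair found in UnicodeData.txt-style lines."""
    starts = {}
    ranges = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            starts[name[1:-len(", First>")]] = code
        elif name.startswith("<") and name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            if label in starts:
                ranges[label] = (starts.pop(label), code)
    return ranges


# Illustrative sample in the UnicodeData.txt field layout.
sample = [
    "2150;VULGAR FRACTION ONE SEVENTH;No;0;ON;<fraction> 0031 2044 0037;;;1/7;N;;;;;",
    "2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;",
    "2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;",
]
for label, (first, last) in find_ranges(sample).items():
    print("%s: U+%04X..U+%04X" % (label, first, last))
```

If the range shows up in the text file but the built module still lacks it, the database update itself is probably not the cause.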

Thanks in advance and sorry for this long post.
