[Python-bugs-list] [ python-Bugs-834676 ] segfault in unicodedata module (hangul syllables)

SourceForge.net noreply at sourceforge.net
Thu Nov 6 15:52:41 EST 2003

Bugs item #834676, was opened at 2003-11-02 20:28
Message generated for change (Comment added) made by loewis
You can respond by visiting: 

Category: Unicode
Group: Python 2.3
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Matthias Klose (doko)
Assigned to: Martin v. Löwis (loewis)
Summary: segfault in unicodedata module (hangul syllables)

Initial Comment:
[forwarded from http://bugs.debian.org/218697]

start an interactive python2.3 interpreter. run the
following command, twice if necessary: 
this reliably segfaults python2.3 on both i686 and
although my testing has not been very extensive
(unicode is somewhat large and slightly complex), so
far i have seen the crash only when processing
pre-composed hangul syllables. decomposing them into
combining jamos before calling unicodedata.normalize
seems to avoid the crash, and i've included a wrapper
that does just that in this report. 
unfortunately, this method is used internally by
encodings.idna, so this means processing some
internationalized korean domain names can likely crash
any python program with support for internationalized
domain names. 
please do let me know if you would like more details,
or if there's anything further i can do to help! 
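
To illustrate how the idna codec reaches the normalization code, here is a
minimal sketch written for a current Python 3 interpreter (where this bug no
longer applies). The host name "한글.example" is a made-up example, not one
taken from the report.

```python
# Encoding a non-ASCII label with the built-in "idna" codec runs nameprep,
# which normalizes the label (NFKC) before punycode-encoding it.
# The host name below is hypothetical.
hostname = "\ud55c\uae00.example"  # U+D55C U+AE00, the precomposed word from the report

ascii_form = hostname.encode("idna")     # bytes, first label gets an "xn--" prefix
round_trip = ascii_form.decode("idna")   # back to the Unicode form

print(ascii_form, round_trip)
```

On an affected python2.3 build the encode step would be the one that passes
precomposed hangul to the normalization routine.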

as a workaround in my own python programs, i wrapped
unicodedata.normalize like this (see attachment).
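
The attachment is not reproduced here. The following is only a sketch of what
such a wrapper could look like, written for a current Python 3 interpreter; it
is not the reporter's code. The names decompose_hangul and safe_normalize are
hypothetical, and the constants are the standard ones from the Unicode hangul
syllable decomposition algorithm.

```python
import unicodedata

# Constants from the Unicode Standard's hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total


def decompose_hangul(text):
    """Replace each precomposed hangul syllable with its combining jamos."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if 0 <= s_index < S_COUNT:
            out.append(chr(L_BASE + s_index // N_COUNT))               # leading consonant
            out.append(chr(V_BASE + (s_index % N_COUNT) // T_COUNT))   # vowel
            t_index = s_index % T_COUNT
            if t_index:                                                # optional trailing consonant
                out.append(chr(T_BASE + t_index))
        else:
            out.append(ch)
    return "".join(out)


def safe_normalize(form, text):
    """Hypothetical wrapper: decompose hangul first, then normalize as usual."""
    return unicodedata.normalize(form, decompose_hangul(text))
```

Because the decomposition is canonical, the result of safe_normalize should
match what a fixed unicodedata.normalize returns for the same input.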


>Comment By: Martin v. Löwis (loewis)
Date: 2003-11-06 21:52

Logged In: YES 

This is now fixed in

test_normalization.py 1.9
unicodedata.c 2.29
NEWS 1.831.4.73

As an alternative work-around, adding three extra space
characters before normalization, and removing them
afterwards should also reliably avoid the crash.
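
A minimal sketch of that padding work-around, for a current Python 3
interpreter. It assumes the three spaces are appended at the end of the
string, which the comment above does not state explicitly; the name
padded_normalize is hypothetical.

```python
import unicodedata

PAD = "   "  # three space characters, as suggested above


def padded_normalize(form, text):
    """Normalize with three trailing spaces, then strip them again.

    A space has no decomposition and does not combine with anything,
    so it passes through all four normalization forms unchanged.
    """
    result = unicodedata.normalize(form, text + PAD)
    return result[:-len(PAD)]
```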


Comment By: Neal Norwitz (nnorwitz)
Date: 2003-11-03 16:17

Logged In: YES 

Here's some more info, hope it helps.

>>> __import__('unicodedata').normalize('NFC',u'\ud55c\uae00')
Debug memory block at address p=0x402ac410:
    20 bytes originally requested
    The 4 pad bytes at p-4 are FORBIDDENBYTE, as expected.
    The 4 pad bytes at tail=0x402ac424 are not all
        at tail+0: 0xaf *** OUCH
        at tail+1: 0x11 *** OUCH
        at tail+2: 0x00 *** OUCH
        at tail+3: 0x00 *** OUCH
    The block was made by call #19688 to debug malloc/realloc.
    Data at p: 12 11 00 00 61 11 00 00 ... 00 11 00 00 73 11 00 00
Fatal Python error: bad trailing pad byte

#7  0x08082a7c in _PyObject_DebugRealloc (p=0x402ac410,
    at Objects/obmalloc.c:1038
#8  0x0809f1b7 in unicode_resize (unicode=0x40299988, length=6)
    at Objects/unicodeobject.c:150
#9  0x0809f6da in PyUnicodeUCS4_Resize (unicode=0xbffff4bc,
    at Objects/unicodeobject.c:298
#10 0x405d702e in nfd_nfkd (input=0x40299928, k=0)
    at /home/neal/build/python/2_3/Modules/unicodedata.c:356
#11 0x405d7233 in nfc_nfkc (input=0x40299928, k=0)
    at /home/neal/build/python/2_3/Modules/unicodedata.c:412
#12 0x405d7616 in unicodedata_normalize (self=0x0,
    at /home/neal/build/python/2_3/Modules/unicodedata.c:517
#13 0x0810a182 in PyCFunction_Call (func=0x4052553c, arg=0x4052522c, kw=0x0)
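
The dump is consistent with a buffer sized for five code points receiving a
sixth. The sketch below checks that reading on a current Python 3 interpreter.
It assumes a little-endian UCS4 layout, as the PyUnicodeUCS4_Resize frame and
the byte order in the dump suggest; the variable names are illustrative.

```python
import struct
import unicodedata

# Canonical decomposition of the test string: six combining jamos.
decomposed = unicodedata.normalize("NFD", "\ud55c\uae00")
code_points = [ord(ch) for ch in decomposed]

requested_bytes = 20                    # "20 bytes originally requested"
needed_bytes = 4 * len(code_points)     # 4 bytes per UCS4 code point

# The four overwritten pad bytes at tail+0 .. tail+3, read as one
# little-endian 32-bit value.
tail_pad = bytes([0xAF, 0x11, 0x00, 0x00])
overflow_value = struct.unpack("<I", tail_pad)[0]

print([hex(cp) for cp in code_points])
print(requested_bytes, needed_bytes, hex(overflow_value))
```

The value found in the pad bytes equals the last jamo of the decomposition,
and the decomposed string needs 24 bytes where 20 were requested.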


Comment By: M.-A. Lemburg (lemburg)
Date: 2003-11-03 11:09

Logged In: YES 

Martin wrote this part. Assigning to him.


Comment By: Benjamin C. W. Sittler (bsittler)
Date: 2003-11-02 21:21

Logged In: YES 

fyi, i ran into this while adding an encoding similar to
IDNA [i believe it's an IDNA superset], but capable of
handling free text -- see
http://xent.com/~bsittler/icb_ace.py -- and my very first
test data was the word 한글, written in precomposed form.

