[Python-Dev] Security Advisory for unicode repr() bug?

Sun Oct 8 01:57:04 CEST 2006

(i'm not on python-dev, so i dunno whether this will make it through...)

basically, this bug does not affect the vast majority (mac and windows
users with UTF-16 "narrow" unicode Python builds) because the unpatched
code allocates sufficient memory in this case. only the minority
treating this as a serious vulnerability (linux users with UTF-32 "wide"
unicode Python builds, possibly some other Unix-like operating systems
too) are affected by the buffer overrun.

as for secunia, they need to do their own homework ;)

i found this bug and wrote the patch that's been applied by the linux
distros, so i thought i should clear up a couple of apparent
misconceptions. please pardon me if i'm writing stuff you already
know...

the bug concerns allocation in repr() for unicode objects. previously
repr() always allocated 6 bytes in the output buffer per input unicode
string element; this is enough for the six-byte "\uffff" notation and on
UTF-16 python builds enough for the ten-byte "\U0010ffff" notation,
since on UTF-16 python builds the input unicode string contains a
surrogate pair (two consecutive elements) to represent unicode
characters requiring this longer notation, meaning five bytes per
element. however on UTF-32 builds ten bytes per unicode string element
are needed, and this is what the patch accomplishes. the previous
(incorrect) algorithm extended the buffer by 100 bytes in some cases
when encountering such a character, however this fixed-size heuristic
extension fails when the string contains many subsequent characters in
the six-byte "\uffff" form, as demonstrated by this test which will fail
in an unpatched non-debug wide python build:

python2.4 -c 'assert(repr(u"\U00010000" * 39 + u"\uffff" * 4096)) == (repr(u"\U00010000" * 39 + u"\uffff" * 4096))'
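the arithmetic behind that test can be sketched as follows. this is
only a model of the byte counting described above, not the C code from
the interpreter: the helper names are made up for illustration, the
quotes and "u" prefix around the repr are ignored, and the 100-byte
extension is left out entirely.

```python
# hypothetical helpers; a model of the byte counting only

def escaped_len(cp):
    # "\U0010ffff" takes ten bytes, "\uffff" takes six
    return 10 if cp > 0xFFFF else 6

def elements(cp, wide):
    # a wide build stores one element per code point; a narrow build
    # stores a surrogate pair (two elements) above U+FFFF
    return 1 if wide or cp <= 0xFFFF else 2

def budget_and_need(code_points, wide):
    # budget: six bytes per string element, as in the unpatched code
    budget = 6 * sum(elements(cp, wide) for cp in code_points)
    # need: bytes actually written when every code point is escaped
    need = sum(escaped_len(cp) for cp in code_points)
    return budget, need

# same shape as the test string above
test_string = [0x10000] * 39 + [0xFFFF] * 4096

for wide in (False, True):
    budget, need = budget_and_need(test_string, wide)
    print("wide" if wide else "narrow", budget, need, budget - need)
```

under these assumptions the narrow build has 78 bytes to spare for this
string, while the wide build comes up 156 bytes short.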

yes, a sufficiently motivated person could probably discover enough
about the memory layout of a process to use this for data or code
injection, but the more usual (and sometimes accidental) consequence is
a crash.

more background:

python comes in two flavors, UTF-16 ("narrow") and UTF-32 ("wide"),
depending on how the unicode chars are represented. This is
generally configured to match the C library's wchar_t.

UTF-16: Windows (at least 32-bit builds), Mac OS X (at least 32-bit
builds), probably others too -- this uses a 16-bit variable-length
encoding for Unicode characters: 1 16-bit word for U+0000 ... U+FFFF
(identity mapped to 0x0000 ... 0xffff resp., a.k.a. the "UCS-2" range or
Basic Multilingual Plane) and 2 16-bit words for U+00010000 ...
U+0010FFFF (mapped as "surrogate pairs" to 0xd800; 0xdc00 ... 0xdbff;
0xdfff resp., corresponding to planes 1 through 16.)
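for reference, the surrogate pair mapping can be sketched like this. it
is the standard UTF-16 arithmetic, written as a hypothetical helper
(the name to_surrogate_pair is mine, not something from the python
source):

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into two 16-bit words."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in planes 1 through 16")
    offset = cp - 0x10000             # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)    # top 10 bits go in the first word
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go in the second word
    return high, low

print([hex(w) for w in to_surrogate_pair(0x10000)])
print([hex(w) for w in to_surrogate_pair(0x10FFFF)])
```

the two calls cover the endpoints of the range given above: the first
yields 0xd800; 0xdc00 and the second 0xdbff; 0xdfff.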

UTF-32/UCS-4: Linux, possibly others? -- this uses 1 32-bit word per
unicode character: 1 word for all codepoints allowed by Python
U+0000 ... U+0010FFFF (identity mapped to 0x00000000L ... 0x0010ffffL
resp.)
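if you are unsure which flavor a given interpreter is, sys.maxunicode
should tell you: it is 0xffff on a narrow build and 0x10ffff on a wide
one. a small sketch (build_flavor is a hypothetical helper name; the
optional argument is only there so the logic can be exercised without
both kinds of build at hand):

```python
import sys

def build_flavor(maxunicode=None):
    # default to the running interpreter's own limit
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # narrow builds top out at U+FFFF per string element
    return "narrow" if maxunicode == 0xFFFF else "wide"

print(build_flavor())
```

by the description above, only an interpreter reporting "wide" would be
exposed to the overrun.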

> On 10/7/06, skip[at]pobox.com <skip[at]pobox.com> wrote:
> >
> > Georg> [ Bug http://python.org/sf/1541585 ]
> >
> > Georg> This seems to be handled like a security issue by linux
> > Georg> distributors, it's also a news item on security related pages.
> >
> > Georg> Should a security advisory be written and official patches be
> > Georg> provided?
> >
> > I asked about this a few weeks ago. I got no direct response. Secunia sent
> > mail to webmaster and the SF project admins asking about how this could be
> > exploited. (Isn't figuring that stuff out their job?)
>
> FWIW, I responded to the original mail from Secunia with what little I
> know about the problem. Everyone on the original mail was copied.
> However, I got ~30 bounces for all the Source Forge addresses due to
> some issue between SF and Google mail.
>
> n