[Python-Dev] Security implications of pep 383

Victor Stinner victor.stinner at haypocalc.com
Tue Mar 29 22:23:14 CEST 2011

On Tuesday, 29 March 2011 at 19:23 +0100, Michael Foord wrote:
> Hey all,
> Not sure how real the security risk is here:
>      http://blog.omega-prime.co.uk/?p=107
> Basically he is saying that if you store a list of blacklisted files
> with names encoded in big-5 (or some other non-utf8 compatible encoding) 
> if those names are passed at the command line, or otherwise read in and 
> decoded from an assumed-utf8 source with surrogate escaping, the 
> surrogate escape decoded names will not match the properly decoded 
> blacklisted names.

Yes, if you decode the same byte string using two different encodings,
you get different Unicode strings. It's not related to surrogateescape
(PEP 383).

Sorry, '\u4f60\u597d'.encode('big5').decode('latin1') doesn't give you
'\u4f60\u597d' but '§A¦n', and it doesn't warn you that latin1 is not
big5 (there is no UnicodeDecodeError, even if the error handler is
strict).
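
A minimal sketch of the point above, assuming Python 3. The variable
names are only illustrative. It decodes the same Big5 bytes three ways
and shows that neither latin1 nor UTF-8 with surrogateescape raises an
error, yet neither yields the original string:

```python
name = '\u4f60\u597d'
raw = name.encode('big5')  # the bytes as stored on a Big5 system

# latin1 maps every byte to a code point, so decoding never fails.
as_latin1 = raw.decode('latin1')

# surrogateescape turns each undecodable byte into a lone surrogate
# (U+DC80..U+DCFF) instead of raising.
as_utf8 = raw.decode('utf-8', 'surrogateescape')

# Only the correct codec gives the original string back.
as_big5 = raw.decode('big5')

print(ascii(as_latin1))  # differs from name
print(ascii(as_utf8))    # differs from name, contains lone surrogates
print(as_big5 == name)

# The surrogateescape result still round-trips to the original bytes.
print(as_utf8.encode('utf-8', 'surrogateescape') == raw)
```

In all three cases the bytes are preserved or recoverable; what differs
is only the Unicode string you compare against.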

I think that the example has two issues:

 - security using blacklists doesn't work (it is better to use 
   a whitelist)
 - if filenames are stored as Big5, they must be decoded from Big5,
   and so the locale encoding must be Big5
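
To illustrate the second point, here is a hedged sketch (Python 3) of
what happens when a Big5 filename reaches a program that assumed UTF-8,
and how the original bytes can be recovered and decoded with the right
codec. The blacklist contents and the helper name redecode are
hypothetical, not taken from the blog post:

```python
# Hypothetical blacklist, stored on disk as Big5 and decoded as Big5.
blacklist_bytes = ['\u4f60\u597d'.encode('big5')]
blacklist = {b.decode('big5') for b in blacklist_bytes}

# What a program would see if the same bytes were decoded as UTF-8
# with surrogateescape (e.g. under a UTF-8 locale).
arg = blacklist_bytes[0].decode('utf-8', 'surrogateescape')

def redecode(text, assumed='utf-8', actual='big5'):
    """Undo a surrogateescape decode and decode again with the real codec."""
    return text.encode(assumed, 'surrogateescape').decode(actual)

print(arg in blacklist)            # the mis-decoded name does not match
print(redecode(arg) in blacklist)  # the correctly decoded name matches
```

This only works if you know the real encoding, which is the reason the
locale encoding should be Big5 in the first place.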

I don't understand the last paragraph:

"P.P.S I will further note that you get the same issue even if the
blacklist and filename had been in UTF-8, but this time it gets broken
from a terminal in the Big5 locale. I didn’t show it this way around
because I understand that Python 3 may only have just recently started
using the locale to decode argv, rather than being hardcoded to UTF-8."

Python's filesystem encoding is hardcoded to UTF-8 only on Mac OS X; on
other operating systems, it is the locale encoding.
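
A quick way to check which encodings a given interpreter uses (a small
sketch; the output depends on the platform and locale, so none is shown
here):

```python
import locale
import sys

# Encoding Python uses to convert filenames between str and bytes.
fs_encoding = sys.getfilesystemencoding()

# Encoding preferred by the current locale settings.
locale_encoding = locale.getpreferredencoding(False)

print('filesystem encoding:', fs_encoding)
print('locale encoding:', locale_encoding)
```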

