[issue9769] PyUnicode_FromFormatV() doesn't handle non-ascii text correctly

STINNER Victor report at bugs.python.org
Fri Nov 19 21:06:23 CET 2010


STINNER Victor <victor.stinner at haypocalc.com> added the comment:

On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky at users.sourceforge.net> added the comment:
> 
> I don't understand Victor's argument in msg115889.  According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
> 
>    -  US-ASCII values do not appear otherwise in a UTF-8 encoded
>       character stream.  This provides compatibility with file systems
>       or other software (e.g. the printf() function in C libraries) that
>       parse based on US-ASCII values but are transparent to other
>       values.

Most C functions, including printf(), work on multi*byte* strings, not on 
(wide) character strings, whereas PyUnicode_FromFormatV() converts the format 
string (bytes) to unicode (characters). If you would like a comparison in C, 
it's like printf()+mbstowcs() in the same function.
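
A minimal sketch of that comparison, assuming a fixed 256-byte scratch buffer 
and the current C locale for the decoding step. format_to_wide() is a made-up 
name for illustration; it is not a Python or libc function:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not a Python or libc API): format into a byte
   buffer with vsnprintf(), then decode those bytes with mbstowcs().
   Returns the number of wide characters written, or (size_t)-1 if
   formatting fails, the result is truncated, or decoding fails. */
size_t
format_to_wide(wchar_t *out, size_t outlen, const char *format, ...)
{
    char bytes[256];   /* assumed scratch size, chosen for the example */
    va_list vargs;
    size_t written;
    int n;

    assert(out != NULL && outlen > 0 && format != NULL);

    /* Step 1: the printf() part, bytes in and bytes out. */
    va_start(vargs, format);
    n = vsnprintf(bytes, sizeof(bytes), format, vargs);
    va_end(vargs);
    if (n < 0 || (size_t)n >= sizeof(bytes))
        return (size_t)-1;

    /* Step 2: the mbstowcs() part, bytes decoded to wide characters
       using the encoding of the current locale. */
    written = mbstowcs(out, bytes, outlen);
    if (written == (size_t)-1 || written >= outlen)
        return (size_t)-1;   /* invalid byte sequence or no room for L'\0' */
    return written;
}
```

With a pure ASCII format the bytes decode the same way in the usual locales. 
Once the format contains bytes above 127, the result depends on the locale's 
encoding, and mbstowcs() may fail outright.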

> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. 

That may be true with bytes input and bytes output (e.g. 
PyString_FromFormatV() in Python 2), but it is no longer true with bytes input 
and str output (e.g. PyUnicode_FromFormatV() in Python 3).
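
For contrast, a small sketch of the bytes-in, bytes-out case, where the 
non-ASCII bytes of the format are simply copied through. format_cafe_count() 
is a hypothetical helper, and the pass-through behaviour is what common C 
libraries do in practice, not something verified here against the C standard:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "cafe: <count>" with the final 'e'
   replaced by U+00E9 encoded as UTF-8. Returns snprintf()'s result. */
int
format_cafe_count(char *buf, size_t size, int count)
{
    assert(buf != NULL && size > 0);
    /* "\xc3\xa9" is U+00E9 in UTF-8. These two bytes are part of the
       format itself; snprintf() only looks for '%' and copies the rest. */
    return snprintf(buf, size, "caf\xc3\xa9: %d", count);
}
```

No decoding happens at any point, so the function never needs to know which 
ASCII-compatible encoding the format is in.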

> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.

I chose ASCII instead of UTF-8 because a UTF-8 decoder is long (210 lines) 
and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder is 
just "unicode_char = (Py_UNICODE)byte;" plus an if beforehand to check that 
0 <= byte <= 127.
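
To illustrate, an ASCII decoder along those lines might look like the sketch 
below. decode_ascii() and unicode_char_t are names invented for this example 
(unicode_char_t stands in for Py_UNICODE so the code compiles without 
Python.h); this is not the code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Py_UNICODE so the sketch compiles without Python.h. */
typedef uint32_t unicode_char_t;

/* Hypothetical helper: decode len bytes of ASCII into out, which must
   hold at least len elements. Returns 0 on success, or -1 at the first
   byte above 127, storing the index of that byte in *errpos. */
int
decode_ascii(const unsigned char *bytes, size_t len,
             unicode_char_t *out, size_t *errpos)
{
    size_t i;

    assert(bytes != NULL && out != NULL && errpos != NULL);
    for (i = 0; i < len; i++) {
        /* An unsigned char is never negative, so only the upper bound
           of 0 <= byte <= 127 needs a test. */
        if (bytes[i] > 127) {
            *errpos = i;
            return -1;
        }
        out[i] = (unicode_char_t)bytes[i];
    }
    return 0;
}
```

The whole decoder is one comparison and one cast per byte, with no state 
carried between bytes.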

Nobody noticed my change because the whole Python code base only passes ASCII 
strings as the format argument of PyUnicode_FromFormatV().

Victor

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9769>
_______________________________________

