[Python-Dev] py3k: accept unicode for 'c' and byte for 'C' in getarg?

Tue Mar 17 13:52:16 CET 2009

Hi,

While looking at issue #3446, I realised that getarg('c') (get a byte) accepts 
not only a byte string of length 1, but also a unicode string of 1 character 
(if the character code is in [0; 255]). I don't think it's a good idea to 
accept unicode here. Example: b"x".center(5, "\xe9") should raise a TypeError.

The "C" format (get a character) has the opposite problem: it accepts both 
bytes and unicode, whereas bytes should be rejected. Example: 
mmap.write_byte('é') should raise a TypeError.
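
To make the proposal concrete, here is a rough Python sketch of the strict 
behaviour I have in mind. The names strict_c and strict_C are made up for 
illustration only; the real change would be in the C code that parses the 
format, and the error messages below are just placeholders.

```python
def strict_c(value):
    """Hypothetical strict 'c': accept exactly one byte, reject unicode.

    Returns the byte as an integer in [0; 255].
    """
    if isinstance(value, (bytes, bytearray)) and len(value) == 1:
        return value[0]
    raise TypeError(
        "must be a byte string of length 1, not %s" % type(value).__name__
    )


def strict_C(value):
    """Hypothetical strict 'C': accept exactly one character, reject bytes.

    Returns the character code as an integer.
    """
    if isinstance(value, str) and len(value) == 1:
        return ord(value)
    raise TypeError(
        "must be a unicode character, not %s" % type(value).__name__
    )
```

With these rules, strict_c(b"x") and strict_C("\xe9") succeed, while 
strict_c("\xe9") and strict_C(b"x") raise a TypeError.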

The problem was already discussed in the email thread "What type of object 
mmap.read_byte should return on py3k?" started by Hirokazu Yamamoto, related 
to issue #5391.

Short history:
 - r55109: Guido changes 'c' format to accept unicode (struni branch).
   getarg('c') => char accepts byte and character
 - r56044: walter.doerwald changes the 'c' format to return an int (a 
   unicode character) for datetime.datetime.isoformat().
   getarg('c') => int accepts byte and character
 - r56140: reverts r56044 and creates the 'C' format
   getarg('c') => char accepts byte and character
   getarg('C') => int accepts byte and character

So we have:
 - getarg('c') -> one byte (integer in [0; 255])
 - getarg('C') -> one character (code in [0; INT_MAX])
   Note: why not use Py_UNICODE instead of int?

Usage of "C" format:
  datetime.datetime.isoformat(sep)
  array.array(type, data): type

Usage of "c" format:
  msvcrt.putch(char)
  msvcrt.ungetch(char)
  <mmap object>.write_byte(char)

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/