What type of object mmap.read_byte should return on py3k?

Hello. I noticed that mmap.read_byte returns a 1-length unicode string on py3k. This seemed strange to me, so I created an issue on the bug tracker (http://bugs.python.org/issue5391), and Martin suggested that it is suitable for discussion on python-dev. I'll quote the messages from the bug tracker here.
I wrote:
On Python3000, mmap.read_byte returns str not bytes, and mmap.write_byte accepts str. Is this intended behavior?
    >>> import mmap
    >>> m = mmap.mmap(-1, 10)
    >>> type(m.read_byte())
    <class 'str'>
    >>> m.write_byte("a")
    >>> m.write_byte(b"a")
Maybe there is another possibility: read_byte() returns an int that represents the byte, and write_byte accepts an int that represents the byte (just as b"abc"[0] returns an int, not a 1-length bytes object).
Martin wrote:
Indeed, I think it should use the "b" code, instead of the "c" code. Please discuss this on python-dev, though.
It might not be ok to backport this to 3.0, since it may break existing code.
Furthermore, all other uses of the "c" code might need to be reconsidered.
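The b"abc"[0] comparison in the message above can be checked directly. The following is a minimal sketch of how Python 3 bytes objects behave under indexing and slicing; the variable names are illustrative only and do not come from the thread.

```python
data = b"abc"

# Indexing a bytes object yields an int, not a 1-length bytes object.
first_by_index = data[0]

# Slicing yields a bytes object, even when the slice is one byte long.
first_by_slice = data[0:1]

# Converting between the two representations.
as_bytes = bytes([first_by_index])   # int -> 1-length bytes
as_int = first_by_slice[0]           # 1-length bytes -> int
```

An integer-based read_byte()/write_byte() would match the indexing behavior, while a 1-length bytes API would match the slicing behavior.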

It certainly seems like mmap should be playing in an all-bytes world (where only already encoded strings are allowed).
On the specific question of whether it would be better for read_byte()/write_byte() to use 1-length bytes objects or integers, I have no strong opinion (the former is closer to the 2.x class API, the latter more consistent with the operation of the 3.x bytes class).
However, as Martin says, it wouldn't be reasonable to backport the fixes in this to 3.0 - the associated API changes would almost certainly break otherwise working code.
Cheers, Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

It certainly seems like mmap should be playing in an all-bytes world (where only already encoded strings are allowed).
Agreed.
On the specific question of whether it would be better for read_byte()/write_byte() to use 1-length bytes objects or integers, I have no strong opinion (the former is closer to the 2.x class API, the latter more consistent with the operation of the 3.x bytes class).
Personally, I was surprised when I saw that b"0123"[1] != b"1". But I don't have a strong opinion either.
However, as Martin says, it wouldn't be reasonable to backport the fixes in this to 3.0 - the associated API changes would almost certainly break otherwise working code.
Agreed. I grepped the py3k source tree for "c" and found another Py_BuildValue("c" in the curses module. But that function returns unicode in its else clause, so this is probably correct usage.
    Modules\mmapmodule.c(207): return Py_BuildValue("c", value);
    Modules\_cursesmodule.c(893): return Py_BuildValue("c", rtn);
    Modules\_dbmmodule.c(380): else if ( strcmp(flags, "c") == 0 )
    Modules\_ctypes\cfield.c(112): if (idict->getfunc == getentry("c")->getfunc) {
    Modules\_ctypes\stgdict.c(459): if (dict->getfunc != getentry("c")->getfunc
    Modules\_ctypes\_ctypes.c(1372): if (itemdict->getfunc == getentry("c")->getfunc) {
    Modules\_ctypes\_ctypes.c(1536): if (dict && (dict->setfunc == getentry("c")->setfunc)) {
    Modules\_ctypes\_ctypes.c(1545): if (dict && (dict->setfunc == getentry("c")->setfunc)) {
    Modules\_ctypes\_ctypes.c(4197): if (itemdict->getfunc == getentry("c")->getfunc) {
    Modules\_ctypes\_ctypes.c(4890): if (itemdict->getfunc == getentry("c")->getfunc) {
    PC\os2emx\getpathp.c(128): strcat(filename, Py_OptimizeFlag ? "o" : "c");
    Python\import.c(1756): strcpy(buf+i, Py_OptimizeFlag ? "o" : "c");

On Saturday 28 February 2009 15:06:38, Hirokazu Yamamoto wrote:
I grepped the py3k source tree for "c" and found another Py_BuildValue("c" in the curses module. But that function returns unicode in its else clause, so this is probably correct usage.
I used a different regex to catch "...c..." with Py_BuildValue and PyArg_Parse..., because a function may have other arguments or specify the function name with "...:name": http://bugs.python.org/issue5391
It looks like msvcrt.putch(char) and msvcrt.ungetch(char) use the wrong types.
--
Victor Stinner aka haypo http://www.haypocalc.com/blog/

About m.read_byte(), we have two choices:
(a) Py_BuildValue("b", value) => 0
(b) Py_BuildValue("y#", &value, 1) => b"\x00"
About m.write_byte(x), we also have two choices:
(a) PyArg_ParseTuple(args, "b:write_byte", &value): write_byte(0)
(b) PyArg_ParseTuple(args, "y#:write_byte", &value, &length) and check for length=1: write_byte(b"\x00")
The (b) choices are close to the Python 2.x API. But we can already use m.read(1)->b"\x00" and m.write(b"\x00") to work with a byte string of 1 byte. So it would be better to break the API and use integers, i.e. the (a) choices, which also require documentation changes:
mmap.read_byte()
    Returns a string of length 1 containing the character at the current file position, and advances the file position by 1.
mmap.write_byte(byte)
    Write the single-character string byte into memory at the current position of the file pointer; the file position is advanced by 1. If the mmap was created with ACCESS_READ, then writing to it will throw a TypeError exception.
--
Victor Stinner aka haypo http://www.haypocalc.com/blog/
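From the Python side, the difference between the integer-based (a) choices and the existing read(1)/write() route can be sketched as follows. This is a minimal illustration, assuming a recent CPython 3 release in which read_byte() and write_byte() work with integers; the helper name roundtrip_byte is hypothetical and is not part of the patch or of the mmap module.

```python
import mmap


def roundtrip_byte(value):
    """Write one byte as an int, then read it back two ways.

    Hypothetical helper for illustration; assumes integer-based
    read_byte()/write_byte(), i.e. the (a) choices.
    """
    with mmap.mmap(-1, 10) as m:   # anonymous 10-byte mapping, as in the thread
        m.write_byte(value)        # (a): accepts an int in range(256)
        m.seek(0)
        as_int = m.read_byte()     # (a): returns an int
        m.seek(0)
        as_bytes = m.read(1)       # existing API: 1-length bytes object
        return as_int, as_bytes
```

Calling roundtrip_byte(0) gives the integer form from read_byte() and the bytes form from read(1) for the same stored byte, which is why the (b) choices would add little beyond what read(1) and write() already offer.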

Victor Stinner wrote:
About m.read_byte(), we have two choices:
(a) Py_BuildValue("b", value) => 0
(b) Py_BuildValue("y#", &value, 1) => b"\x00"
About m.write_byte(x), we have also two choices:
(a) PyArg_ParseTuple(args, "b:write_byte", &value): write_byte(0)
(b) PyArg_ParseTuple(args, "y#:write_byte", &value, &length) and check for length=1: write_byte(b"\x00")
(b) choices are close to Python 2.x API. But we can already use m.read(1)->b"\x00" and m.write(b"\x00") to use byte string of 1 byte. So it would be better to break the API and use integers, (a) choices which require also documentation changes:
I'm +1 for (a) because mmap.__getitem__ already returns an integer, not a 1-length bytes object. And as I wrote in http://bugs.python.org/msg82912, it seems that more bytes cleanup is needed in the mmap documentation/implementation. I hope someone else will look into the other modules. ;-)

I uploaded the patch with choice (a) http://bugs.python.org/file13215/py3k_mmap_and_bytes.patch If (b) is suitable, I'll rewrite the patch.

On Sun, Mar 1, 2009 at 10:45 AM, Hirokazu Yamamoto <ocean-city@m2.ccsnet.ne.jp> wrote:
I uploaded the patch with choice (a) http://bugs.python.org/file13215/py3k_mmap_and_bytes.patch If (b) is suitable, I'll rewrite the patch.
Has anyone been using mmap in python 3k to know what is more intuitive? When I was using mmap in python 2.4, I never used the read/write methods, I stuck with slicing, which was very convenient with 2.4 non-unicode strings. I don't really have an intuition on 3.x bytes. - Josiah
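For comparison with the slicing style described above, here is a minimal sketch of how slicing and indexing an mmap behave on Python 3, assuming a recent CPython 3 release. The variable names are illustrative only.

```python
import mmap

with mmap.mmap(-1, 10) as m:   # anonymous 10-byte mapping
    m[0:3] = b"abc"            # slice assignment takes a bytes object
    chunk = m[0:3]             # slicing returns bytes
    single = m[0]              # indexing returns an int, like bytes indexing
    m[0] = ord("z")            # item assignment takes an int
    after = m[0:3]
```

Slicing stays in bytes throughout, so code written in the 2.x slicing style mostly needs bytes literals in place of str literals; only single-item indexing switches to integers.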