Re: [Python-Dev] Internationalization Toolkit

Fredrik Lundh wrote:
footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/
(and you can replace "unsigned short" with "whatever's suitable on this platform")
Surely using a different type on different platforms means that we throw away the concept of a platform independent Unicode string? I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. Does this mean that to transfer a file between a Windows box and Solaris, an implicit conversion has to be done to go from 16 bits to 32 bits (and vice versa)? What about byte ordering issues?
Or do you mean whatever 16 bit data type is available on the platform, with a standard (platform independent) byte ordering maintained?
Mike da S

Mike wrote:
so? the interchange format doesn't have to be the same as the internal format, does it?
no problem at all: unicode has special byte order marks for this purpose (and utf-8 doesn't care, of course).
Or do you mean whatever 16 bit data type is available on the platform, with a standard (platform independent) byte ordering maintained?
well, my preference is a 16-bit data type in the platform's native byte order (exactly how it's done in the unicode module -- for the moment, it can use the platform's wchar_t, but only if it happens to be a 16-bit unsigned type). gives you good performance, compact storage, and cleanest possible code.
...
anyway, I think it would help the discussion a little bit if people looked at (and played with) the existing code base. at least that'll change arguments like "but then we have to implement that" to "but then we have to maintain that code" ;-)
</F>
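To make the byte order mark point concrete, here is a minimal sketch in present-day Python. The `codecs` module and the codec names used here postdate this thread, so this illustrates the idea and is not the code under discussion. It assumes the generic "utf-16" codec, which reads a leading BOM to choose the byte order when decoding, and shows that UTF-8 output does not depend on byte order at all.

```python
import codecs

text = "caf\u00e9"

# Build the same text in both byte orders, each prefixed with its BOM.
be_data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le_data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic "utf-16" decoder inspects the BOM and picks the byte order,
# so either stream decodes correctly regardless of the local platform.
decoded_be = be_data.decode("utf-16")
decoded_le = le_data.decode("utf-16")

# UTF-8 is defined as a byte sequence, so there is no byte order to mark.
utf8_data = text.encode("utf-8")
```

The internal representation can therefore stay in native byte order; only the interchange bytes need a marker.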

/F writes
I second that. It is good enough for me (although my requirements aren't stringent) - it's been used on CE, so would slot directly into the win32 stuff. It is pretty much the consensus of the string-sig of last year, but as code!
The only "problem" with it is the code that hasn't been written yet, specifically:
* Encoders as streams, and a concrete proposal for them.
* Decent PyArg_ParseTuple support and Py_BuildValue support.
* The ord(), chr() stuff, and other stuff around the edges no doubt.
Couldn't we start with Fredrik's implementation, and see how the rest turns out? Even if we do choose to change the underlying Unicode implementation to use a different native encoding, the interface to the PyUnicode_Type would remain pretty similar. The advantage is that we have something now to start working with for the rest of the support we need.
Mark.

On Fri, 12 Nov 1999, Mark Hammond wrote:
I agree with "start with" here, and will go one step further (which Mark may have implied) -- *check in* Fredrik's code. Cheers, -g -- Greg Stein, http://www.lyra.org/

Fredrik Lundh wrote:
The interchange format (marshal + pickle) is defined as UTF-8, so there's no problem with endianness or missing bits w/r to shipping Unicode data from one platform to another.
Access to this mark will go into sys: sys.bom.
The 0.4 proposal fixes this to 16-bit unsigned short using UTF-16 encoding with checks for surrogates. This covers all defined standard Unicode character points, is fast, etc. pp...
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
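For readers unfamiliar with the "checks for surrogates" remark, the following is a sketch of the standard UTF-16 surrogate arithmetic, written in present-day Python. The function names are hypothetical and are not taken from the 0.4 proposal; only the arithmetic itself is defined by UTF-16.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

A 16-bit storage unit can thus represent code points beyond U+FFFF as two units, at the cost of having to detect the pair when indexing or slicing.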

M.-A. Lemburg writes:
Access to this mark will go into sys: sys.bom.
Can the name in sys be a little more descriptive? sys.byte_order_mark would be reasonable. I think that a support module (possibly unicodec) should provide constants for all four byte order marks as strings (2- & 4-byte, little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, etc. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
The abbreviation BOM is quite common w/r to Unicode.
Good idea... sys.bom should return the byte order mark (BOM) for the format used internally. The unicodec module should provide symbols for all possible values of this variable:
BOM_BE: '\376\377' (corresponds to Unicode 0x0000FEFF in UTF-16 == ZERO WIDTH NO-BREAK SPACE)
BOM_LE: '\377\376' (corresponds to Unicode 0x0000FFFE in UTF-16 == illegal Unicode character)
BOM4_BE: '\000\000\376\377' (corresponds to Unicode 0x0000FEFF in UCS-4)
BOM4_LE: '\377\376\000\000' (the byte-swapped form of 0x0000FEFF in UCS-4)
Note that Unicode sees big endian byte order as being "correct". The swapped order is taken to be an indicator for a "wrong" format, hence the illegal character definition.
-- Marc-Andre Lemburg
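As a cross-check on the four byte sequences, here is a sketch that derives them by encoding U+FEFF rather than typing the octal escapes by hand. It uses present-day Python, whose `codecs` module provides comparable constants under different names (`BOM_UTF16_BE` and friends); the dictionary keys below simply reuse the names proposed in this message and are not an existing API.

```python
import codecs

BOM_CHAR = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the byte order mark

# Derive each mark by encoding U+FEFF in a fixed byte order.
boms = {
    "BOM_BE": BOM_CHAR.encode("utf-16-be"),
    "BOM_LE": BOM_CHAR.encode("utf-16-le"),
    "BOM4_BE": BOM_CHAR.encode("utf-32-be"),
    "BOM4_LE": BOM_CHAR.encode("utf-32-le"),
}


def as_octal_escapes(data):
    """Render bytes in the octal-escape notation used in this thread."""
    return "".join("\\%03o" % byte for byte in data)


for name, value in boms.items():
    print(name, as_octal_escapes(value))
```

The derived values agree with the constants the standard library ships today, which is a convenient way to catch transposed bytes.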

M.-A. Lemburg writes:
The abbreviation BOM is quite common w/r to Unicode.
Yes: "w/r to Unicode". In sys, it's out of context and should receive a more descriptive name. I think using BOM in unicodec is good.
I'd also add BOM to be the same as sys.byte_order_mark. Perhaps even instead of sys.byte_order_mark (just to localize the areas of code that are affected).
Note that Unicode sees big endian byte order as being "correct". The
A lot of us do. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
Guido proposed to add it to sys. I originally had it defined in unicodec. Perhaps a sys.endian would be more appropriate for sys, with values 'little' and 'big' or '<' and '>' to conform to the struct module. unicodec could then define unicodec.bom depending on the setting in sys.
-- Marc-Andre Lemburg
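A sketch of the split suggested here, with the endianness flag in sys and the BOM derived from it elsewhere. It assumes present-day Python, where the flag is spelled `sys.byteorder` (not `sys.endian`) and takes the values 'little' and 'big'. The helper name `native_bom` is hypothetical.

```python
import codecs
import sys


def native_bom(byteorder=sys.byteorder):
    """Return the 2-byte BOM matching the given byte order name."""
    if byteorder == "little":
        return codecs.BOM_UTF16_LE
    if byteorder == "big":
        return codecs.BOM_UTF16_BE
    raise ValueError("unknown byte order: %r" % (byteorder,))


# What a module-level constant such as unicodec.bom could hold.
bom = native_bom()
```

This keeps sys free of Unicode-specific names while still letting a codec module compute its marker from one platform setting.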

M.-A. Lemburg writes:
Guido proposed to add it to sys. I originally had it defined in unicodec.
Well, he clearly didn't ask me! ;-)
This seems more reasonable, though I'd go with BOM instead of bom. But that's a style issue, so not so important. If you write bom, I'll write bom. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote:
M.-A. Lemburg writes:
The abbreviation BOM is quite common w/r to Unicode.
True.
I agree and believe that we can avoid putting it into sys altogether.
Are you sure about that interpretation? I thought the BOM characters (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.
### unicodec.py ###
import struct
BOM = struct.pack('h', 0x0000FEFF)
BOM_BE = '\376\377'
...
If somebody needs the BOM, then they should go to unicodec.py (or some other module). I do not believe we need to put that stuff into the sys module. It is just too easy to create the value in Python.
Cheers, -g
p.s. to be pedantic, the pack() format could be '@h'
-- Greg Stein, http://www.lyra.org/
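A runnable variant of the struct approach for present-day Python, offered as a sketch and not as the proposed unicodec.py. One assumption differs from the snippet above: the unsigned format code 'H' is used, because current Python range-checks the signed 'h' code and rejects 0xFEFF as too large. The explicit '<' and '>' prefixes give the fixed-order marks, and '@' gives the native one.

```python
import struct
import sys

# Native byte order, as in the original suggestion (with '@' made explicit).
BOM = struct.pack("@H", 0xFEFF)

# Fixed byte orders, independent of the platform running the code.
BOM_BE = struct.pack(">H", 0xFEFF)
BOM_LE = struct.pack("<H", 0xFEFF)

# The native mark must match one of the fixed ones, chosen by sys.byteorder.
expected_native = BOM_LE if sys.byteorder == "little" else BOM_BE
```

Either way the point stands: the value is a one-liner to compute, so it does not need a dedicated entry in sys.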

[MAL]
[Greg Stein]
Are you sure about that interpretation? I thought the BOM characters (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.
I can't speak to MAL's degree of certainty <wink>, but he's right about this stuff. There is only one BOM character, U+FEFF, which is the zero-width no-break space. The byte-swapped form is not only reserved, it's guaranteed never to be assigned to a character.
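The distinction can be checked against the Unicode character database in present-day Python. This is a sketch using the standard `unicodedata` module, which did not exist in this form when the thread was written: U+FEFF has a character name, while the byte-swapped value U+FFFE has none and is reported as unassigned.

```python
import unicodedata

BOM_CODE = 0xFEFF

# Swap the two bytes of the BOM to get the value a reader with the
# opposite byte order would see.
swapped_code = int.from_bytes(BOM_CODE.to_bytes(2, "big"), "little")

bom_name = unicodedata.name(chr(BOM_CODE))

# U+FFFE has no name; passing a default avoids the ValueError.
swapped_name = unicodedata.name(chr(swapped_code), None)

# 'Cn' is the general category for unassigned code points and noncharacters.
swapped_category = unicodedata.category(chr(swapped_code))
```

Because the swapped value can never be a real character, a decoder that sees it at the start of a stream can safely conclude the data is in the other byte order.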

participants (7)
- Da Silva, Mike
- Fred L. Drake, Jr.
- Fredrik Lundh
- Greg Stein
- M.-A. Lemburg
- Mark Hammond
- Tim Peters