Python 2.5.1 ported to z/OS and EBCDIC
Hello. Based on Jean-Yves Mengant's work on previous versions, I have ported Python 2.5.1 to z/OS. A patch against the current svn head is attached to http://bugs.python.org/issue1298. The same patch should work with very few changes against pristine 2.5.1 sources. (The only failing hunk is for Modules/makesetup, and it is quite trivial.)

I have no opinion on whether the patch should eventually be incorporated into the main distribution. The port was motivated by internal reasons, and I'm merely offering it as a community service to anyone else who might be interested. If Jean-Yves wishes to distribute it from his z/OS page, that is fine with me. In general, anyone can do what they want with the patch, but please give credit.

I'll describe some of the porting issues below.

CHARACTER SETS
==============

The biggest difficulty with z/OS is of course the character set. There are lots of ASCII dependencies in Python code, and z/OS uses CP1047, an EBCDIC variant, which is utterly incompatible with ASCII.

There are two possible approaches in this situation. One is to keep using ASCII as the execution character set (and also as the default encoding of string objects), and to add conversion support everywhere we do text-based I/O, so that communication with the external world still happens in EBCDIC. This would have been feasible, since the z/OS C compiler does support ASCII as the execution character set. (The source character set would still remain EBCDIC, though. If you've ever wondered why the C standard makes a distinction between these two, here's a prime example of a situation where they differ.)

However, I decided against this approach. The I/O conversions would have been deeply magical, and would have required the classic "text mode vs. binary mode" cruft, which would be rather confusing. Instead, I followed Jean-Yves' example and kept Python as a "native" EBCDIC application: all 8-bit data is treated by default as EBCDIC everywhere.
This only required fixing various ASCII-specific bits in the code, e.g. stuff like this (in PyString_DecodeEscape):

- else if (c < ' ' || c >= 0x7f)
+ else if (!isprint((unsigned char) c))

Of course, this now allows unescaped printing of characters that are printable in the platform's encoding even if they wouldn't be printable in ASCII. I'm not sure if this is desirable or not. It would be simple to fix this so that only characters in the ASCII _character set_ are displayed verbatim.

A result of making strings EBCDIC-native is that it breaks any code that depends on string literals being in ASCII. This probably applies to most network protocol implementations written in Python. On the other hand, making string literals use ASCII would break code that does ordinary text processing on local files. Damned if you do, damned if you don't.

The real issue is that strings in Python are rather underspecified. String objects are really just octet sequences without any _inherent_ textual interpretation. This is apparent from the fact that strings are what are read from and written to binary files, and also what unicode strings are encoded to and decoded from. However, Python syntax allows specifying an octet sequence with a _character_ sequence (i.e. a string literal), and the relationship between the source characters and the resulting octets has been left implicit. So programmers aren't really encouraged to think about character set issues, and the end result is code that only works on a platform that uses ASCII everywhere.

Python already has the property that the meaning of a source file depends on its encoding: if I write a string literal with some latin-1 characters, the resulting octet sequence depends on whether my source was encoded in latin-1 or utf-8. I'm not sure if this is a good idea, but my approach with the z/OS port continues the tradition: when your source is in EBCDIC, the string literals get encoded in EBCDIC.
All this just shows that treating plain octet sequences as "strings" simply won't work in the long run. You have to have a separate type for _textual_ data (i.e. Unicode strings, in Python), and encode and decode between those and octet sequences using some _explicit_ encoding. Of course, all non-English-speaking people have been keenly aware of this for ages. The relative universality of ASCII is an exception amongst encodings rather than the norm. It's only reasonable that English text require the same attention to encodings as all the other languages.

UNICODE
-------

The biggest hurdle by far (at least LoC-wise) in the porting was Unicode. The code assumed that the execution character set was not only ASCII, but ISO-8859-1, since there was lots of casting back and forth between Py_UNICODE and char. I added the following conversion operations into unicodeobject.h:

#ifdef Py_CHARSET_ASCII
# define Py_UNICODE_FROM_CHAR(c) ((Py_UNICODE)(unsigned char)(c))
# define Py_UNICODE_AS_CHAR(u) (u < 0x80 ? (char)(unsigned char)(u) : '\0')
#else
# define Py_UNICODE_FROM_CHAR(c) _PyUnicode_FromChar(c)
# define Py_UNICODE_AS_CHAR(u) _PyUnicode_AsChar(u)
#endif

The Py_UNICODE_AS_CHAR operation maps a unicode character to a char in the execution character set's encoding, or to '\0' if it's not representable. On a non-ASCII platform, I used the simplest trick of all:

/* Map from ASCII codes to the platform's execution character set,
   or to '\0' if the corresponding character is not known. */
static const char unicode_ascii_table[128] =
    "\0\0\0\0\0\0\0\a\b\t\n\v\f\r\0\0"
    "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    " !\"#$%&'()*+,-./0123456789:;<=>?"
    "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
    "`abcdefghijklmnopqrstuvwxyz{|}~\0";

(This is reasonably portable, as all the printable ASCII characters except `, @ and $ are required by C to be present in any source or execution character set, and of those, Python requires all but $.)
This, and the corresponding reverse index, are good enough for all purposes in the Python core: converting unicode string literals into unicode objects, detecting special escape characters, and calculating digit values. It doesn't allow writing string or unicode literals that directly contain characters that don't exist in ASCII, though. But since such code wouldn't be portable across character sets anyway, this isn't much of a problem.

I also added a Lib/encodings/cp1047.py that does proper recoding outside the core. It was generated from jdk-1.5.0/CP1047.TXT (from http://haible.de/bruno/charsets/conversion-tables/CP1047.html). This map seems to best correspond to the actual conventions I have seen on a z/OS machine.

Now, strings and unicode seem to work together fairly well, even though the results may be a bit surprising to one used to ASCII and its extensions:
>>> ord('a')
129
Here 129 is the EBCDIC value of the letter 'a'. The unicode literal u'a', like all textual input, is itself represented in EBCDIC:
>>> map(ord, "u'a'")
[164, 125, 129, 125]
But when such a literal is parsed, the resulting unicode object has the correct value for the corresponding unicode character:
>>> ord(u'a')
97
And, of course, when this unicode literal is printed back or its repr is taken, it is again encoded to EBCDIC so it shows correctly:
>>> map(ord, repr(u'a'))
[164, 125, 129, 125]
This seems to me to be the Right Thing. As long as no exotic characters are used directly in the source, source can be translated between ASCII and EBCDIC so that strings and unicode strings retain their correct semantic character values, even though the encoding of the literals themselves differs. String objects have a platform-dependent encoding, but unicode objects behave the same everywhere.

One problem with this approach is that it is completely incompatible with Python's UTF-8 support. The parser assumes that utf-8 and latin-1 are supersets of the platform's native encoding, and this of course isn't true of EBCDIC. A consequence is that the z/OS port cannot support eval of unicode strings:
>>> eval('2+2')
4
>>> eval(u'2+2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    ^
SyntaxError: invalid syntax

This is because internally, evaluation of unicode strings is implemented by first encoding the unicode string as utf-8, and then trying to parse that. And this of course fails. This seems like a rather complicated and limited way of going about it. It would be much cleaner and more portable to first decode input into unicode by various means, and then to parse the unicode. Then unicode strings would be the ones that need no special processing. But this would require heavy changes to Python's parsing machinery, and I tried to keep my changes as minimal as possible for now.

PICKLING
--------

One more character set issue arose with pickling. The pickle protocols are a bit schizophrenic in the sense that they can't quite decide whether to be textual or binary protocols. A textual protocol should be readable, and recodable across platforms so as to preserve the semantic character values correctly, whereas a binary protocol should be based on specific octet values whose readability is not an issue. The original pickle protocol 0 can be seen either as a textual protocol (all the pickles are readable), or as a binary protocol (when characters get mapped to their corresponding octet values in ASCII). The other protocol versions, though extensions of protocol 0, are clearly binary, since the pickled data is at least partially specified as specific octet values.

Now, on an EBCDIC platform, it's impossible to have protocol 0 be textual while staying compatible with the other protocols. This is because e.g. the following opcodes get the same value if we let 'a' be textual (i.e. encoded in the host platform's encoding):

APPEND = 'a'     # append stack top to list below it
NEWOBJ = '\x81'  # build object by applying cls.__new__ to argtuple

In the end, for now, I made protocol 0 textual, and disabled support for protocol versions > 0 on non-ASCII platforms. This seems like the safest choice.
It's certainly possible to add support for the binary protocols and make them explicitly use ASCII, but that again would require non-trivial changes.

Incidentally, modified_EncodeRawUnicodeEscape in cPickle.c seems to be out of sync with the one in unicodeobject.c, in that it lacks support for Py_UNICODE_WIDE. Also, both versions generate a latin-1 string as output, which doesn't seem portable enough. My patch recodes characters in ASCII to the execution character set, and escapes everything else, even characters in the U+0080 to U+00FF range. (Strictly speaking, all the latin-1 characters happen to be representable in CP1047, but that is not something I think it's good to depend upon.)

INTEGER PARSING
---------------

There were quite a number of places where (hex) digits were parsed nonportably. I added the following to longobject.h, and used it:

PyAPI_FUNC(int) _PyLong_DigitValue(char c);

This resulted in some nice cleanups. From PyString_DecodeEscape:

-  unsigned int x = 0;
-  c = Py_CHARMASK(*s);
-  s++;
-  if (isdigit(c))
-      x = c - '0';
-  else if (islower(c))
-      x = 10 + c - 'a';
-  else
-      x = 10 + c - 'A';
-  x = x << 4;
-  c = Py_CHARMASK(*s);
-  s++;
-  if (isdigit(c))
-      x += c - '0';
-  else if (islower(c))
-      x += 10 + c - 'a';
-  else
-      x += 10 + c - 'A';
-  *p++ = x;
+  int xh = _PyLong_DigitValue(*s++);
+  int xl = _PyLong_DigitValue(*s++);
+  *p++ = Py_CHARMASK(xh * 16 + xl);
   break;

OTHER ISSUES
============

Most of the other changes are boring build-technical issues and tweaks to make things compile against z/OS's very spartan support for Unix-like facilities. I hard-coded various #ifdef __MVS__ bits here and there to make things compile. I guess these things should properly be checked by configure, but I'm not very good at autoconf magic, and besides, running configure takes _ages_ on the machine I'm using, so I wasn't inclined to tweak the scripts any more than I had to.

The dynamic loading support in dynload_mvs.c is verbatim from Jean-Yves' modifications.
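The idea behind _PyLong_DigitValue is to key the lookup on the digit characters themselves rather than doing arithmetic on code points, which only works when 'a' through 'f' are contiguous (true in ASCII, false in EBCDIC, where the lowercase letters sit in three separate runs). A hypothetical Python rendition of the same idea (the names digit_value and decode_hex_pair are mine, not from the patch):

```python
# Build the digit-value map from the digit characters themselves, so it
# works no matter what code points the execution character set assigns them.
_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
_DIGIT_VALUE = {c: v for v, c in enumerate(_DIGITS)}
_DIGIT_VALUE.update({c.upper(): v for v, c in enumerate(_DIGITS) if c.isalpha()})

def digit_value(c, base=16):
    """Return the value of digit c in the given base, or -1 if invalid."""
    v = _DIGIT_VALUE.get(c, -1)
    return v if 0 <= v < base else -1

# Decoding a \xHH escape then collapses to two lookups, as in the diff above:
def decode_hex_pair(s):
    xh, xl = digit_value(s[0]), digit_value(s[1])
    return xh * 16 + xl

assert digit_value("f") == 15
assert digit_value("G") == -1
assert decode_hex_pair("7f") == 0x7F
```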
I only cleaned Jean-Yves' dynamic loading code up a little. I have only tested this with --enable-shared (which does what --with-zdll did in Jean-Yves' version, i.e. enables shared libraries). Without shared libraries, the building of extensions may well fail because of some linkage tweaks in Lib/distutils/unixccompiler.py. I hope there is some way of deciding what to do depending on whether shared libraries are enabled or not.

One nasty difficulty was that the makefile implicitly assumes that shared libraries are named libpython2.x.dll only on Windows. However, they have that name on z/OS, too. I resolved this with a simple "case $(MACHDEP)" in the rule for building the library, but hopefully someone can come up with a prettier solution.

Various wrappers for external libraries are untested. It might certainly be possible to install zlib, libbz2, openssl and various other nifty libraries on z/OS and see if the Python wrappers work, but that is an undertaking I will pass on, at least for now.

Quite a number of tests fail simply because they assume that strings are encoded in ASCII. For instance, Lib/test/test_calendar.py fails because the expected result is:

result_2004_html = """
<?xml version="1.0" encoding="ascii"?>
...
"""

And the real result begins with:

<?xml version="1.0" encoding="cp1047"?>
...

There were so many of these kinds of failures that there may be some _actual_ problems among them that I've overlooked.

That is about all. Comments are welcome. I'd be especially interested in hearing if my patch works on any other machine besides the one I was using. :)

--
Lauri Alanko
Software Engineer
SSH Communications Security Corp
Valimotie 17, FI-00380, Helsinki, Finland
Mobile: +358-40-864-3037
Tel: +358-20-500-7000
Fax: +358-20-500-7001
http://www.ssh.com/