[Python-Dev] Split unicodeobject.c into subfiles

Tue Oct 23 02:50:32 CEST 2012

Hi,

I forked the CPython repository to work on my "split unicodeobject.c" project:
http://hg.python.org/sandbox/split-unicodeobject.c

The result is 10 files (including the existing unicodeobject.c):

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c
 14759 total

This is just a proposal (and a work in progress). Everything can be changed :-)

"unicodenew.c" is not a good name. Content of this file may be moved
somewhere else.

Some files may be merged again if the separation is not justified.

I don't like the "unicode" prefix for filenames; I would prefer a new directory.

--

Shorter files are easier to review and maintain. Compilation is also
faster when only one file is modified, since only that file has to be
recompiled.

The MBCS codec requires windows.h. Today the whole of unicodeobject.c
includes it just for this codec. With the split, only unicodeoscodecs.c
includes this file.
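
A minimal sketch of how that isolation might look at the top of a file
such as unicodeoscodecs.c. It is an illustration, not code from the
sandbox repository: _WIN32 stands in for whatever platform macro CPython
really uses, and SKETCH_HAVE_MBCS and sketch_have_mbcs() are
hypothetical names.

```c
/* Top of a file such as Objects/unicodeoscodecs.c (sketch).
 * Assumption: _WIN32 is used so the sketch compiles outside the
 * CPython tree; the real code may test a different macro. */
#ifdef _WIN32
#  include <windows.h>          /* needed only by the MBCS codec */
#  define SKETCH_HAVE_MBCS 1
#else
#  define SKETCH_HAVE_MBCS 0
#endif

/* Hypothetical helper: reports whether the MBCS codec is compiled in. */
int sketch_have_mbcs(void)
{
    return SKETCH_HAVE_MBCS;
}
```

Because the include sits in a single translation unit, the other split
files never see the Windows header at all.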

The MBCS codec also needs a "winver" variable. This variable is
defined between the BLOOM filter and the unicode_result_unchanged()
function. How can you explain how these things are sorted? Where
should I add a new function or variable? With the split, the variable
is now defined very close to where it is used. You don't have to
scroll 7000 lines to see where it is used.

If you would like to work on a specific function, you don't have to
use the search function of your editor to skip thousands of lines. For
example, the 18 functions and 2 types related to the charmap codec are
now grouped into a single, short C file.

It was already possible to extend and maintain unicodeobject.c (some
people proved it!), but it should now be much simpler with shorter
files.

Note: unicodeobject.c also includes the huge stringlib library
(4000 lines), which is shared with the bytes type.

--

* Objects/unicodeimpl.h

Private macros and prototypes of private functions.

Many unicode_xxx() functions have been renamed to _PyUnicode_xxx() so
that they can be reused in different files.
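
The renaming pattern can be sketched in a single file as follows. This
is only an illustration: _PyUnicode_CountNonAscii() and
sketch_is_ascii() are hypothetical names, not functions from the
sandbox repository, and the comments mark where each piece would live
after a split.

```c
#include <assert.h>
#include <stddef.h>

/* --- would live in Objects/unicodeimpl.h ------------------------- */
/* Hypothetical helper. Before a split it could be declared
 * `static size_t unicode_count_non_ascii(...)` and be visible only
 * inside unicodeobject.c. */
size_t _PyUnicode_CountNonAscii(const unsigned char *s, size_t len);

/* --- would live in one of the split .c files --------------------- */
size_t _PyUnicode_CountNonAscii(const unsigned char *s, size_t len)
{
    size_t i, count = 0;
    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (s[i] > 127)          /* byte outside the ASCII range */
            count++;
    }
    return count;
}

/* --- a caller in another split file needs only the prototype ----- */
int sketch_is_ascii(const unsigned char *s, size_t len)
{
    return _PyUnicode_CountNonAscii(s, len) == 0;
}
```

Dropping `static` gives the function external linkage, and the leading
underscore in the new name signals that it is still private to the
implementation.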

* Objects/unicodenew.c

Functions to create a new Unicode string (PyUnicode_New), convert
from/to UCS4 and wchar_t*, and resize a string. The ugly part of
PEP 393.
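
For readers unfamiliar with PEP 393, here is a rough standalone sketch
of the idea behind these functions: a string is stored with 1, 2 or 4
bytes per character depending on its largest code point. The helpers
sketch_max_char() and sketch_kind_for_maxchar() are hypothetical and do
not use the real C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest code point in a UCS4 buffer (0 for an empty buffer). */
uint32_t sketch_max_char(const uint32_t *buf, size_t len)
{
    uint32_t maxchar = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] > maxchar)
            maxchar = buf[i];
    }
    return maxchar;
}

/* Bytes per character needed to store every code point <= maxchar. */
int sketch_kind_for_maxchar(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);   /* upper bound of Unicode */
    if (maxchar < 256)
        return 1;                  /* Latin-1 range */
    if (maxchar < 65536)
        return 2;                  /* fits in UCS2 */
    return 4;                      /* needs full UCS4 */
}
```

Converting from a UCS4 buffer then amounts to scanning for the maximum
character, allocating with the matching width, and copying each code
point into the narrower storage.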

* Objects/unicodeoperators.c

find, replace, compare, split, fill, etc.

* Objects/unicodeobject.c

"str" type with all methods, _string module and unicodeiter type.

* Objects/unicodeformat.c

PyUnicode_FromFormat() and PyUnicode_Format()

* Objects/unicodecodecs.c

Text codecs for Python Unicode strings:
   - PyUnicode_Decode()
   - PyUnicode_AsEncodedObject()
   - PyUnicode_DecodeUnicodeEscape()
   - PyUnicode_DecodeRawUnicodeEscape(), PyUnicode_AsRawUnicodeEscapeString()
   - _PyUnicode_DecodeUnicodeInternal()
   - PyUnicode_DecodeLatin1(), PyUnicode_AsLatin1String()
   - PyUnicode_AsASCIIString()
   - PyUnicode_EncodeDecimal()
   - many helpers for other codecs
   - ...

* Objects/unicodecharmap.c

Character Mapping Codec:
   - PyUnicode_BuildEncodingMap()
   - PyUnicode_DecodeCharmap()
   - PyUnicode_AsCharmapString()
   - PyUnicode_Translate()

* Objects/unicodeoscodecs.c

Operating system codecs: MBCS codec, locale (FS) codec => FS encode/decode.

* Objects/unicodeutfcodecs.c

UTF-7/8/16/32 codecs and ASCII decoder.

* Objects/unicodelegacy.c

Legacy and deprecated Unicode API: Py_UNICODE type.

Victor