[Python-Dev] Split unicodeobject.c into subfiles

Thu Oct 25 06:22:03 CEST 2012

Nick Coghlan writes:

 > OK, I need to weigh in after seeing this kind of reply. Large source files
 > are discouraged in general because they're a code smell that points
 > strongly towards a *lack of modularity* within a *complex piece of
 > functionality*.

Sure, but large numbers of tiny source files are also a code smell,
the smell of purist adherence to the literal principle of modularity
without application of judgment.

If you want to argue that the pragmatic point of view nevertheless is
to break up the file, I can see that, but I think Victor is going too
far.  (Full disclosure dept.: the call graph of the Emacs equivalents
is isomorphic to the Dungeon of Zork, so I may be a bit biased.)  You
really should speak to the question of "how many" and "what partition".

 > the real gain is in *modularity*, making it clear to readers which
 > parts can be understood and worked on separately from each other.

Yeah, so which do you think they are?  It seems to me that there are
three modules to be carved out of unicodeobject.c:

1.  The internal object management that is not exposed to Python:
    allocation, deallocation, and PEP 393 transformations.

2.  The implementation of the public interface to Python: methods and
    properties, including operators.

3.  Interaction with the outside world: codec implementations.  But
    conceptually, these really don't have anything to do with internal
    implementation of Unicode objects.  They're just functions that
    convert bytes to Unicode and vice versa.  In principle they can be
    written in terms of ord(), chr(), and bytes().  On the other hand,
    they're rather repetitive: "When you've seen one codec
    implementation, you've seen them all."  I see no harm in grouping
    them in one file, and possibly a gain from proximity: casual
    passers-by might see refactorings that reduce redundancy.
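To make the "in principle" claim in item 3 concrete, here is a minimal sketch of a Latin-1 codec written only in terms of ord(), chr(), and bytes().  The function names latin1_encode and latin1_decode are made up for illustration; this is not how CPython implements the codec, just a demonstration that nothing about the internal representation of Unicode objects is needed.

```python
def latin1_encode(text):
    """Encode a str to bytes, one code point per byte (Latin-1)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            # Latin-1 can only represent code points U+0000..U+00FF.
            raise ValueError("cannot encode U+%04X in Latin-1" % cp)
        out.append(cp)
    return bytes(out)


def latin1_decode(data):
    """Decode bytes to str: each byte value is the code point itself."""
    return "".join(chr(b) for b in data)


encoded = latin1_encode("caf\u00e9")
decoded = latin1_decode(encoded)
```

Other codecs differ mainly in the mapping between code points and byte sequences, which is where the repetitiveness comes from.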

I'm not sure what to do with the charmap stuff.  In current CPython
head it seems incoherent to me: there's an IO codec, but there's also
unicode-to-unicode stuff (PyUnicode_Translate).  I haven't had time to
look at Victor's reorganization to see what he actually did with it,
but in terms of modularity, it seems to me that refactoring this stuff
would be a real win.  Splitting the files, by contrast, is only a
presentational improvement for the rest of the code, which is already
pretty modular.
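
The two roles of charmap can be seen from the Python level.  The sketch below is only an illustration of the distinction, not of how the C code is organized.  The decoding table is a hypothetical one invented for the example (identity, except that byte 0x80 maps to the euro sign); codecs.charmap_build, codecs.charmap_decode, and codecs.charmap_encode are the helpers the stdlib "encodings" modules use.

```python
import codecs

# Hypothetical 256-entry decoding table: identity mapping, except that
# byte 0x80 decodes to U+20AC (EURO SIGN).
decoding_table = "".join(chr(i) for i in range(256)).replace("\x80", "\u20ac")
encoding_table = codecs.charmap_build(decoding_table)

# Role 1: an IO codec, converting between bytes and str.
text, _ = codecs.charmap_decode(b"\x80 5", "strict", decoding_table)
data, _ = codecs.charmap_encode(text, "strict", encoding_table)

# Role 2: a unicode-to-unicode mapping (str.translate at the Python
# level), which involves no bytes at all.
translated = text.translate(str.maketrans({"\u20ac": "EUR"}))
```

Both are table lookups, but only the first is a codec in the bytes-to-Unicode sense.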

As for Victor's proposal itself:

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c

As Victor himself admits, "unicodelegacy" and "unicodenew" are not
descriptive of what they contain.  In I18N discussions, "legacy" is
usually a deprecatory reference to non-Unicode encodings, and from the
name I would tend to guess this file contains codecs.  A better name
might be "unicodedeprecated" (if what he really means is deprecated
APIs).

I don't understand why splitting out "unicodeoperators" is a great
idea; it's done nowhere else in CPython.  If that makes sense, why not
split out "unicodemethods" (for methods normally invoked explicitly
rather than by syntax) too?  N.B. For bytes, the corresponding file is
spelled "bytes_methods".

"unicodecodecs" vs "unicodeutfcodecs": Say what?  I would forever be
looking in the wrong one.

"unicodeoscodecs" suggests to me that these codecs are only usable on
some OSes.  If so, shouldn't the relevant OS be in the name?  If not,
the name is basically misleading IMO.

Why are any of these codecs here in unicodeobjectland in the first
place?  Sure, they're needed so that Python can find its own stuff,
but in principle *any* codec could be needed.  Is it just a heuristic
that the codecs needed for 99% of the world are here, and other codecs
live in separate modules?

Steve