[Python-Dev] Split unicodeobject.c into subfiles

Nick Coghlan ncoghlan at gmail.com
Thu Oct 25 08:42:55 CEST 2012


On Thu, Oct 25, 2012 at 2:22 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
>
>  > OK, I need to weigh in after seeing this kind of reply. Large source files
>  > are discouraged in general because they're a code smell that points
>  > strongly towards a *lack of modularity* within a *complex piece of
>  > functionality*.
>
> Sure, but large numbers of tiny source files are also a code smell,
> the smell of purist adherence to the literal principle of modularity
> without application of judgment.

Absolutely. The classic example of this is Java's unfortunate
insistence on only-one-public-top-level-class-per-file. Bleh.

> If you want to argue that the pragmatic point of view nevertheless is
> to break up the file, I can see that, but I think Victor is going too
> far.  (Full disclosure dept.: the call graph of the Emacs equivalents
> is isomorphic to the Dungeon of Zork, so I may be a bit biased.)  You
> really should speak to the question of "how many" and "what partition".

Yes, I agree I was too hasty in calling the specifics of Victor's
current proposal a good idea. What raised my ire was the raft of
replies objecting to the refactoring *in principle* for completely
specious reasons like being able to search within a single file
instead of having to use tools that can search across multiple files.

unicodeobject.c is too big, and should be restructured to make any
natural modularity explicit and to provide an easier path for users
who want to understand how the unicode implementation works.

>  > the real gain is in *modularity*, making it clear to readers which
>  > parts can be understood and worked on separately from each other.
>
> Yeah, so which do you think they are?  It seems to me that there are
> three modules to be carved out of unicodeobject.c:
>
> 1.  The internal object management that is not exposed to Python:
>     allocation, deallocation, and PEP 393 transformations.
>
> 2.  The public interface to Python implementation: methods and
>     properties, including operators.
>
> 3.  Interaction with the outside world: codec implementations.  But
>     conceptually, these really don't have anything to do with internal
>     implementation of Unicode objects.  They're just functions that
>     convert bytes to Unicode and vice versa.  In principle they can be
>     written in terms of ord(), chr(), and bytes().  On the other hand,
>     they're rather repetitive: "When you've seen one codec
>     implementation, you've seen them all."  I see no harm in grouping
>     them in one file, and possibly a gain from proximity: casual
>     passers-by might see refactorings that reduce redundancy.

I suspect you and Victor are in a much better position to thrash out
the details than I am. What bothered me was the trend in the
discussion towards treating the question as "split or don't split?"
rather than "how should we split it?", when a file that large should
already contain some natural splitting points if the implementation
isn't a tangled monolithic mess.
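
As an aside on Stephen's point 3, the "written in terms of ord(),
chr(), and bytes()" framing can be made concrete with a toy sketch.
This is purely illustrative Python (the toy_* helpers are made up for
the occasion, cover only the Latin-1 range, and omit error handling),
not how the real codecs are written: those are C functions in
unicodeobject.c that work on the internal buffers directly.

    def toy_latin1_encode(text):
        # Every code point <= 0xFF maps to the byte with the same value;
        # anything above that would need a real error handler.
        return bytes(ord(ch) for ch in text)

    def toy_latin1_decode(data):
        # Iterating over bytes yields ints, so chr() maps each one back.
        return "".join(chr(b) for b in data)

    assert toy_latin1_decode(toy_latin1_encode("café")) == "café"

That's roughly the conceptual content of a codec; most of what the C
versions add on top of that is speed and the error handling machinery.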

> Why are any of these codecs here in unicodeobjectland in the first
> place?  Sure, they're needed so that Python can find its own stuff,
> but in principle *any* codec could be needed.  Is it just a heuristic
> that the codecs needed for 99% of the world are here, and other codecs
> live in separate modules?

I believe it's a combination of history and whether or not they're
needed by the interpreter during the bootstrapping process before the
encodings namespace is importable.
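
A rough way to see that split from the Python level (a sketch of the
observable behaviour in current CPython, not a description of the
actual startup sequence): the bootstrap-critical codecs are reachable
through the built-in _codecs module, which calls straight into the C
implementations, while something like cp1252 is a pure Python module
under the encodings package and needs a working import system first.

    import sys
    import _codecs   # built-in C module, no pure Python import needed

    # UTF-8 decoding is reachable from C without importing any Python
    # module, which is the kind of thing the interpreter needs before
    # "import" is fully working.
    text, consumed = _codecs.utf_8_decode(b"caf\xc3\xa9")
    print(text, consumed)                      # café 5

    # cp1252 lives in Lib/encodings/cp1252.py and is only found once
    # the import machinery and the codec registry are up and running.
    print(b"caf\xe9".decode("cp1252"))         # café
    print("encodings.cp1252" in sys.modules)   # True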

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

