
Hi,

I forked the CPython repository to work on my "split unicodeobject.c" project: http://hg.python.org/sandbox/split-unicodeobject.c

The result is 10 files (including the existing unicodeobject.c):

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c
 14759 total

This is just a proposition (and work in progress). Everything can be changed :-)

"unicodenew.c" is not a good name. Content of this file may be moved somewhere else. Some files may be merged again if the separation is not justified. I don't like the "unicode" prefix for filenames, I would prefer a new directory.

--

Shorter files are easier to review and maintain. The compilation is faster if only one file is modified.

The MBCS codec requires windows.h. The whole unicodeobject.c includes it just for this codec. With the split, only unicodeoscodecs.c includes this file.

The MBCS codec also needs a "winver" variable. This variable is defined between the BLOOM filter and the unicode_result_unchanged() function. How can you explain how these things are sorted? Where should I add a new function or variable? With the split, the variable is now defined very close to where it is used. You don't have to scroll 7000 lines to see where it is used.

If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into one unique and short C file.

It was already possible to extend and maintain unicodeobject.c (some people proved it!), but it should now be much simpler with shorter files.

Note: unicodeobject.c also contains the huge stringlib library (4000 lines), which is shared with the bytes type.

--

* Objects/unicodeimpl.h
  Private macros and prototypes of private functions. Many unicode_xxx() functions have been renamed to _PyUnicode_xxx() so that they can be reused in different files. (A rough sketch of such a header follows this message.)
* Objects/unicodenew.c
  Functions to create a new Unicode string (PyUnicode_New), convert from/to UCS4 and wchar_t*, and resize a string. The ugly part of PEP 393.
* Objects/unicodeoperators.c
  find, replace, compare, split, fill, etc.
* Objects/unicodeobject.c
  The "str" type with all its methods, the _string module and the unicodeiter type.
* Objects/unicodeformat.c
  PyUnicode_FromFormat() and PyUnicode_Format()
* Objects/unicodecodecs.c
  Text codecs for Python Unicode strings:
  - PyUnicode_Decode()
  - PyUnicode_AsEncodedObject()
  - PyUnicode_DecodeUnicodeEscape()
  - PyUnicode_DecodeRawUnicodeEscape(), PyUnicode_AsRawUnicodeEscapeString()
  - _PyUnicode_DecodeUnicodeInternal()
  - PyUnicode_DecodeLatin1(), PyUnicode_AsLatin1String()
  - PyUnicode_AsASCIIString()
  - PyUnicode_EncodeDecimal()
  - many helpers for other codecs
  - ...
* Objects/unicodecharmap.c
  Character Mapping Codec:
  - PyUnicode_BuildEncodingMap()
  - PyUnicode_DecodeCharmap()
  - PyUnicode_AsCharmapString()
  - PyUnicode_Translate()
* Objects/unicodeoscodecs.c
  Operating system codecs: the MBCS codec and the locale (FS) codec => FS encode/decode.
* Objects/unicodeutfcodecs.c
  UTF-7/8/16/32 codecs and the ASCII decoder.
* Objects/unicodelegacy.c
  Legacy and deprecated Unicode APIs: the Py_UNICODE type.

Victor
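[For illustration, a rough sketch of what such a private header could contain, based only on the function names mentioned in this thread. The prototypes are assumptions, not the actual contents of Victor's Objects/unicodeimpl.h:]

    /* Hypothetical sketch of Objects/unicodeimpl.h: private helpers
       shared by the split unicode*.c files. The former static
       unicode_xxx() helpers get a _PyUnicode_ prefix so that they can
       be called across translation units. Signatures are guesses. */
    #ifndef Py_UNICODEIMPL_H
    #define Py_UNICODEIMPL_H

    #include "Python.h"

    /* "Fast" helpers that skip parameter checks and do the real work. */
    void _PyUnicode_FastCopyCharacters(PyObject *to, Py_ssize_t to_start,
                                       PyObject *from, Py_ssize_t from_start,
                                       Py_ssize_t how_many);
    void _PyUnicode_FastFill(PyObject *unicode, Py_ssize_t start,
                             Py_ssize_t length, Py_UCS4 fill_char);

    /* Optimistic helpers used by codecs when the exact output length
       and maximum character are expensive to compute up front. */
    PyObject *_PyUnicode_Widen(PyObject *unicode, Py_UCS4 maxchar);
    int _PyUnicode_Putchar(PyObject **p_unicode, Py_ssize_t *pos, Py_UCS4 ch);

    #endif /* Py_UNICODEIMPL_H */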

2012/10/22 Victor Stinner <victor.stinner@gmail.com>:
Hi,
I forked the CPython repository to work on my "split unicodeobject.c" project: http://hg.python.org/sandbox/split-unicodeobject.c
The result is 10 files (including the existing unicodeobject.c):

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c
 14759 total
This is just a proposition (and work in progress). Everything can be changed :-)
"unicodenew.c" is not a good name. Content of this file may be moved somewhere else.
Some files may be merged again if the separation is not justified.
I don't like the "unicode" prefix for filenames, I would prefer a new directory.
--
Shorter files are easier to review and maintain. The compilation is faster if only one file is modified.
The MBCS codec requires windows.h. The whole unicodeobject.c includes it just for this codec. With the split, only unicodeoscodecs.c includes this file.
The MBCS codec also needs a "winver" variable. This variable is defined between the BLOOM filter and the unicode_result_unchanged() function. How can you explain how these things are sorted? Where should I add a new function or variable? With the split, the variable is now defined very close to where it is used. You don't have to scroll 7000 lines to see where it is used.
If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into one unique and short C file.
It was already possible to extend and maintain unicodeobject.c (some people proved it!), but it should now be much simpler with shorter files.
I would like to repeat my opposition to splitting unicodeobject.c. I don't think the benefits of such a split have been well justified, certainly not to the point that the claim about "much simpler" maintenance is true. -- Regards, Benjamin

On 23.10.2012 10:22, Benjamin Peterson wrote:
2012/10/22 Victor Stinner <victor.stinner@gmail.com>:
Hi,
I forked the CPython repository to work on my "split unicodeobject.c" project: http://hg.python.org/sandbox/split-unicodeobject.c
The result is 10 files (including the existing unicodeobject.c):

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c
 14759 total
This is just a proposition (and work in progress). Everything can be changed :-)
"unicodenew.c" is not a good name. Content of this file may be moved somewhere else.
Some files may be merged again if the separation is not justified.
I don't like the "unicode" prefix for filenames, I would prefer a new directory.
--
Shorter files are easier to review and maintain. The compilation is faster if only one file is modified.
The MBCS codec requires windows.h. The whole unicodeobject.c includes it just for this codec. With the split, only unicodeoscodecs.c includes this file.
The MBCS codec also needs a "winver" variable. This variable is defined between the BLOOM filter and the unicode_result_unchanged() function. How can you explain how these things are sorted? Where should I add a new function or variable? With the split, the variable is now defined very close to where it is used. You don't have to scroll 7000 lines to see where it is used.
If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into one unique and short C file.
It was already possible to extend and maintain unicodeobject.c (some people proved it!), but it should now be much simpler with shorter files.
I would like to repeat my opposition to splitting unicodeobject.c. I don't think the benefits of such a split have been well justified, certainly not to the point that the claim about "much simpler" maintenance is true.
Same feelings here.

If you do go ahead with such a split, please only split the source files and keep the unicodeobject.c file, which then includes all the other files. Such a restructuring should not result in compilers no longer being able to optimize code by inlining functions in one of the most important basic types we have in Python 3.

Also note that splitting the file into multiple smaller ones will actually create more maintenance overhead, since patches will likely no longer be easy to merge from 3.3 to 3.4.

BTW: The positive effect of having everything in one file is that you no longer have to figure out which file to look in when trying to find a piece of logic... it's just a ctrl-f or ctrl-s away :-)

-- Marc-Andre Lemburg, eGenix.com
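[For concreteness, the aggregation suggested here might look like the sketch below. The file names come from Victor's branch; the exact arrangement is an assumption, not code from the sandbox:]

    /* Objects/unicodeobject.c -- "split on disk, compile as one unit".
       The included pieces contain mostly static functions; because they
       are textually included, compilers can still inline across them. */
    #include "Python.h"
    #include "unicodeimpl.h"       /* shared private prototypes */

    #include "unicodenew.c"        /* allocation, resizing, PEP 393 */
    #include "unicodeoperators.c"  /* find, replace, compare, split... */
    #include "unicodecodecs.c"     /* generic codec machinery */
    #include "unicodeutfcodecs.c"  /* UTF-7/8/16/32 and ASCII decoder */
    #include "unicodecharmap.c"    /* character mapping codec */
    #include "unicodeoscodecs.c"   /* MBCS and locale (FS) codecs */
    #include "unicodeformat.c"     /* PyUnicode_FromFormat/Format */
    #include "unicodelegacy.c"     /* deprecated Py_UNICODE APIs */

[This is essentially how unicodeobject.c already pulls in the stringlib templates, so the technique has precedent in the code base.]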

Such a restructuring should not result in compilers no longer being able to optimize code by inlining functions in one of the most important basic types we have in Python 3.
I agree that performance is important. But I'm not convinced that moving functions has a real impact on performance, nor that such issues cannot be fixed. I tried to limit changes impacting performance.

Inlining is (only?) interesting for short functions. PEP 393 introduces many macros for this. I also added some "Fast" functions (_PyUnicode_FastCopyCharacters() and _PyUnicode_FastFill()) which don't check parameters and do the real work. I don't think that it's really useful to inline _PyUnicode_FastFill() in the caller, for example.

I will check the performance of all str methods. For example, str.count() is now calling PyUnicode_Count() instead of the static count(). PyUnicode_Count() adds some extra checks, some of which are not necessary, and it's not a static function, so it cannot(?) be inlined. But I bet that the overhead is really low.

Note: Since GCC 4.5, link-time optimization is possible. I don't know if GCC is able to inline functions defined in different files, but C compilers get better with each release.

--

I will check the performance impact on _PyUnicode_Widen() and _PyUnicode_Putchar(), which are no longer static. _PyUnicode_Widen() and _PyUnicode_Putchar() are used in Unicode codecs when it's more expensive to compute the exact length and maximum character of the output string. These functions are optimistic (they hope that the output will not grow too much and that the string will not be "widened" too many times, so they should be faster for ASCII).

I implemented a similar approach in my PyUnicodeWriter API, and I plan to reuse this API to simplify the code. PyUnicodeWriter uses some macros to limit the overhead of having to check, before each write, whether the internal buffer needs to be enlarged or widened, and it allows writing directly into the buffer using low-level functions like PyUnicode_WRITE. I also hope for a performance improvement, because the PyUnicodeWriter API can overallocate the internal buffer to reduce the number of calls to realloc() (which is usually slow).
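[A rough illustration of this optimistic strategy, with made-up names -- this is the shape of the idea, not the actual PyUnicodeWriter code: assume a short result, overallocate on growth, and widen only when a bigger character actually shows up.]

    /* Hypothetical sketch of an optimistic writer. */
    typedef struct {
        PyObject *buffer;    /* PyUnicode object being filled */
        Py_ssize_t pos;      /* next write position */
        Py_ssize_t size;     /* allocated length */
        Py_UCS4 maxchar;     /* maximum character seen so far */
    } writer_t;

    static int
    writer_putchar(writer_t *w, Py_UCS4 ch)
    {
        if (w->pos >= w->size) {
            /* Overallocate by 25% to amortize realloc() calls. */
            Py_ssize_t newsize = w->size + w->size / 4 + 1;
            if (PyUnicode_Resize(&w->buffer, newsize) < 0)
                return -1;
            w->size = newsize;
        }
        if (ch > w->maxchar) {
            /* Rare case: copy into a wider string (UCS1 -> UCS2 -> UCS4).
               _PyUnicode_Widen() here stands in for whatever helper does
               the copy; assume it returns a new reference. */
            PyObject *wider = _PyUnicode_Widen(w->buffer, ch);
            if (wider == NULL)
                return -1;
            Py_DECREF(w->buffer);
            w->buffer = wider;
            w->maxchar = ch;
        }
        PyUnicode_WRITE(PyUnicode_KIND(w->buffer),
                        PyUnicode_DATA(w->buffer), w->pos++, ch);
        return 0;
    }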
Also note that splitting the file into multiple smaller ones will actually create more maintenance overhead, since patches will likely no longer be easy to merge from 3.3 to 3.4.
I'm a candidate to maintain unicodeobject.c. If you check unicodeobject.c's (recent) history, I'm one of the most active developers on this file over the last two years (especially in 2012). I'm not sure that merges on this file are so hard. Victor

Le 23/10/2012 12:05, Victor Stinner a écrit :
Such a restructuring should not result in compilers no longer being able to optimize code by inlining functions in one of the most important basic types we have in Python 3.
I agree that performance is important. But I'm not convinced that moving functions has a real impact on performance, nor that such issues cannot be fixed.
I agree with Marc-André, there's no point in compiling those files separately. #include'ing them in the master unicodeobject.c file is fine. Regards Antoine.

2012/10/23 Antoine Pitrou <solipsis@pitrou.net>:
I agree with Marc-André, there's no point in compiling those files separately. #include'ing them in the master unicodeobject.c file is fine.
I also find unicodeobject.c difficult to navigate. Even if we don't split the file, I'd advocate a better presentation of its content. Could we at least have clear sections, with titles and descriptions? And use the ^L page separator for Emacs users? Code in posixmodule.c could also benefit from a better layout. -- Amaury Forgeot d'Arc
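[For illustration, a section banner of the kind suggested here might look like the sketch below. The section name is a made-up example; the form feed character, displayed as ^L in many editors, is what Emacs page-motion commands such as C-x ] jump between:]

    /* A literal form feed (^L) would go on the line below this comment,
       so that Emacs page commands treat it as a section boundary. */

    /* ===================================================================
       Section: charmap codec
       The 18 functions and 2 types of the character mapping codec:
       PyUnicode_DecodeCharmap(), PyUnicode_AsCharmapString(), ...
       =================================================================== */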

On 10/23/2012 10:22 AM, Benjamin Peterson wrote:
2012/10/22 Victor Stinner <victor.stinner@gmail.com>:
Hi,
I forked the CPython repository to work on my "split unicodeobject.c" project: http://hg.python.org/sandbox/split-unicodeobject.c
The result is 10 files (including the existing unicodeobject.c):

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c
 14759 total
This is just a proposition (and work in progress). Everything can be changed :-)
"unicodenew.c" is not a good name. Content of this file may be moved somewhere else.
Some files may be merged again if the separation is not justified.
I don't like the "unicode" prefix for filenames, I would prefer a new directory.
--
Shorter files are easier to review and maintain. The compilation is faster if only one file is modified.
The MBCS codec requires windows.h. The whole unicodeobject.c includes it just for this codec. With the split, only unicodeoscodecs.c includes this file.
The MBCS codec also needs a "winver" variable. This variable is defined between the BLOOM filter and the unicode_result_unchanged() function. How can you explain how these things are sorted? Where should I add a new function or variable? With the split, the variable is now defined very close to where it is used. You don't have to scroll 7000 lines to see where it is used.
If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into one unique and short C file.
It was already possible to extend and maintain unicodeobject.c (some people proved it!), but it should now be much simpler with shorter files.
I would like to repeat my opposition to splitting unicodeobject.c. I don't think the benefits of such a split have been well justified, certainly not to the point that the claim about "much simpler" maintenance is true.
I agree. I haven't edited much in unicodeobject.c lately, so this is just an expression of my preference in general to keep things together.

We tell new Python programmers to stop worrying about using indentation for grouping because editors are meant to make this easy. A similar argument applies to navigating large files: with a decent editor there is no real problem with large files. I agree completely with suggestions to improve sectioning and/or comments within the file.

But once you make any split, people will look for things in the wrong file. It happens to me every time I look for something in either object.c or abstract.c -- that's an instance where the function name prefix doesn't imply the implementation file name, which is otherwise very clear and easy in the Python sources. Especially since you're suggesting a huge number of new files, I question the argument of better navigability.

Georg

BTW:
If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into one unique and short C file.
After opening the right file, I *still* use the search function to get to the function I want to edit. Don't tell me using a scroll bar to scan for the right place is faster...

On 10/23/2012 09:29 AM, Georg Brandl wrote:
Especially since you're suggesting a huge number of new files, I question the argument of better navigability.
FWIW I'm -1 on it too. I don't see what the big deal is with "large" source files. If you have difficulty finding your way around unicodeobject.c, that seems more like a tooling issue to me, not a source code structural issue. /arry

On Oct 25, 2012 2:06 AM, "Larry Hastings" <larry@hastings.org> wrote:
On 10/23/2012 09:29 AM, Georg Brandl wrote:
Especially since you're suggesting a huge number of new files, I question the argument of better navigability.

FWIW I'm -1 on it too. I don't see what the big deal is with "large" source files. If you have difficulty finding your way around unicodeobject.c, that seems more like a tooling issue to me, not a source code structural issue.

OK, I need to weigh in after seeing this kind of reply.

Large source files are discouraged in general because they're a code smell that points strongly towards a *lack of modularity* within a *complex piece of functionality*. Breaking such files up into separately compiled modules serves two purposes:

1. It proves that the code *isn't* a tangled monolithic mess;
2. It enlists the compilation toolchain's assistance in ensuring that remains the case in the future.

I find complaints about the ease of searching within the file to be misguided and irrelevant, as I can just as easily reply with "if searching across multiple files is hard for you, use better tools, like grep, or 'Find in Files'". Note that I also consider the "pro" argument about better navigability inaccurate - the real gain is in *modularity*, making it clear to readers which parts can be understood and worked on separately from each other.

We are not special snowflakes - good software engineering practice is advisable for us as well, so a big +1 from me for breaking up the monstrosity that is unicodeobject.c and lowering the barrier to entry for hacking on the individual pieces. This should come with a large block comment in unicodeobject.c explaining how the pieces are put back together again.

However, -1 on the "faux modularity" idea of breaking up the files on disk, but still exposing them to the compiler and linker as a monolithic block, though. That would be completely missing the point of why large source files are bad.

Regards, Nick.

-- Sent from my phone, thus the relative brevity :)

On Oct 25, 2012, at 08:15 AM, Nick Coghlan wrote:
OK, I need to weigh in after seeing this kind of reply. Large source files are discouraged in general because they're a code smell that points strongly towards a *lack of modularity* within a *complex piece of functionality*.
Modularity is good, and the file system structure of the project should reflect that, but to be effective, it needs to be obvious. It's pretty obvious what's generally in intobject.c. I've worked with code bases where there's no rhyme nor reason as to what you'd find in a particular file, and this really hurts.

It hurts even with good tools. Remember that sometimes you don't even know what you're looking for, so search tools may not be very useful. For example, sometimes you want to understand how all the pieces fit together, what the holistic view of the subsystem is, or where the "entry points" are. Search tools are not very good at this, and if it's a subsystem you only interact with occasionally, having a file system organization that makes it easier to remember what you learned the last time you were there helps enormously.

Another point: rather than large files (or maybe in addition to them), large functions can also be painful to navigate. So just splitting a file into subfiles may not be the only modularity improvement you can make.

While I'm personally -0 about splitting up unicodeobject.c, if the folks advocating for it go ahead with it, I just ask that you do it very carefully, with an eye toward the casual and newbie reader of our code base.

Cheers, -Barry

On Thu, Oct 25, 2012 at 8:37 AM, Barry Warsaw <barry@python.org> wrote:
On Oct 25, 2012, at 08:15 AM, Nick Coghlan wrote:
OK, I need to weigh in after seeing this kind of reply. Large source files are discouraged in general because they're a code smell that points strongly towards a *lack of modularity* within a *complex piece of functionality*.
Modularity is good, and the file system structure of the project should reflect that, but to be effective, it needs to be obvious. It's pretty obvious what's generally in intobject.c. I've worked with code bases where there's no rhyme nor reason as to what you'd find in a particular file, and this really hurts.
It hurts even with good tools. Remember that sometimes you don't even know what you're looking for, so search tools may not be very useful. For example, sometimes you want to understand how all the pieces fit together, what the holistic view of the subsystem is, or where the "entry points" are. Search tools are not very good at this, and if it's a subsystem you only interact with occasionally, having a file system organization that makes things easier to remember what you learned the last time you were there helps enormously.
And if we were talking in the abstract, I think these would be reasonable concerns to bring up. However, Victor's proposed division *is* logical (especially if he goes down the path of a separate subdirectory, which will better support easy searching across all of the unicode object related files), and I conditioned my +1 with the requirement that a road map be provided in a leading block comment in unicodeobject.c.

speed.python.org is also making progress, and once that is up and running (which will happen well before any Python 3.4 release) it will be possible to compare the numbers between 3.3 and trunk to help determine the validity of any concerns regarding optimisations that can be performed within a module but not across modules.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le 25/10/2012 02:03, Nick Coghlan a écrit :
speed.python.org is also making progress, and once that is up and running (which will happen well before any Python 3.4 release) it will be possible to compare the numbers between 3.3 and trunk to help determine the validity of any concerns regarding optimisations that can be performed within a module but not across modules.
Nobody needs speed.python.org to run benchmarks before and after a specific change, though. Cloning http://hg.python.org/benchmarks and using the perf.py runner is everything that is needed. Moreover, you would want to run benchmarks *before* committing and pushing the changes. We don't want the huge splitting to be recorded and then backed out in the repository history. Regards Antoine.

Nick Coghlan writes:
OK, I need to weigh in after seeing this kind of reply. Large source files are discouraged in general because they're a code smell that points strongly towards a *lack of modularity* within a *complex piece of functionality*.
Sure, but large numbers of tiny source files are also a code smell, the smell of purist adherence to the literal principle of modularity without application of judgment. If you want to argue that the pragmatic point of view nevertheless is to break up the file, I can see that, but I think Victor is going too far. (Full disclosure dept.: the call graph of the Emacs equivalents is isomorphic to the Dungeon of Zork, so I may be a bit biased.) You really should speak to the question of "how many" and "what partition".
the real gain is in *modularity*, making it clear to readers which parts can be understood and worked on separately from each other.
Yeah, so which do you think they are? It seems to me that there are three modules to be carved out of unicodeobject.c:

1. The internal object management that is not exposed to Python: allocation, deallocation, and PEP 393 transformations.

2. The public interface to Python implementation: methods and properties, including operators.

3. Interaction with the outside world: codec implementations. But conceptually, these really don't have anything to do with the internal implementation of Unicode objects. They're just functions that convert bytes to Unicode and vice versa. In principle they can be written in terms of ord(), chr(), and bytes(). On the other hand, they're rather repetitive: "When you've seen one codec implementation, you've seen them all." I see no harm in grouping them in one file, and possibly a gain from proximity: casual passers-by might see refactorings that reduce redundancy. (A sketch of this repetitive codec shape follows this message.)

I'm not sure what to do with the charmap stuff. In current CPython head it seems incoherent to me: there's an IO codec, but there's also unicode-to-unicode stuff (PyUnicode_Translate). I haven't had time to look at Victor's reorganization to see what he actually did with it, but in terms of modularity, it seems to me that refactoring this stuff would be a real win, as opposed to splitting the files, which is a presentational improvement for the rest of the code, which is pretty modular.

As for Victor's proposal itself:

  1176 Objects/unicodecharmap.c
  1678 Objects/unicodecodecs.c
  1362 Objects/unicodeformat.c
   253 Objects/unicodeimpl.h
   733 Objects/unicodelegacy.c
  1836 Objects/unicodenew.c
  2777 Objects/unicodeobject.c
  2421 Objects/unicodeoperators.c
  1235 Objects/unicodeoscodecs.c
  1288 Objects/unicodeutfcodecs.c

As Victor himself admits, "unicodelegacy" and "unicodenew" are not descriptive of what they contain. In I18N discussions, "legacy" is usually a deprecatory reference to non-Unicode encodings, and from the name I would tend to guess that this file contains codecs. A better name might be "unicodedeprecated" (if what he really means is deprecated APIs).

I don't understand why splitting out "unicodeoperators" is a great idea; it's done nowhere else in CPython. If that makes sense, why not split out "unicodemethods" (for methods normally invoked explicitly rather than by syntax) too? N.B. For bytes, the corresponding file is spelled "bytes_methods".

"unicodecodecs" vs "unicodeutfcodecs": Say what? I would forever be looking in the wrong one. "unicodeoscodecs" suggests to me that these codecs are only usable on some OSes. If so, shouldn't the relevant OS be in the name? If not, the name is basically misleading IMO.

Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just an heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?

Steve
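[As an illustration of that "seen one, seen them all" shape, here is a toy Latin-1 decoder written against the public PEP 393 API. This is a simplified sketch for this discussion, not the optimized code in unicodeobject.c:]

    #include "Python.h"

    /* Toy Latin-1 decoder: every byte maps 1:1 to a code point.
       Most decoders share this shape: scan the input to determine the
       maximum character, allocate the result, then write code points. */
    static PyObject *
    decode_latin1_sketch(const char *s, Py_ssize_t size)
    {
        PyObject *unicode;
        Py_UCS4 maxchar = 127;
        Py_ssize_t i;
        int kind;
        void *data;

        for (i = 0; i < size; i++) {
            if ((unsigned char)s[i] > maxchar)
                maxchar = (unsigned char)s[i];
        }
        unicode = PyUnicode_New(size, maxchar);  /* PEP 393 allocation */
        if (unicode == NULL)
            return NULL;
        kind = PyUnicode_KIND(unicode);
        data = PyUnicode_DATA(unicode);
        for (i = 0; i < size; i++)
            PyUnicode_WRITE(kind, data, i, (unsigned char)s[i]);
        return unicode;
    }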

On Thu, Oct 25, 2012 at 2:22 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Nick Coghlan writes:
OK, I need to weigh in after seeing this kind of reply. Large source files are discouraged in general because they're a code smell that points strongly towards a *lack of modularity* within a *complex piece of functionality*.
Sure, but large numbers of tiny source files are also a code smell, the smell of purist adherence to the literal principle of modularity without application of judgment.
Absolutely. The classic example of this is Java's unfortunate insistence on only-one-public-top-level-class-per-file. Bleh.
If you want to argue that the pragmatic point of view nevertheless is to break up the file, I can see that, but I think Victor is going too far. (Full disclosure dept.: the call graph of the Emacs equivalents is isomorphic to the Dungeon of Zork, so I may be a bit biased.) You really should speak to the question of "how many" and "what partition".
Yes, I agree I was too hasty in calling the specifics of Victor's current proposal a good idea. What raised my ire was the raft of replies objecting to the refactoring *in principle* for completely specious reasons like being able to search within a single file instead of having to use tools that can search across multiple files. unicodeobject.c is too big, and should be restructured to make any natural modularity explicit, and provide an easier path for users that want to understand how the unicode implementation works.
the real gain is in *modularity*, making it clear to readers which parts can be understood and worked on separately from each other.
Yeah, so which do you think they are? It seems to me that there are three modules to be carved out of unicodeobject.c:
1. The internal object management that is not exposed to Python: allocation, deallocation, and PEP 393 transformations.
2. The public interface to Python implementation: methods and properties, including operators.
3. Interaction with the outside world: codec implementations. But conceptually, these really don't have anything to do with internal implementation of Unicode objects. They're just functions that convert bytes to Unicode and vice versa. In principle they can be written in terms of ord(), chr(), and bytes(). On the other hand, they're rather repetitive: "When you've seen one codec implementation, you've seen them all." I see no harm in grouping them in one file, and possibly a gain from proximity: casual passers-by might see refactorings that reduce redundancy.
I suspect you and Victor are in a much better position to thrash out the details than I am. It was the trend in the discussion to treat the question as "split or don't split?" rather than "how should we split it?" when a file that large should already contain some natural splitting points if the implementation isn't a tangled monolithic mess.
Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just an heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?
I believe it's a combination of history and whether or not they're needed by the interpreter during the bootstrapping process before the encodings namespace is importable. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 25.10.2012 08:42, Nick Coghlan wrote:
Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just an heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?
I believe it's a combination of history and whether or not they're needed by the interpreter during the bootstrapping process before the encodings namespace is importable.
They are in unicodeobject.c so that the compilers can inline the code in the various other places where they are used in the Unicode implementation directly as necessary, and because the codecs use a lot of functions from the Unicode API (obviously), so the other direction of inlining (Unicode API in codecs) is needed as well.

BTW: When discussing compiler optimizations, please remember that there are more compilers out there than just GCC, and also the fact that not everyone is using the latest and greatest version of it. Link-time inlining will usually not be as efficient as compile-time optimization, and we need every bit of performance we can get for Unicode in Python 3.

-- Marc-Andre Lemburg, eGenix.com
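[To make the inlining point concrete, a small sketch with hypothetical helper names: a static function defined in the same translation unit can be expanded at its call site, while a call through an extern prototype defined in another .c file generally stays a real call unless link-time optimization is used.]

    #include "Python.h"

    /* Same translation unit: the compiler sees the body and can inline. */
    static Py_UCS4
    ascii_upper(Py_UCS4 ch)                 /* hypothetical helper */
    {
        return (ch >= 'a' && ch <= 'z') ? ch - 32 : ch;
    }

    /* Defined in another .c file: without LTO the compiler only sees
       this prototype, so every use below pays for a function call. */
    extern Py_UCS4 _PyUnicode_AsciiUpper(Py_UCS4 ch);  /* hypothetical */

    static void
    upper_loop(int kind, void *data, Py_ssize_t length)
    {
        Py_ssize_t i;
        for (i = 0; i < length; i++) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            /* ascii_upper(ch) can collapse to a few instructions here;
               calling _PyUnicode_AsciiUpper(ch) instead would stay a
               call per character. */
            PyUnicode_WRITE(kind, data, i, ascii_upper(ch));
        }
    }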

On Thu, Oct 25, 2012 at 8:57 AM, M.-A. Lemburg <mal@egenix.com> wrote:
On 25.10.2012 08:42, Nick Coghlan wrote:
Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just an heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?
I believe it's a combination of history and whether or not they're needed by the interpreter during the bootstrapping process before the encodings namespace is importable.
They are in unicodeobject.c so that the compilers can inline the code in the various other places where they are used in the Unicode implementation directly as necessary and because the codecs use a lot of functions from the Unicode API (obviously), so the other direction of inlining (Unicode API in codecs) is needed as well.
I'm sorry to interrupt, but have you actually measured? What effect the lack of said inlining has on *any* benchmark is definitely beyond my ability to guess and I suspect is beyond the ability to guess of anyone else on this list. I challenge you to find a benchmark that is being significantly affected (>15%) with the split proposed by Victor. It does not even have to be a real-world one, although that would definitely buy it more credibility. Cheers, fijal

On 25.10.2012 11:18, Maciej Fijalkowski wrote:
On Thu, Oct 25, 2012 at 8:57 AM, M.-A. Lemburg <mal@egenix.com> wrote:
On 25.10.2012 08:42, Nick Coghlan wrote:
Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just an heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?
I believe it's a combination of history and whether or not they're needed by the interpreter during the bootstrapping process before the encodings namespace is importable.
They are in unicodeobject.c so that the compilers can inline the code in the various other places where they are used in the Unicode implementation directly as necessary and because the codecs use a lot of functions from the Unicode API (obviously), so the other direction of inlining (Unicode API in codecs) is needed as well.
I'm sorry to interrupt, but have you actually measured? What effect the lack of said inlining has on *any* benchmark is definitely beyond my ability to guess and I suspect is beyond the ability to guess of anyone else on this list.
I challenge you to find a benchmark that is being significantly affected (>15%) with the split proposed by Victor. It does not even have to be a real-world one, although that would definitely buy it more credibility.
I think you misunderstood. What I described is the reason for having the base codecs in unicodeobject.c.

I think we all agree that inlining has a positive effect on performance. The scale of the effect depends on the compiler and platform used. Victor already mentioned that he'll check the impact of his proposal, so let's wait for that.

-- Marc-Andre Lemburg, eGenix.com

I think you misunderstood. What I described is the reason for having the base codecs in unicodeobject.c.
I think we all agree that inlining has a positive effect on performance. The scale of the effect depends on the used compiler and platform.
Well. Inlining can have positive or negative effects, depending on various details. Too much inlining causes more cache misses for example. However, this is absolutely irrelevant if you don't create benchmarks and run them. Guessing is seriously not a very good optimization strategy. Cheers, fijal

On Thu, Oct 25, 2012 at 8:07 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
I think you misunderstood. What I described is the reason for having the base codecs in unicodeobject.c.
I think we all agree that inlining has a positive effect on performance. The scale of the effect depends on the used compiler and platform.
Well. Inlining can have positive or negative effects, depending on various details. Too much inlining causes more cache misses for example. However, this is absolutely irrelevant if you don't create benchmarks and run them. Guessing is seriously not a very good optimization strategy.
Yep, that's why I made the point that speed.python.org should be a going concern well before 3.4 release, and will be able to let us know if we have a problem relative to 3.3. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 25.10.12 12:49, M.-A. Lemburg wrote:
I think you misunderstood. What I described is the reason for having the base codecs in unicodeobject.c.
For example, PyUnicode_FromStringAndSize() and PyUnicode_FromString() are thin wrappers around PyUnicode_DecodeUTF8Stateful(). I think this is a reason to keep these functions together.
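[For illustration, the relationship looks roughly like the sketch below; a simplification, not the exact CPython source:]

    #include "Python.h"
    #include <string.h>

    /* PyUnicode_FromString() is essentially strlen() plus a call into
       the UTF-8 decoder, which is why the two naturally live together. */
    static PyObject *
    from_string_sketch(const char *u)
    {
        size_t size = strlen(u);
        if (size > PY_SSIZE_T_MAX) {
            PyErr_SetString(PyExc_OverflowError, "input too long");
            return NULL;
        }
        /* errors=NULL means strict; consumed=NULL means decode it all. */
        return PyUnicode_DecodeUTF8Stateful(u, (Py_ssize_t)size, NULL, NULL);
    }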

On 25.10.12 12:18, Maciej Fijalkowski wrote:
I challenge you to find a benchmark that is being significantly affected (>15%) with the split proposed by Victor. It does not even have to be a real-world one, although that would definitely buy it more credibility.
I see a 10% slowdown for UTF-8 decoding of UCS2 strings, but a 10% speedup for mostly-BMP UCS4 strings. For encoding, the situation is reversed (but up to +27%). Charmap decoding speeds up by 10-30%. GCC 4.4.3, 32-bit Linux. https://bitbucket.org/storchaka/cpython-stuff/src/default/bench

On 25.10.2012 08:42, Nick Coghlan wrote:
unicodeobject.c is too big, and should be restructured to make any natural modularity explicit, and provide an easier path for users that want to understand how the unicode implementation works.
You can also achieve that goal by structuring the code in unicodeobject.c in a more modular way. It is already structured in sections, but there's always room for improvement, of course.

As mentioned before, it is possible to split out various sections into separate .c or .h files which then get included in the main unicodeobject.c. If that's where consensus is going, I'm with Stephen here in that such a separation should be done in higher-level chunks, rather than creating >10 new files.

-- Marc-Andre Lemburg, eGenix.com

Le 25/10/2012 00:15, Nick Coghlan a écrit :
However, -1 on the "faux modularity" idea of breaking up the files on disk, but still exposing them to the compiler and linker as a monolithic block, though. That would be completely missing the point of why large source files are bad.
I disagree with you. Source files are meant to be read by humans, we don't really care whether the compiler has a modular view of the source code. Regards Antoine.

On 10/24/2012 03:15 PM, Nick Coghlan wrote:
Breaking such files up into separately compiled modules serves two purposes:
1. It proves that the code *isn't* a tangled monolithic mess; 2. It enlists the compilation toolchain's assistance in ensuring that remains the case in the future.
Either the code is a "tangled monolithic mess" or it isn't. If it is, then let's fix that, regardless of the size of the file. If it isn't, I don't see breaking up the code among multiple files as providing any benefit. And I see no need for the toolchain's assistance to help us do something without benefit. The line count of the file is essentially unrelated to its inherent quality / maintainability.
We are not special snow flakes - good software engineering practice is advisable for us as well, so a big +1 from me for breaking up the monstrosity that is unicodeobject.c and lowering the barrier to entry for hacking on the individual pieces. This should come with a large block comment in unicodeobject.c explaining how the pieces are put back together again.
I'm all for good software engineering practice. But can you cite objective reasons why large source files are provably bad? Not "tangled monolithic messes", not poorly-factored code. I agree that those are bad--but so far nobody has proposed that either of those is true about unicodeobject.c (unless you are implicitly doing so above), nor have they proposed credible remedies. All I've seen is that unicodeobject.c is a large file, and some people want to break it up into smaller files. I have yet to see anything but handwaving as justification. For example, what is this barrier to entry you suggest exists to hacking on the str object, that will apparently be dispelled simply by splitting one file into multiple files?

Someone proposed breaking up unicodeobject.c into three distinct subsystems and putting those in separate files. I still don't agree. It seems natural to me to have everything associated with the str object in one file, just as we do with every other object I can think of. If this were a genuinely good idea, we should consider doing it with every similar object. But nobody is proposing that. My guess is because the other files in CPython are "small enough". At which point we're right back to the primary motivation simply being the line count of unicodeobject.c, as a purely aesthetic and subjective judgment.

/arry

On Thu, 25 Oct 2012 08:13:53 -0700 Larry Hastings <larry@hastings.org> wrote:
I'm all for good software engineering practice. But can you cite objective reasons why large source files are provably bad? Not "tangled monolithic messes", not poorly-factored code. I agree that those are bad--but so far nobody has proposed that either of those is true about unicodeobject.c (unless you are implicitly doing so above)
Well, "tangled monolithic mess" is quite true about unicodeobject.c, IMO. Seriously, I agree with Victor: navigating around unicodeobject.c is a PITA. Perhaps it isn't if you are using emacs, or you have 35 fingers, or just a lot of spare time, but in my experience it's painful. Regards Antoine.

Antoine Pitrou writes:
Well, "tangled monolithic mess" is quite true about unicodeobject.c, IMO.
s/object.c// and your point remains valid. Just reading the table of contents for UTR#17 (http://www.unicode.org/reports/tr17/) should convince you that it's not going to be easy to produce an elegant implementation!
Seriously, I agree with Victor: navigating around unicodeobject.c is a PITA. Perhaps it isn't if you are using emacs, or you have 35 fingers, or just a lot of spare time, but in my experience it's painful.
Sure, but I don't know of a Unicode implementation which isn't. I don't think that having a unicode/*.[ch] with a dozen files (including the README etc.) in it is going to make it much more navigable. If there are too many files, it's going to be a PITA to maintain because there won't be an obvious place to put certain functions. Eg, I've already mentioned my suspicions about the charmap code (I apologize for not reading Victor's code to confirm them).

I don't object in principle to splitting unicodeobject.c. At the very least, with all due respect to MAL, XEmacs experience with coding systems (the Emacs equivalent of Python codecs) suggests that there is very little to be lost by moving the codec implementations to a separate file from the Unicode object implementation. (Here I'm talking about codecs in the narrow sense of wire-format to Python 3 str and back, not the more general Python 2 sense that included zip and base64 and so on. Ie, PyUnicode_Translate is not a codec in the relevant sense.)

On the other hand, I wouldn't be surprised if (despite my earlier suggestion) codecs and unicode object internals need a close relationship. (My intuition and sense of style says splitting codecs from the low-level memory management and PEP 393 stuff is a good idea, but I'm not confident it would have no impact on performance.)
participants (12)
- Amaury Forgeot d'Arc
- Antoine Pitrou
- Barry Warsaw
- Benjamin Peterson
- Georg Brandl
- Larry Hastings
- M.-A. Lemburg
- Maciej Fijalkowski
- Nick Coghlan
- Serhiy Storchaka
- Stephen J. Turnbull
- Victor Stinner