Request for research feedback

Fulvio Valente fulvio.valente at strath.ac.uk
Thu Sep 15 17:55:13 CEST 2011


Hi, I am a research intern at the University of Strathclyde, currently on a summer research internship. I hope that this is not an inappropriate place to ask, but I am looking for participants willing to use and evaluate an application that was written as part of this internship. If you choose to participate, it should take no more than an hour or two of your time, unless you choose not to use the analysis files provided.

My project is to experiment with ways of inferring and displaying socio-technical information about a software project from a version control repository by creating a prototype application. The application infers authorship information in the form of "closeness scores" between authors and the files they have worked on by traversing the history of a Mercurial repository. This is then displayed as a map of socio-technical dependencies between files and the authors that worked on them and between authors and the files they have worked on, which should allow for easy visual comparisons of closeness.

Some potential applications and use cases I can envision in the program's current form include discovering who the key figures are (i.e. who to talk to about a problem) or what the bus factor (how many people need to be "hit by a bus" before a project is in trouble because there's nobody left who understands it) is for a project or a subsystem within it. Perhaps a little more depressingly, this may also be used to highlight potential cases of poor productivity to be investigated.
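As a rough illustration of the key-figure and bus-factor use cases, here is a minimal Python sketch. It is not taken from Jimmy (which is written in Java); the function names, the `{(author, file): score}` layout and the 50% threshold are all assumptions made for the example.

```python
from collections import defaultdict


def key_figures(scores, path_prefix=""):
    """Rank authors by their summed closeness over files under path_prefix.

    scores is assumed to map (author, file_path) -> closeness score.
    """
    totals = defaultdict(float)
    for (author, path), score in scores.items():
        if path.startswith(path_prefix):
            totals[author] += score
    # Highest total first; ties broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def bus_factor(scores, path_prefix="", threshold=0.5):
    """Hypothetical estimate: the smallest number of top-ranked authors
    whose combined closeness exceeds `threshold` of the total."""
    ranked = key_figures(scores, path_prefix)
    total = sum(score for _, score in ranked)
    if total == 0:
        return 0
    covered = 0.0
    for count, (_, score) in enumerate(ranked, start=1):
        covered += score
        if covered / total > threshold:
            return count
    return len(ranked)
```

Passing a directory prefix such as `"src/"` restricts both functions to one subsystem. Other definitions of bus factor are equally defensible; this one is only meant to show the kind of question the scores could answer.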

The program itself, Jimmy (for lack of a better name), has a binary distribution which can be found at http://geeksoc.org/~fvalente/jimmy/jimmy_r1.zip and only requires Java 7 and Mercurial (other libraries are included), so you can get started quickly. The source is available at https://bitbucket.org/fvalente/jimmy and if you wish to build it yourself, it depends on Guava r09 and JUNG 2.0.1 (api, algorithms, graph-impl, visualization), which itself depends on Apache Commons collections-generic 4.0.1.

To perform a basic analysis of a project, you can open a directory that's a Mercurial repository and it will just look at the list of commits and the files that changed, adding 1 to a score each time an author touches a file. This should only take a minute or two, even for large projects. If you have more time, you can do the more expensive diff stats analysis, which compares the size of each diff with the average diff size of the project, excluding empty diffs and binary changes. Unfortunately, the diff stats analysis is very slow because retrieving each diff requires spawning an hg subprocess (for reference, my 4-year-old quad-core machine can do only ~10,000 commits per hour). I don't have a progress UI yet, but progress status is sent to stdout when doing a diff stats analysis. Once analysis is complete, you can save the results and review them later using the "Open analysis results" option.
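The two scoring modes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Jimmy's actual code (which is Java): the input layouts, the function names and, in particular, the choice to weight each diff as `size / average size` are guesses at one plausible reading of "compares the size of each diff with the average".

```python
from collections import Counter, defaultdict


def basic_scores(commits):
    """Basic analysis: add 1 each time an author touches a file.

    commits is assumed to be an iterable of (author, [file_paths]),
    e.g. parsed from the repository's commit log.
    """
    scores = Counter()
    for author, files in commits:
        for path in set(files):  # count a file once per commit
            scores[(author, path)] += 1
    return scores


def diff_stat_scores(diffs):
    """Diff stats analysis (assumed formula): each diff contributes
    its size divided by the project's average diff size.

    diffs is assumed to be an iterable of (author, path, changed_lines),
    with changed_lines set to None for binary changes and 0 for empty diffs.
    """
    # Exclude empty diffs (0) and binary changes (None).
    usable = [(a, p, n) for a, p, n in diffs if n]
    if not usable:
        return {}
    average = sum(n for _, _, n in usable) / len(usable)
    scores = defaultdict(float)
    for author, path, size in usable:
        scores[(author, path)] += size / average
    return dict(scores)
```

Under this reading, an average-sized diff is worth the same 1 point as a touch in the basic analysis, so the two modes stay roughly comparable.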

To navigate the results you can switch to viewing a particular file or author of interest from the file and author lists on the left. For files, this will display that file as a node in the centre with the authors that have been involved with it as orbiting nodes. The connecting lines become shorter, thicker and closer to red as the closeness score between that author and the file increases. For authors, it is the same except the files they have worked on will be the orbiting nodes. You can also re-centre the display on an orbiting node by clicking it, rather than searching through the file or author lists. The display can be zoomed by using the scroll wheel and can be translated with the scroll bars or by dragging on an area not occupied by a node.
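The mapping from closeness score to line appearance can be sketched as a simple linear interpolation. This is not how Jimmy necessarily does it: the pixel ranges, the linear scaling and the blue starting colour are assumptions chosen for the example.

```python
def edge_style(score, max_score,
               min_length=50.0, max_length=300.0,
               min_width=1.0, max_width=8.0):
    """Return (length, width, (r, g, b)) for a connecting line.

    Higher scores give shorter, thicker, redder lines. All ranges are
    hypothetical defaults, not values taken from Jimmy.
    """
    if max_score <= 0:
        closeness = 0.0
    else:
        # Normalise to [0, 1] relative to the strongest relationship shown.
        closeness = min(max(score / max_score, 0.0), 1.0)
    length = max_length - closeness * (max_length - min_length)
    width = min_width + closeness * (max_width - min_width)
    red = int(255 * closeness)
    colour = (red, 0, 255 - red)  # assumed: blue when distant, red when close
    return length, width, colour
```

Normalising against the largest score in the current view keeps the closest relationship at full red regardless of the project's size.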

What I would like is for you to please run Jimmy on one or more Mercurial repositories of your choice and to give some feedback on it. Some questions I'd particularly like answered are:

* Do the closeness scores it produces match with your perception of the relationships between people and code in the project? (e.g. if you're looking at a particular file and some authors involved in it are shown as closer than others, is this the result you would have expected from a perfect version of Jimmy?)
* Does the visualisation of the scores substantially improve your ability to draw conclusions from the data compared to just reading a saved analysis (which is just plaintext)?
* If, hypothetically, you had no prior knowledge about the project, would using it help you to discover the key figures (e.g. maintainer, BDFL) behind the project or any of its subsystems? (Alternatively, do such people correctly show up as having touched a wider variety of files and with closer relations to them than other people?)
* If you were a manager would you be able to use it to discover potential productivity issues that you would then investigate further?

To save you from having to do a full analysis of a project, I have supplied analysis files from 3 open-source projects which you can open with the "Open analysis results" option:

* cpython: http://geeksoc.org/~fvalente/jimmy/cpython.txt
* libSDL: http://geeksoc.org/~fvalente/jimmy/libsdl.txt
* Go: http://geeksoc.org/~fvalente/jimmy/golang.txt

Some general suggestions on whether and why the current ways of inferring closeness scores and visualising that data are flawed would also be greatly appreciated, as well as potential avenues to explore for improving them. Suggestions I've already received include:

* Being able to collapse files into folders or subsystem groups to make larger projects more navigable, perhaps with autoexpansion when zooming the display. In its current form, Jimmy produces disappointing/funny results when you want to see a diagram for, say, a large project's maintainer.
* Being able to mine data from a subset of the repository (time range, revision range, include/exclude files/directories, etc.)
* Reducing the closeness score contributions of multiple commits made in quick succession, or another method of mitigating the bias in favour of fine-grained committers
* Reducing the closeness score contributions of older commits
* Interfacing with Mercurial via the recently introduced command server API, which should hopefully make performance non-abysmal
* Support for more version control systems. Git would top this list
* Perhaps the ability to see a timeline for the project and how closeness changes over time
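
Two of the suggestions above, damping bursts of commits and discounting older commits, could be sketched as follows. The one-hour window, the one-year half-life and the function names are all assumptions picked for illustration; neither is implemented in Jimmy as far as this message states.

```python
def decayed_weight(commit_time, now, half_life_days=365.0):
    """Weight of a commit that halves every `half_life_days`.

    Times are Unix timestamps in seconds; the half-life is a
    hypothetical default.
    """
    age_days = max(now - commit_time, 0) / 86400.0
    return 0.5 ** (age_days / half_life_days)


def burst_damped_scores(touches, window_seconds=3600):
    """Count touches, ignoring repeats made in quick succession.

    touches is assumed to be an iterable of (author, path, unix_time).
    A touch only counts if at least `window_seconds` have passed since
    the last counted touch of the same file by the same author.
    """
    scores = {}
    last_counted = {}
    for author, path, when in sorted(touches, key=lambda t: t[2]):
        key = (author, path)
        previous = last_counted.get(key)
        if previous is None or when - previous >= window_seconds:
            scores[key] = scores.get(key, 0) + 1
            last_counted[key] = when
    return scores
```

The two ideas combine naturally: instead of adding 1 for each counted touch, add `decayed_weight(when, now)` so that recent, well-spaced work dominates the score.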

Responses can be made privately to me, if you wish. For the purposes of my report, I will also anonymise all responses received in line with ethical best practices. Thank you for reading.


More information about the Python-list mailing list