[SoC2010-General] Proposal for porting RPy2 to Python 3
jergosh at gmail.com
Thu Apr 8 00:21:05 CEST 2010
My name is Greg Slodkowicz and I would like to apply for the RPy2 porting project. I am currently a student pursuing an MSc in Systems Biology, but my background is mainly in Computer Science. I have been in touch with RPy2's author, Laurent Gautier, and this application reflects our discussions about the scope of the project.
RPy2 is an interface between Python and the statistical package R. It allows accessing R's rich collection of libraries from Python. RPy2 is mainly used in bioinformatics, geostatistics and finance and has a substantial user base for a scientific project (~1k downloads per month). It is, however, currently only compatible with Python 2.x. This project aims to port the existing functionality of RPy2 to Python 3 and to improve integration and performance by taking advantage of new features in Python 3. It will be completed in three stages: porting the RPy2 package to match the functionality of the current release; improving functionality and performance by taking advantage of new features of Python and its C API (MemoryViews, PyCapsules, ordered dictionaries); and lastly, implementing an R graphical device able to interface with Matplotlib.
RPy2 is a package providing a Python interface to the popular statistical package R. Thanks to it, any of the statistical modules in R can be accessed from Python. Since RPy2 is used in a variety of areas such as bioinformatics, geostatistics and finance, this GSoC project would help boost adoption of Python 3.
== Project schedule
Preparation for the project (during 'Community Bonding' period):
* reading documentation (I have some experience with writing C extensions for Python, but I am not so familiar with R internals).
* discussing design and implementation details of the graphical device (final part of the project).
I would then implement the project in the following stages:
* 24.05-13.06 Matching functionality of current version of RPy2.
This part would be easy if it were enough just to replace calls to the old API with their Py3 equivalents. It is not clear, however, when R expects ASCII strings (the 'bytes' type in Python 3) and when it expects Unicode (the default string type in Python 3), and it will segfault when it gets the wrong kind. Fixing this will likely require a lot of debugging and detective work on the R source code. Other issues with interfacing R and Python may also crop up during implementation. (3 weeks total)
* 14.06-11.07 Improving/optimising the integration using new features from Py3 API.
There are a few features in the new Python C API that would fit in well with R's internal data structures:
- MemoryViews could be used to efficiently expose arrays in R
- PyCapsules would be a great wrapper for R's SEXP data type
- Ordered dictionaries are very similar to the Pairlist sexp type (LISTSXP) which R uses for passing function arguments (4 weeks total)
* 12.07-16.07 Testing, last minute bug fixing and finalising documentation
* 16.07 Mid-term evaluation
* 17.07-8.08 Graphical device connecting R and Matplotlib.
Implementing an R device which could interface with Matplotlib would tighten the integration between Python and R (RPy2 is already compatible with NumPy). (4 weeks)
* 9.08-16.08 Testing, last minute bug fixing and finalising documentation.
* 16.08 Firm 'pencils down' date.
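
As an illustration of the Python 3 features mentioned in the schedule above, here is a minimal pure-Python sketch. It is not RPy2 code: the helper to_c_string and all variable names are hypothetical, the array module stands in for an R numeric vector, and PyCapsules are left out because they can only be created from the C API.

```python
from array import array
from collections import OrderedDict


def to_c_string(value, encoding="utf-8"):
    """Hypothetical helper: return bytes suitable for a C-level char* argument."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# str vs. bytes: Python 3 keeps text and raw bytes as distinct types,
# so the conversion has to be explicit at the boundary with C code.
name = "caf\u00e9"
raw = to_c_string(name)
assert isinstance(raw, bytes) and raw.decode("utf-8") == name

# memoryview: a zero-copy window onto a buffer of doubles
# (the array here is only a stand-in for an R numeric vector).
values = array("d", [1.0, 2.0, 3.0])
view = memoryview(values)
view[0] = 10.0  # writes through to the underlying array, no copy made
assert values[0] == 10.0

# OrderedDict: named arguments that keep their insertion order,
# loosely analogous to the tagged elements of an R pairlist.
args = OrderedDict([("x", 1), ("y", 2), ("main", "title")])
arg_names = list(args)
```

The explicit type check in to_c_string reflects the concern raised for the first stage: passing the wrong string type should fail with a Python exception rather than reach the C layer.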
== About me
I completed the first two years of my Bachelor's degree in Computer Science at the Technical University of Lodz, Poland, after which I moved to Denmark in 2008 to study at the Technical University of Denmark (DTU). I have since changed my focus to bioinformatics and biology. I have been working as a student helper since November 2008 and later (since August 2009) as a scientific programmer at the Centre for Biological Sequence Analysis (part of the Department of Systems Biology at DTU). My primary area of focus is the development of software for data management and analysis.
My first experience with programming was when I taught myself C from K&R when I was fourteen. By the time I started my studies, I had completed a few toy projects, including two years of game (MUD) development in LPC, a dialect of C. At the same time I discovered Python and it immediately became my language of choice. I used Python to develop an entry in a programming competition organised by a Polish social networking portal (I wrote a Python wrapper for their API and a small desktop notification app).
After beginning my studies I gained some experience in commercial software development:
I completed a small project in Python for a company managing online orders for restaurants. The application converted Google Checkout orders to text messages which were then dispatched to appropriate restaurants.
In the summer of 2008, I participated in a project at the Dublin Institute of Technology. Along with two other students, I implemented (in C++ using the Qt4 libraries) an application for managing simulations and for parsing, analysing and displaying results in real time (the research area there was quality in VoIP transmissions).
During my studies, focus was placed mainly on programming in C and C++ (most of my courses were at an Electrical Engineering faculty). I excelled in courses which involved study and implementation of algorithms and data structures. My current work involves C++ development for scientific applications where performance is critical, Python for general scripting and R for statistical data analysis.
I also have 5 years of experience with Linux (mainly Debian and later Ubuntu), which I used as my main platform before switching to Mac 1.5 years ago.
Thank you for reading my application. I would be happy to provide any additional details.