[summerofcode] Application sent
Ian Bicking
ianb at colorstudy.com
Sat Jun 4 09:10:07 CEST 2005
ChunWei Ho wrote:
> Some told me that publishing the idea may not be good, but I thought
> (a) It's not worth stealing
> (b) Python programmers are above this (*naivety* point)
> (c) The potential mentors would be reading this anyway
I think it's good -- one thing that open source has taught me is that
ideas aren't very valuable when compared to implementation (code or
otherwise). And transparency is good.
> Project Title: Data Serving/Collection Framework in Python/WSGI
>
> Proposed Mentor/Sponsoring Organization: Python Software Foundation
>
> Project Description:
> A framework based on bulk data serving/collection via the internet.
> Bulk data are in the form of files that could easily be
> several to several hundred MB (not surveys or simple POST data).
>
> The client has a file repository that it wishes to sync to the server
> (a WSGI application). This server should be able to facilitate
> transfer via a number of protocols, including HTTP file transfer, HTTP
> form upload, FTP, Email.
>
> This project is aimed not at yet another ad-hoc file transfer or p2p
> file-sharing program but as a persistent production setup for
> transferring data from data collection sites/areas to a server,
> possibly via internet through different methods to get through strict
> organizational firewalls and web admins.
>
> Unlike a normal straightforward file transfer application, the
> framework should support:
> + Authentication and encryption
> + Verification scheme for data transfer, retries, etc - MD5 hash compare?
> + Chunking of large files and reassemble on receipt
> + Partial/Resume file transfers support - may depend on nature of data
This can also be part of the file transfer app, using HTTP range support.
Each piece that can be implemented in a generic way will be easier to
decouple, test, and implement. And HTTP has a lot of possible
functionality that's worth implementing directly. For instance, etags
are similar in function to hashes, and there's a standard header for
giving the hash of a body (I don't think it gets much use, though,
because TCP/IP is reliable enough). Encryption can even be done in
terms of SSL with client certificates (though that might be difficult,
as SSL happens at a level that is sometimes hard to get access to,
depending on your server).
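
To make the range and body-hash idea concrete, here is a minimal Python sketch. It is not part of the proposal: the helper names (parse_byte_range, content_md5) are hypothetical, it handles only a single "bytes=first-last" range, and it assumes the body-hash header meant above is Content-MD5, which carries the base64-encoded MD5 digest of the body.

```python
import base64
import hashlib
import re


def parse_byte_range(header, size):
    """Parse a single 'bytes=first-last' Range value for a resource of `size` bytes.

    Returns an inclusive (start, end) tuple, or None if the range is
    malformed or cannot be satisfied. Multi-range requests are not handled.
    """
    match = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not match or size <= 0:
        return None
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form 'bytes=-N': the final N bytes of the resource.
        count = int(last)
        if count == 0:
            return None
        return (max(size - count, 0), size - 1)
    start = int(first)
    # An open-ended 'bytes=N-' runs to the end; clamp an oversized end.
    end = min(int(last), size - 1) if last else size - 1
    if start > end:
        return None
    return (start, end)


def content_md5(body):
    """Return the base64-encoded MD5 digest of `body` (bytes)."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
```

A client resuming an interrupted transfer would ask for "bytes=N-" where N is the number of bytes it already holds, and compare the digest of the reassembled file against the value the server advertises.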
> Also, unlike commercial advanced file transfer programs, the framework does:
> + Supports multiple protocols for transfer HTTP/FTP/Email
> + Automatic identification of files to synchronize (comparison of
> server and client repositories and request automatically)
> + Conditional Processing (triggers - resync file if modified? logic -
> user specified)
> + Robust and considerate client - may be shared machine, means a
> service (I initially designed it for Windows clients - platform choice
> was not up to me) that must be configurable on when it runs, how long
> it runs
> + and if configured limit does not allow client to sync all data -
> what must be synced first (Latest file first, Earliest file first,
> Latest file only, etc). This form of consideration seems to be
> important for running on production sites or factory machines when the
> machine is in use in the day but idle for our use at night, or when
> machines have internet connectivity (possibly dialup) at only certain
> times of the day.
How do you see it as different from rsync? If it's not that different,
that's not so bad -- derivative perhaps, but rsync is very popular and
useful, and you can do a lot worse than copy a useful piece of software.
If the pieces that are used to implement it are decoupled, then that
leaves yourself or other people room to recombine the pieces in novel
ways, while at the same time copying something useful means you'll have
a set of pieces that have proven utility.
This will be especially true to the degree you utilize HTTP's potential.
> Development will be based on WSGI/Paste model, although I will also
> investigate Zope/Cherry/Plone and other frameworks purely for
> comparison or design consideration purposes. WSGI is chosen for small
> learning curve, as well as the fact that data collection for an
> application can be separated from other functions.
I think the benefit of WSGI here -- and I think it is considerable -- is
that it is low-level enough that you don't have to work around places
where the framework isn't intended for how you are using it. This is
especially true of large file support and more advanced HTTP
functionality (like ranges and etags and that sort of thing). Lots of
frameworks are notably bad at large files in particular.
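
As a rough illustration of the large-file point, a bare WSGI application can read the request body from wsgi.input in fixed-size chunks, so a several-hundred-MB upload never has to sit in memory at once. This is only a sketch under assumptions: make_upload_app, sink_factory, and the 64 KB chunk size are hypothetical choices, and authentication, error handling, and storage layout are left out.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune for the deployment


def make_upload_app(sink_factory):
    """Build a WSGI app that streams the request body into a writable sink.

    `sink_factory` is a hypothetical callable returning a file-like object
    with a write() method (for example, an open file in binary mode).
    """
    def app(environ, start_response):
        length = int(environ.get("CONTENT_LENGTH") or 0)
        stream = environ["wsgi.input"]
        digest = hashlib.md5()
        sink = sink_factory()
        remaining = length
        while remaining > 0:
            # Never read past CONTENT_LENGTH; WSGI servers may block if we do.
            chunk = stream.read(min(CHUNK_SIZE, remaining))
            if not chunk:
                break  # client stopped sending early
            sink.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
        received = length - remaining
        status = "200 OK" if remaining == 0 else "400 Bad Request"
        # Report bytes received and their MD5 so the client can verify.
        body = ("%d %s\n" % (received, digest.hexdigest())).encode("ascii")
        start_response(status, [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

Because the application only sees environ and start_response, the same function can sit behind any WSGI server, and the digest in the response gives the client something to compare against before deciding whether to retry.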
--
Ian Bicking / ianb at colorstudy.com / http://blog.ianbicking.org