[summerofcode] Lotus IDVFS Application

Brendan Kohler gtg902n at mail.gatech.edu
Tue Jun 14 06:04:19 CEST 2005


Hi, I'm adding my application here as I could not get the entire thing to fit in
the provided text box in the Google application. I also put a note in the
application I submitted (well, the final application I submitted) on where to
find the rest of my application (on my website or here).

Anyway, here is the full application...

ABSTRACT:
The goal of the Lotus project is to create an Intelligent, Distributed, Virtual
File System (IDVFS) that is portable across most operating systems and hardware
configurations.

The main tenets of the Lotus Project are that the distributed file system should
be:

1. Portable
2. Secure
3. Redundant
4. Intuitive
5. Scalable
6. Intelligent

To achieve portability the project will simply use two cross-platform languages
that fit perfectly: Python and SQL.

Security is a complex area, and the system will therefore accommodate many
different levels of security to fit the user’s requirements.

Redundancy will be achieved by designing the system to adapt to availability of
different network resources and will mirror files on different servers
intelligently.

The Lotus system should not just be intuitive for the end user; it should be
intuitive for the administrator to set up and customize as well. As such, the
system will feature an extensive set of administrative tools, as well as an
advanced system of permissions and a file structure that will benefit both
administrators and end users. There is no reason to confuse users and decrease
productivity with rigid and archaic systems.

Scalability is a major issue and there is no reason that Lotus should not be
suitable for applications ranging from large networks with geographical
separation to networks in a single household. The Lotus file system will be
designed to fit the needs of most network configurations, including complete
integration across subnetworks of multiple servers and levels.

Intelligence is needed to fulfill the design requirements, and the Lotus system
will feature a NetAI system that will manage load balancing, network topology
and server disk systems for maximum availability of network resources.


NEED AND COMMUNITY IMPACT:
At this point, you may be asking why anyone would need this. The truth is, this
system has many applications ranging from acting as the backend for web
applications to home network integration. Consider how nice it would be to
connect your home network of 5+ computers and view all the files as though they
were part of a single file system, with the ability to “hot plug” and sync your
laptop with the rest of the network. Imagine being able to search your entire
network (featuring machines with different OS’s) without having to search each
individual machine. Consider the impact of being able to use a network in a
business that supports workgroups and special permissions with complete
synchronization between geographically separated locations and redundancy similar
to RAID across an entire network. Rather than continuing to list every
application I can imagine (which is a lot ;) ), in the interest of space I have
presented three case studies at the last section of the document (entitled “CASE
STUDIES”).

What are the specific benefits to the Python community? First, the ability to
have a managed Python backend integrated into a web application written in
Python is of immense value in my opinion. The ability to scale this backend from
single systems to clusters also makes this system future-proof. The second
benefit to the community is the free availability of modules I am creating to
assist in the creation of this project. These modules include a Python-only
shared memory system for IPC and a simplified IPC socket system that can be used
as easily for IPC on a single system as for IPC on a network of systems.


CURRENT PROGRESS:
This project is a large undertaking. I know very well the limitations of time
for the Summer of Code and the rate at which I work. I could never hope to
complete this project in anywhere near the amount of time given in the Summer of
Code, were I to start from scratch. The fact is, however, that I first began
design of this project in December and have been slowly making progress ever
since. Most work has been in design, not coding, but I have completed coding and
recoding several systems (including the CWS). Currently I have a working CWS and
many other parts done. I’ve also gotten most of my IPC issues and SQL
configurations worked out. In short, I’m right in the middle of the project and
beginning work on the second major subsystem (NetAI/S). The goals I will set in
the following GOALS section will be attainable with roughly thirty hours per
week of work on this project.


GOALS:
Due to the size of this project, my goals are more modest than the completion of
the whole system in its entirety. The following are my goals:

The system will be able to serve multiple client connections to a single server
and each client will be able to perform account management and file management.
This requires a fully working CWS, Client Handler and NetAI/S excepting the
server-server network connectivity logic and security features such as
encryption. Also, error logging will be implemented, though the rest of the
administration subsystem will remain incomplete.


PERSONAL INFORMATION:
I am a CS and Physics student at the Georgia Institute of Technology. I have
been programming with Python since August of 2004 and have taken two classes in
school on it. One was a basic Python class and the second was an independent
study in Python, granted by the CS department and overseen by the professor
who teaches the basic Python class. This summer I am taking two fairly intense
classes (Modern Physics and Differential Equations) and unfortunately have to
choose between schoolwork and my project since I have to work to pay
out-of-state tuition. I’ve applied for the Summer of Code in hopes that I can work on
something I’m really excited about instead of having to set my project aside for
a regular job and still get my school work done. In preparation (should I get
chosen; I didn’t expect there would be quite so many submissions as there
apparently are) I have registered a domain for the purpose of my project
(pycoder.com) and am designing a website so that I may do blog-style updates on
my progress as I work.


SYSTEM ARCHITECTURE:
The system is divided into two sections, referred to as the Server System and
Management.

A. Server System:
The Server System consists of three loosely coupled subsystems, the CWS,
NetAI/S, and Client Handler.

   I. CWS (Cache Write System)
The CWS handles caching of files and writing to disk and SQL. The system avoids
concurrency issues in which a client tries to read a file being written or two
people try to write to the same file at once. Also, it keeps recently accessed or
heavily used files in memory for quick access, which is especially important
should the file be stored on a different server.

   II. NetAI/S (Network AI/Security)
The security portion of the system handles accounts and permissions for
clients and provides security services such as encryption. The AI portion of the
system handles request routing, storage management and load balancing through
careful monitoring of all the servers on the network.

   III. Client Handler
The Client Handler system connects a client to the rest of the network and
handles permissions enforcement, request parsing, and various things like
searches and traversing directories. An instance of the Client Handler is
spawned for every client, taking care of all requests for that single client and
providing updates about files changed and accounts modified.


B. Management System:
The management system consists of two subsystems: the Administration/Logging and
the Socket Server.

   I. Administration/Logging:
This subsystem can communicate with any other subsystem, forcing commands,
performing diagnostics, and recording errors and client activity in a log.

   II. Socket Server:
This subsystem listens for connections from the WAN and logs in users,
redirecting them to another free port and spawning an instance of the Client
Handler to handle the connection. In a way the Socket Server acts as a buffer
between the Lotus IDVFS and the WAN, ensuring only logged in users can get
access to the file system.



DETAILED SYSTEM ARCHITECTURE:
This section describes the layout of each subsystem in more detail.

I. CWS
This system uses a subscription method where it establishes a connection with a
new client process and waits for commands, which it decodes and then places onto
a queue which the main thread uses to distribute commands to the three
subsystems: Cache, SQL writer and disk writer. Any command to write to disk
results in the caching of the binary data in the Cache where any subsequent
requests to access that file will be handled until the file has been written to
disk. Caching is handled by a multithreaded structure that stores the binary
data in indices as objects. Two management threads weigh accesses against time
in cache to determine which cache objects to release and which to keep, and in
some cases which to transfer to a longer-term cache.
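
As a rough illustration of weighing accesses against time in cache, here is a
minimal single-threaded sketch. The class names, the scoring formula (accesses
per second of residency) and the fixed entry limit are all assumptions made for
the example; they are not the actual CWS code, which uses management threads
and a longer-term cache tier.

```python
import time


class CacheEntry:
    """One cached file: the binary payload plus access bookkeeping."""

    def __init__(self, data, now):
        self.data = data
        self.created = now
        self.accesses = 0


class ScoredCache:
    """Hypothetical cache that releases the entry with the worst access rate."""

    def __init__(self, max_entries, clock=time.monotonic):
        self.max_entries = max_entries
        self.clock = clock  # injectable so the policy can be exercised in tests
        self.entries = {}

    def put(self, key, data):
        # Make room before inserting so a brand-new entry is never the victim.
        if key not in self.entries:
            self._make_room()
        self.entries[key] = CacheEntry(data, self.clock())

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry.accesses += 1
        return entry.data

    def score(self, key):
        # Assumed weighting: accesses per second in cache (+1 avoids dividing by 0).
        entry = self.entries[key]
        age = self.clock() - entry.created
        return entry.accesses / (age + 1.0)

    def _make_room(self):
        # Release lowest-scoring entries until one more entry fits.
        while self.entries and len(self.entries) >= self.max_entries:
            worst = min(self.entries, key=self.score)
            del self.entries[worst]
```

With a limit of two entries, a file that has been read several times survives
the arrival of a third file, while an untouched file of the same age is
released.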

II. NetAI/S
This system really consists of three subsystems. The first subsystem (AI)
handles load balancing, request routing and where to write files onto the disks.
The AI also handles the transfer of data between servers and between different
tiers. The second subsystem handles the file system and accounts. It connects to
SQL and can display or modify the accounts, workgroups and special permissions
for files, folders and users. The third subsystem handles encryption, session
keys and any other security measures that the Lotus system must use to meet a
requirement. This third subsystem will be designed to be easily extensible
depending on the user’s requirements. The NetAI/S, like the CWS, uses a
subscription system and places all requests in a queue to be handled by a main
thread.
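
The queue-and-main-thread pattern described above can be sketched roughly as
follows. This is only an illustration under assumed names: the request kinds
("route", "account"), the handler functions and the None shutdown sentinel are
hypothetical and do not come from the Lotus code.

```python
import queue
import threading


def run_dispatcher(requests, handlers, results):
    """Main loop: pull (kind, payload) requests off the queue and route them."""
    while True:
        item = requests.get()
        if item is None:  # assumed sentinel meaning "shut down"
            break
        kind, payload = item
        handler = handlers.get(kind)
        if handler is None:
            results.append(("error", "unknown request: %s" % kind))
        else:
            results.append((kind, handler(payload)))


def demo():
    requests = queue.Queue()
    results = []
    handlers = {
        # Stand-ins for the AI and account subsystems.
        "route": lambda path: "server-1:" + path,
        "account": lambda user: user.lower(),
    }
    worker = threading.Thread(
        target=run_dispatcher, args=(requests, handlers, results)
    )
    worker.start()
    requests.put(("route", "/photos/a.jpg"))
    requests.put(("account", "Alice"))
    requests.put(("bogus", None))
    requests.put(None)
    worker.join()
    return results
```

Because a single thread consumes the queue, requests are handled strictly in
the order they were submitted, which keeps the subsystems free of locking.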

III. Client Handler
This system consists of two parts. The first part enforces permissions and
prevents the client from getting data or making requests it should not be able
to. All communications between the Client Handler and NetAI/S or CWS are
filtered through here. The second part interacts with the first part
and handles queries through SQL, changing permissions on files (where the user is
allowed to), searching for files, directory walking and logging in and out from
the system. An instance of the Client Handler is spawned for each client
connection and also stores data like current directory and user permissions (for
enforcement).
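
A minimal sketch of the per-client state and the permission filter might look
like this. The class layout, the permissions mapping (absolute path to a set of
allowed actions) and the backend callable are assumptions for illustration
only; the real permission data lives in SQL.

```python
import posixpath


class ClientHandler:
    """Hypothetical per-client handler: tracks state and filters requests."""

    def __init__(self, user, permissions):
        self.user = user
        # Assumed layout: absolute path -> set of actions the user may perform.
        self.permissions = permissions
        self.cwd = "/"

    def resolve(self, path):
        # Interpret relative paths against the client's current directory.
        return posixpath.normpath(posixpath.join(self.cwd, path))

    def change_directory(self, path):
        self.cwd = self.resolve(path)

    def request(self, action, path, backend):
        # Enforce permissions before anything reaches the CWS or NetAI/S.
        target = self.resolve(path)
        if action not in self.permissions.get(target, set()):
            raise PermissionError(
                "%s may not %s %s" % (self.user, action, target)
            )
        return backend(action, target)
```

Here backend stands in for whatever forwards the request to the CWS or
NetAI/S; a request the user is not permitted to make never reaches it.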

IV. Administration
This system allows the complete control of all other subsystems through special
UDP connections in every subprocess providing full access to the subsystem’s
internals. This port also serves to send the Administration system errors and
important information that should be entered into the log files. The
Administration system, when its design is finalized, will be fully featured and
allow a complete view of what is going on over the entire network.
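
To illustrate a subsystem reporting an error to the Administration system over
UDP, here is a small sketch using the standard socket module. The message
format ("subsystem: message"), the function names and the loopback address are
assumptions; the actual administration protocol is not finalized.

```python
import socket


def open_admin_listener(host="127.0.0.1"):
    """Open a UDP socket on an OS-chosen port for incoming log messages."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))
    sock.settimeout(2.0)  # do not block forever if nothing arrives
    return sock


def report(admin_address, subsystem, message):
    """Send one log line from a subsystem to the administration listener."""
    line = "%s: %s" % (subsystem, message)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), admin_address)


def receive_one(listener):
    """Read a single log line; a real logger would append it to a log file."""
    data, _sender = listener.recvfrom(4096)
    return data.decode("utf-8")
```

UDP gives no delivery guarantee, so a real logger would need to tolerate lost
messages or add acknowledgements on top.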

V. Socket Server
This system acts as an independent server, handling requests to connect and
checking user account/password before spawning an instance of the Client Handler
and redirecting the client connections to it. The Socket Server also handles
determining which ports are free and how to distribute them.
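
One common way to find a free port is to bind to port 0 and let the operating
system choose. The sketch below shows that technique only; whether the Socket
Server allocates ports this way is an assumption, and the function name is
hypothetical.

```python
import socket


def reserve_free_port(host="127.0.0.1"):
    """Bind a listening TCP socket to port 0 so the OS picks a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))
    listener.listen(1)
    # getsockname() reports the port the OS actually assigned.
    return listener, listener.getsockname()[1]
```

Keeping the listening socket open while the client is redirected avoids the
race where another process grabs the port between discovery and use.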

VI. Dependencies
The Lotus IDVFS depends on two specifically constructed SQL databases and
therefore needs an SQL server present on each machine it is installed on.
Likewise, Python must be installed on every system. Currently only one Python
package is required: mysql-python. The only OS requirement is that the system
supports Python, since a major goal is portability.

VII. Permissions Structure
The permissions system is different from what you would normally find on a server or
file system. Users, files, folders and workgroups all have their own sets of
permissions and options. Files, for example, have the ability to require a
password to gain access, and folders have permissions defining (for a user) the
default state of the folder, the files within the folder, and the subfolders.
States can range from fully modifiable to invisible. Workgroups can be assigned
to users to enforce a default set of permissions, and individual permissions for
a user can override other permissions (but not passwords). The system as
designed is a bit lengthy to describe here, but offers complete protection and
customizability not found in the simple permissions of Unix or Windows.
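
The override rules described above can be sketched as follows. This is a
simplified illustration: the READ_ONLY state is an assumed intermediate level
between the two states named above, the function names are hypothetical, and a
real system would compare hashed passwords rather than plain strings.

```python
from enum import IntEnum


class State(IntEnum):
    INVISIBLE = 0
    READ_ONLY = 1   # assumed intermediate state, for illustration
    MODIFIABLE = 2


def effective_state(workgroup_default, user_override=None):
    """An individual user permission, if set, overrides the workgroup default."""
    if user_override is not None:
        return user_override
    return workgroup_default


def can_open(state, file_password=None, supplied_password=None):
    """Passwords are checked separately, so no permission override bypasses them."""
    if state == State.INVISIBLE:
        return False
    if file_password is not None and supplied_password != file_password:
        return False
    return True
```

For example, a user whose workgroup defaults to READ_ONLY but who holds an
individual MODIFIABLE permission still cannot open a password-protected file
without supplying the password.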


CASE STUDIES:

Case Study I: Web Application Framework
Suppose you are trying to build a web application that holds a public database
of Python modules where registered users can search categories of code and
download modules. Obviously a limitation must be applied to this database: Only
the submitter of the code should be able to modify or delete the module from the
database (excepting administrators). This database is expected to be large and
needs to be distributed across several servers that are not necessarily in the
same geographic location.

Solution: By installing the Lotus IDVFS on each of the servers, no matter their
location, you can create a database with all the features needed. The only thing
that really needs to be coded in this situation is the web interface (which acts
as the client).


Case Study II: Chain of Photo Stores
The Lotus IDVFS has its roots in this problem. The owner of a photo chain in the
Midwest, who happens to be a relative of a close friend, had a problem with a
huge database of photoshoots he had to maintain and was concerned about
employees who had sometimes been stealing the photographs
and selling them online. The owner needed a system where he could have a set of
geographically separated servers working together to handle the load from the
stores, and store the photos with encryption so that employees would have a
harder time stealing. In addition, the owner wanted a system where he could
monitor the activity of his employees and control their access to different
photo shoots; even denying former employees access should they try to access the
system from non-company computers.

Solution: The Lotus IDVFS addresses all of these problems. The only thing that
the owner would need to create after installing the Lotus IDVFS on his servers
is the client application to be installed on company computers. Note: Though the
idea for the Lotus system was derived from that discussion, I was never involved
in creating a solution for the actual company, and the owner’s requirements did
seem to err on the side of paranoia. Nevertheless, the Lotus IDVFS does offer
the complete solution to this scenario.


Case Study III: The Home Network
Consider a person having five or more computers on their home network, all with
varying amounts of storage and capability. One of these is a laptop, which the
owner would like to sync with the network whenever he is at home (automatically
backing up the replaced files). Also, the person has children who should have
restricted access and an easy interface for accessing their files.
Additionally, this person wants to run a web application off the network and
have it serve webpages, etc.

Solution: The Lotus IDVFS can be installed on all systems and be configured in
such a way to easily satisfy the needs of the family with user accounts, private
files, and syncing of systems that come on and off the network. A simple GUI
interface can easily be made as a client and connect to the network. Also, a
second client can be created to serve web applications, with different
capabilities than the GUI client. Note that the Lotus IDVFS can be configured to
use clients of many different types as long as they all conform to the API and
each have a user account created. The system can even act as a distributed web
server should you really want that kind of thing.

