jGrid: Grid Management Software for Java

Current release: v0.53.

What is jGrid?

Aside from being a play on the name "Hagrid", jGrid is software that manages parallel processing across a collection of Java VMs. The software employs a very simple multithreading model--all large processing jobs must be broken up into smaller chunks by jGrid clients before being submitted to the grid manager. The grid manager is then responsible for scheduling these jobs for execution on one of the workers on the grid, and communicating the results of that execution back to the grid client. It is also responsible for handling failures of any number of worker machines.

Author's Note: jGrid is a proof-of-concept. Thus, it lacks some of the professional touches such as real documentation and proper object separation. It also lacks serious authentication and a facility to prevent tampering with data, code, or results. These could be added to a future version, should there prove to be enough interest.

Design

Below is a discussion of how the various pieces of jGrid work together. The jGrid package is split into four main components--jobs, clients, workers, and managers. Each section below discusses one of the four components.

Jobs

Jobs are the basic unit of work in jGrid. These are serializable objects that contain code and data ready for transportation across the wire and onto a worker's computer. Each job represents a small, easily parallelizable portion of the original big task. By wrapping code and data into an object, it is quite easy to manage the cluster efficiently. Perhaps the only disadvantage of this approach is that the original client generally becomes responsible for splitting and reassembling the chunks of the job.

Clients

For any easily parallelizable task, big or small, jGrid can help get computation completed faster. All that jGrid requires is that the client split up the huge task into a bunch of smaller jobs, connect to the grid manager, and submit those jobs to the manager. The grid manager will send the jobs out for processing, and asynchronously report the results back to the client.
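As a rough illustration of the splitting and reassembling that falls to the client, here is a minimal sketch written for a current JDK. The class names RangeSumJob and TaskSplitter are hypothetical and are not part of jGrid; the "big task" is simply summing the integers below n, and the jobs are run locally rather than submitted to a manager.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical job type (not part of jGrid): sums the integers in [from, to).
class RangeSumJob implements Serializable {
    private static final long serialVersionUID = 1L;
    final long from; // inclusive
    final long to;   // exclusive

    RangeSumJob(long from, long to) {
        this.from = from;
        this.to = to;
    }

    // The work a worker would perform on its own machine.
    long execute() {
        long sum = 0;
        for (long i = from; i < to; i++) {
            sum += i;
        }
        return sum;
    }
}

// Hypothetical client-side helper (not part of jGrid).
class TaskSplitter {
    // Split the range [0, n) into jobs covering at most chunkSize numbers each.
    static List<RangeSumJob> split(long n, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<RangeSumJob> jobs = new ArrayList<>();
        for (long start = 0; start < n; start += chunkSize) {
            jobs.add(new RangeSumJob(start, Math.min(start + chunkSize, n)));
        }
        return jobs;
    }

    // Reassemble the partial results; for this task that means adding them up.
    static long combine(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }
}
```

In a real client, each RangeSumJob would be handed to the manager instead of being executed locally, and combine would run once every callback has reported in.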

The client interface to the jGrid system is quite simple. The client connects to the manager via RMI and submits a batch of jobs with a callback. When each job is done, the manager uses the callback to inform the client of the results of the job.

Workers

Workers live on the front lines of jGrid--there are large numbers of them, and they do all the work for the cluster. Because the cluster is written in Java, it is very easy to add new workers to the grid! Simply find a machine that has the Java 2 platform installed and network access to the manager, and you can put your idle CPU cycles to use! It doesn't matter if you have Linux, Windows, Solaris or OS X; jGrid runs on them all!

Workers interface with the manager in a similar manner to clients. Upon startup, each worker connects to the grid manager and polls it for available work. If there is work to do, the manager sends the worker some client's job object. The worker then processes the job and signals the manager when the work is complete. Should the worker wish to abort a job, there are facilities for informing the manager and having the work rescheduled onto another worker.

Managers

The Manager component is the most important part of the grid--without it, jobs would never get routed and work would not get done. At the heart of this component is a big table that records, for each job, the job object, which client it came from, the callback to invoke when the job is done, and which worker (if any) is currently handling the job. Calls coming in from clients and workers simply manipulate this table.

For quick access to data, there are three lists maintained in the manager. The main list is a simple table of every job that is under this manager's control. Second, there is a list of jobs that are currently being processed by workers in the grid. Finally, there is a list of jobs that are waiting to be processed. When a job comes in, it is placed on the first and third lists; when it is sent out, it is put on the second list and taken off the third. Upon completion, the job is removed from all lists.
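The bookkeeping can be sketched roughly as follows. This is not the manager's actual code: JobLists and its method names are hypothetical, jobs are reduced to an integer id and a description, and the sketch assumes that a new job goes onto the main list and the waiting list, moves from the waiting list to the in-progress list when it is dispatched, and leaves every list on completion.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical bookkeeping class; the names are illustrative, not jGrid's.
class JobLists {
    private final Map<Integer, String> allJobs = new HashMap<>(); // every job under management
    private final Set<Integer> inProgress = new HashSet<>();      // jobs currently on a worker
    private final Deque<Integer> waiting = new ArrayDeque<>();    // jobs not yet sent out

    // A job arrives from a client: record it and queue it for dispatch.
    synchronized void submit(int jobId, String description) {
        allJobs.put(jobId, description);
        waiting.addLast(jobId);
    }

    // A worker asks for work: hand out the job at the front of the waiting list.
    // Returns null when nothing is waiting.
    synchronized Integer dispatch() {
        Integer jobId = waiting.pollFirst();
        if (jobId != null) {
            inProgress.add(jobId);
        }
        return jobId;
    }

    // A worker reports completion: forget the job everywhere.
    synchronized void complete(int jobId) {
        Integer key = jobId;
        allJobs.remove(key);
        inProgress.remove(key);
        waiting.remove(key);
    }

    synchronized int totalCount() { return allJobs.size(); }
    synchronized int inProgressCount() { return inProgress.size(); }
    synchronized int waitingCount() { return waiting.size(); }
}
```

The methods are synchronized because calls from clients and workers may arrive on different threads at the same time.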

However, the Manager cannot be nearly as simple in design as the client and the worker. Since it is responsible for ensuring that all jobs are eventually processed, it must be able to detect workers that simply disappear off the planet and reschedule their jobs. Currently the manager uses this algorithm: when a job is scheduled to run on a worker, a processing expiration date is calculated for it. Every thirty seconds, a cleanup thread scans the manager's job list, and expired jobs are placed back in the list of jobs that need to be run.
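A minimal sketch of that expiration scheme is shown below. ExpiryTracker, its methods, and the fixed per-job timeout are all assumptions made for illustration; how jGrid actually calculates the expiration date is not described here. Time is passed in explicitly so the logic can be followed without waiting on a real clock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical tracker (not jGrid's class): requeues jobs whose deadline has passed.
class ExpiryTracker {
    private final Deque<Integer> waiting = new ArrayDeque<>();
    private final Map<Integer, Long> deadlines = new HashMap<>(); // running job id -> expiry time (ms)
    private final long timeoutMillis; // assumed: one fixed timeout for every job

    ExpiryTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    synchronized void enqueue(int jobId) {
        waiting.addLast(jobId);
    }

    // Hand the next waiting job to a worker and record when it expires.
    // Returns null when nothing is waiting.
    synchronized Integer dispatch(long nowMillis) {
        Integer jobId = waiting.pollFirst();
        if (jobId != null) {
            deadlines.put(jobId, nowMillis + timeoutMillis);
        }
        return jobId;
    }

    // The worker finished in time, so the deadline no longer matters.
    synchronized void complete(int jobId) {
        deadlines.remove(Integer.valueOf(jobId));
    }

    // The cleanup pass: move every expired job back to the waiting list.
    // Returns the number of jobs that were requeued.
    synchronized int requeueExpired(long nowMillis) {
        int requeued = 0;
        Iterator<Map.Entry<Integer, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                waiting.addLast(entry.getKey());
                it.remove();
                requeued++;
            }
        }
        return requeued;
    }
}
```

One way to drive the cleanup pass would be a java.util.concurrent.ScheduledExecutorService that calls requeueExpired(System.currentTimeMillis()) at a fixed thirty-second rate.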

This design reminds me of the Command pattern...

Using jGrid

As laid out earlier, there are three different components that need to be set up before a jGrid can be put online. Generally, all three components have these platform requirements:

Java 2 Standard Edition
64MB of RAM
500KB of disk space

NOTE: The details of remote class loading have not yet been worked out. For now, class files for all submitted jobs will have to be placed in some publicly accessible location. One suggestion is to put them on a networked filesystem that client, manager, and worker can all access. Another would be to put them on a web server and let the RMI classloader work out the details of downloading them over HTTP; use -Djava.rmi.server.codebase=some_URL to tell the system where the classes can be found.

To install jGrid, simply download the tarball and decompress it. If you are on Unix/OS X/Linux, use tar -xzf jGrid.tar.gz; Windows users can use WinZip. This will create a directory jGrid/. All instructions below assume that you have run cd jGrid/. And yes, you can run all three components on the same machine.

Managers

Since the Manager controls the operation of the entire grid, it must be started before all the other pieces. As with all other parts of jGrid, there is a console-only and a GUI version of this program. The console version will give you raw output and other debugging information; the GUI program tidies up all of the information and presents it as a neat table describing each job, its status, and who (if anyone) is working on it.

NOTE: Other machines MUST be able to connect to TCP/IP ports on this machine! Configure your firewall/router as appropriate, if necessary. Otherwise, the grid will not work.

To start the command-line manager, do this:

$ java -cp . -Djava.security.policy=java.policy grid.manager.Main

To start the GUI manager, do this:

$ java -cp . -Djava.security.policy=java.policy grid.manager.App

Please note the name or IP address of this machine. You will need it when you want to connect workers and clients to this grid's manager. Now let's move on to adding some workers.

Workers

There are three ways to run a worker: command-line, application, and applet. This is to maximize flexibility in getting workers onto the grid.

To start the command-line program, run:

$ java -cp . grid.worker.Main
worker> add servername 1

This creates one worker on your machine and connects it to the manager. To exit the program, type quit at any time.

To start the GUI application, run:

$ java -cp . grid.worker.App

You will see a program window appear with a prompt to add a worker. Fill out the text fields appropriately, and click Add.

Now there is a worker thread, and you can move on to starting a client.

Clients

This is perhaps the most difficult part of setting up the grid--programming it to do what you want. There are several steps involved:

You must create a job object that implements the grid.Workable interface, and the execute method that goes with it.
Next, create a client callback class that extends UnicastRemoteObject and implements JobFinishedCallback and Serializable. All of these heavy declarations are necessary because this callback will be given to the manager when jobs are submitted, but the manager has to call back to this client. Therefore, one must extend the handler code to do something when each job finishes.
Finally, create a class that connects to the manager service, creates grid.manager.Job objects and invokes scheduleJob on the job. Note that you can put the jobs into a List and send them as one big batch.
Build the code and run it. Watch your jobs head off!
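The shape of those pieces can be sketched as below, with the caveat that the real signatures of grid.Workable, JobFinishedCallback, and scheduleJob are not shown in this document. The interfaces SketchWorkable and SketchJobFinishedCallback and the class LocalManagerStub are hypothetical stand-ins that run everything inside one JVM; the callback here is a plain object, whereas a real one would also extend UnicastRemoteObject as described above.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for grid.Workable; the real interface may differ.
interface SketchWorkable extends Serializable {
    Object execute();
}

// Stand-in for JobFinishedCallback; the real method name and arguments may differ.
interface SketchJobFinishedCallback extends Remote {
    void jobFinished(Object result) throws RemoteException;
}

// Step 1: a job object carrying its data and the code to run on it.
class WordCountJob implements SketchWorkable {
    private static final long serialVersionUID = 1L;
    private final String text;

    WordCountJob(String text) {
        this.text = text;
    }

    @Override
    public Object execute() {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

// Step 2: a callback that does something with each finished job's result.
class CollectingCallback implements SketchJobFinishedCallback {
    private final List<Object> results = new ArrayList<>();

    @Override
    public synchronized void jobFinished(Object result) {
        results.add(result);
    }

    synchronized List<Object> results() {
        return new ArrayList<>(results);
    }
}

// Step 3 needs a manager to talk to. This local stub replaces the remote
// manager: it runs each job immediately and reports through the callback.
class LocalManagerStub {
    void scheduleJobs(List<? extends SketchWorkable> jobs, SketchJobFinishedCallback callback)
            throws RemoteException {
        for (SketchWorkable job : jobs) {
            callback.jobFinished(job.execute());
        }
    }
}
```

Against a real grid, the stub would be replaced by the manager's remote object, and the results would arrive asynchronously, in whatever order the workers finish.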

Here is sample client source code if you want to get a head-start.

Test-Drive

If you want to turn your computer into a worker bee, there is a rough web applet to do so. Load jGrid as a Web App.

Future Steps

jGrid has several areas that need improvements. Keep in mind that the software exists as it is now merely to prove that it can be done, and not necessarily done *well*. That said, I have a list of things that may or may not be integrated into the package. Generally speaking, these improvements are to enhance the security or the reliability of the package; aside from that, I would prefer to keep the code body as small and efficient as possible.

The first area of improvement concerns the RPC facility. The current design of the worker requires that, when idle, it continuously poll the manager. Several revisions ago, the design was that the worker would call into the manager, where the call would block until a job became available. RMI, alas, has no way of cancelling the server-side code if the client should disconnect. This needs to be fixed. Moreover, RMI requires an RMI registry to be running on the manager, and the user may not be able to do this.
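For reference, the polling design looks roughly like the loop below. PollableManager, QueueManager, and PollingWorker are hypothetical names, the manager is an in-memory queue rather than a remote object, and the idle-poll limit exists only so that the sketch terminates; an actual worker would presumably keep polling until told to quit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical view of the manager as seen by a worker (not jGrid's interface).
interface PollableManager {
    Callable<Object> pollForJob();     // returns null when there is no work
    void reportResult(Object result);
}

// In-memory stand-in for the remote manager.
class QueueManager implements PollableManager {
    private final ConcurrentLinkedQueue<Callable<Object>> jobs = new ConcurrentLinkedQueue<>();
    private final List<Object> results = new ArrayList<>();

    void add(Callable<Object> job) { jobs.add(job); }

    @Override
    public Callable<Object> pollForJob() { return jobs.poll(); }

    @Override
    public synchronized void reportResult(Object result) { results.add(result); }

    synchronized List<Object> results() { return new ArrayList<>(results); }
}

class PollingWorker {
    // Poll for work, sleeping between empty polls. Stops after maxIdlePolls
    // consecutive empty polls and returns the number of jobs processed.
    static int run(PollableManager manager, long idleSleepMillis, int maxIdlePolls) {
        int processed = 0;
        int idlePolls = 0;
        while (idlePolls < maxIdlePolls) {
            Callable<Object> job = manager.pollForJob();
            if (job == null) {
                idlePolls++;
                try {
                    Thread.sleep(idleSleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop
                    break;
                }
                continue;
            }
            idlePolls = 0;
            try {
                manager.reportResult(job.call());
                processed++;
            } catch (Exception e) {
                // The job failed on this worker; nothing is reported, so a
                // manager-side timeout would be what reschedules it.
            }
        }
        return processed;
    }
}
```

The sleep between empty polls is the cost of this design: a longer sleep means less chatter with the manager but a longer delay before a newly submitted job is picked up.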

Security is another feature that is quite high on the to-do list. jGrid does not provide ANY security outside of whatever the Java security model affords. There is no way to authenticate workers or clients, and no encryption is provided to protect job objects or job results as they pass across the wire.

Performance-wise, the grid software does pretty well. My informal testing has indicated that if a clever programmer makes the job processing time large relative to the time required to talk to the server, then cluster utilization can scale nearly linearly. Unfortunately, there exists no way of matching a job size to a cluster node's capability to process the job. Were this possible, one could submit jobs of varying size and have the manager map them to an appropriately powerful machine. Currently, a worker will receive whatever job happens to be at the front of the list of unprocessed jobs. Lastly, there is no way to control CPU usage on the node, except at the level of the underlying OS.
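To make the point about job size concrete, here is a back-of-the-envelope model. It is an assumption introduced for illustration, not a measurement of jGrid: each job is taken to cost a fixed amount of compute time plus a fixed amount of time talking to the manager, and workers are assumed never to wait on one another. UtilizationModel is a hypothetical name.

```java
// Hypothetical model (not part of jGrid): a worker alternates between
// computing a job and talking to the manager about it.
class UtilizationModel {
    // Fraction of a worker's time spent on useful computation.
    static double utilization(double computeMillis, double overheadMillis) {
        if (computeMillis < 0 || overheadMillis < 0 || computeMillis + overheadMillis == 0) {
            throw new IllegalArgumentException("times must be non-negative and not both zero");
        }
        return computeMillis / (computeMillis + overheadMillis);
    }

    // Estimated speedup over a single machine that pays no communication cost.
    static double estimatedSpeedup(int workers, double computeMillis, double overheadMillis) {
        return workers * utilization(computeMillis, overheadMillis);
    }
}
```

Under this model, jobs that compute for 900 ms and spend 100 ms on communication keep each worker 90% busy, so ten workers would give roughly a ninefold speedup; shrinking the jobs to 100 ms of computation with the same overhead would drop that to about fivefold.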

Conclusion

So that is my cluster software for the Java platform. Feel free to send me feedback, suggestions, patches, or other encouragement. I hope you enjoy it!
