CS 739 - Distributed Systems - Spring 2012 | Main / DistributedSystemsProgramming

The goal for this project is to implement a reliable distributed system.

You should do this project in groups of 3 or 4.

The project is due on Thursday, March 25th.

Clarifying notes

March 24th:

If your virtual machine is large (> 500 MB), then I would prefer to have a single copy that I can then duplicate three times. If your virtual machine is smaller, then it would be easier for me to have 4 copies of it, so I do not need to configure them all myself. Please include in your turnin email what you choose.
I am planning on using DHCP to assign dynamic addresses to your virtual machines. If you choose static addresses, please choose them in the range 192.168.8.160-250.

March 23rd:

The project is due at 9 pm on Thursday, March 25th. Please send instructions on obtaining of your virtual machine to the instructor. In addition, if any additional work is needed to configure copies of the VM, or to start your service, include those instructions as well.
The writeup is officially due with the project, but can be turned in until 9 pm on March 28th with no penalty.
If you do not use the provided VMs, I request that you at least compress your virtual machine. Ideally, it should be less than 1 GB.
A 32-bit Ubuntu server virtual machine can be found in ~cs739-1/public/ubuntu-32-739.tgz. Directions for its use (please let me know if they do not work) are in README-VM.txt in the same directory.
A script for partitioning the network is available in ~cs739-1/public/dopartition.sh. To use it, you'll have to edit the sudoers file by running sudo visudo and adding the line: %admin ALL=(ALL) NOPASSWD:/sbin/iptables at the end.

March 18th:

In addition to turning in your virtual machines, you should also turn in:
1. A short (few pages) description of your system, explaining your design and how you trade off consistency, availability, partition tolerance, and performance.
2. A test script that invokes the client program through its documented interface, This program should try to expose the limits of the service. You can assume that some of the virtual machines/servers will be killed during the test or partitioned. It should print out information about the availability (how long it takes to reply when there is a failure) and consistency (when results are not strictly consistent)
There is some confusion about the hosts/ports specified on the command line. The client should take a list of comma-separated hostname:portnumber pairs. You can assume that an external failure detector thinks these nodes are running, although they may fail between when the command is launched and when the client starts sending packets. If you want ignore the list, or just parse the first element in the list and ignore the others, that is fine. The client need not make use of the "servers.txt" file.
A 64-bit Ubuntu server virtual machine can be found in ~cs739-1/public/ubuntu-739.tgz. Directions for its use (please let me know if they do not work) are in README-VM.txt in the same directory.
Scripts for copying files up to the virtual machines and for launching servers are in the same directory and names update.sh and launch.sh

The client can read in the file "servers.txt" to find the addresses of servers. The client command line will take a comma-separated list of "-h hostname:port,hostname:port" parameters to indicate servers that are likely to be running.

In testing, we will invoke your client program from a script. While your service should be accessible from an unintelligent client (that just sends requests to any node), you can assume that high performance or high consistency require running your client.

The four servers all run in distinct virtual machines. The client runs on a standard CSL workstation. The client and the four VMs are the only machines in the system.
The client is stateless, yes (but it could certainly download data from a server if it wanted to)
The hostnames are servers, and the client can only talk to those servers.

You can assume that:

Keys are valid printable strings without special characters and are 128 or less characters in length. They cannot include the "[" or "]" characters.
Values are valid printable strings without special characters (or UUencoded) and 2048 or less characters in length. They cannot include the "[" or "]" characters.

Service

The service to provide is a simple key-value store, where keys and values are strings. You will use the HTTP protocol for access to the service. You service should be as consistent as possible, so that requesting a value should return the most recently value set as often as possible. Furthermore, your service should tolerate failures of a process, node, or network.

The service should run on between 1 and 4 nodes, and should continue to provide service when only some of the nodes are available. You can assume the complete set of nodes is provided when your service starts. A client should be able to request service from any machine running your service.

You can assume that at least one node will always be running, so you do not need to store data persistently (unless you want to). You can also assume that the total size of all the data will be small (less than 10 megabytes).

Fault tolerance

The goal of this work is to get experience implementing fault-tolerant services. Hence, the primary criteria for your work is the ability whether your service can continue to provide consistent results in the presence of failures.

To implement this, you will need to implement some form of replication, which ensure that data is available even when one of the machines goes down.

Here is a survey paper on replication, and a PowerPoint presentation on replication in databases.

You may also want to implement some kind of failure detector, so your service knows when its peers are unavailable. Here is a bibliography of failure detectors.

You may find it helpful to read about Amazon's Dynamo system.

Failures

Your service should tolerate:

System crash failures by stopping/unplugging the machine running your service
Process crash failures by killing the processes implementing your service
Network failures where the network connecting two machines is cut (virtually or physically).

Implementation

You may use any implementation language for the server and client, as long as it implements the required command line interface and protocol. For example, you can use regular sockets, RPC Google's protocol buffers for communication between your servers, or use the existing HTTP implementation. We will provide a simple implementation of a web server from which you may start, if you wish.

You should write a script to start/configure your server.

Your server should read in a file in the current directory named "servers.txt" to learn of the other instances of the service. This file will contain the set of servers and the ports to listen on. The first line will refer to the server itself:

For example

  192.168.2.2:8081
  192.168.2.3:8081
  192.168.2.4:8081
  192.168.2.5:8081

Protocol

The protocol is extremely simple. Clients messages are:

To retrieve a value, a client sends:

    GET [key] HTTP/1.1

To set a value, a client sends:

   GET [key]=[value] HTTP/1.1

The response contains HTTP headers followed the response:

  HTTP/1.0 200 OK
  Server: server type
  Content-Length: length
  Content-Type: text/plain
  [key]=[value]

with the old value of the key. If there is no value, [value] should be replaced with NIL:

  [key]=NIL

You can assume that:

Keys are valid printable strings without special characters and are 128 or less characters in length. They cannot include the "[" or "]" characters.
Values are valid printable strings without special characters (or UUencoded) and 2048 or less characters in length. They cannot include the "[" or "]" characters.

If the server must return an error, it should return a different error code using HTTP, something like:

  HTTP/1.0 500 Internal Server Error

Clients

Standard web clients (as we learned from consistent hashing and LARD are not good at choosing between multiple replicas.

We will provide a initial code ~cs739-1/public/P1 that provide a simple command line to send HTTP requests. You should modify the client to implement these options:

-h host:port[,host2:port2[,host3:port3]] specify one or more host names and port numbers
-k specify the key
-v specify the value

The client can also read in the server.txt file from the current directory to learn about the addresses of all servers.

If only a key is specified, you should return the current value for the key or NULL if there is no value. If a key and value are specified, then you should set the value and return the old value, if there is one, or NULL.

If more than one host name is set, your client should choose which host to use.

In testing, we will invoke your client program from a script. The output of the client should be just the data returned from the server, or an error message formatted like this:

  ERROR: error message

While your service should be accessible from an unintelligent client (that just sends requests to any node), you can assume that high performance or high consistency require running your client.

Testing platform

You should test your code using VMware or QEMU. To simulate multiple physical machines, you can run multiple virtual machines simultaneously.

Hints

Here are some standard distributed system techniques you may want to look at:

Leases
Quorom-consensus replication
|The CAP theorem -- you cannot build a system that is always consistent, always available, and can survive partitions
Failure detectors

Evaluation

We will test the consistency of the system under heavy load to each server individually, and under the failure conditions described above. In addition, we will look at the downtime (period of unavailability) during failure.

the availability of the service: the ability whether your service can continue to provide consistent results in the presence of failures
the consistency of your service, measured by how often it returns incorrect (out of date) values

In addition, we will consider (as secondary considerations):

the scalability of the service; meaning the impact of having more copies on performance
the efficiency of your service, measured by bytes/messages transmitted between servers