
Instructor

  • who: Michael Swift
  • where: Room 7369
  • when: Wed. 2:30-3:30, Thu. 1:30-2:30
  • email: swift 'at' cs.wisc.edu

Lecture:

  • when: Tues./Thur. 11-12:15
  • where: Chemistry 1351
  • list: compsci739-1-s12 'at' lists.wisc.edu

Project1

The goal for this project is to implement a reliable distributed system.

You should do this project in groups of 3 or 4.

Notes

  • The service is due on Thursday, March 22nd at midnight; test results and the full writeup are due Thursday, March 29th at 5 pm.
  • The description of what to turn in was updated on March 22nd.

Service

The service to provide is a simple key-value store, where keys and values are strings. Your service should be as consistent as possible: a request for a value should return the most recently set value as often as possible. Furthermore, your service should tolerate failures of a process, node, or network.

The service should run on between 1 and 4 nodes, and should continue to provide service when only some of the nodes are available. You can assume the complete set of nodes is provided when your service starts. A client should be able to request service from any machine running your service.

You can assume that at least one node will always be running, so you do not need to store data persistently (unless you want to). You can also assume that the total size of all the data will be small (less than 10 megabytes).
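
As a concrete starting point, here is a minimal sketch (in Python, used for all examples on this page) of the in-memory store each node might keep locally. The class and method names are illustrative, not part of the specification; replication and the network interface are layered on top.

    # Minimal in-memory key-value store: the local state one node maintains.
    # The real service must wrap this with replication and a network protocol.
    class KVStore:
        def __init__(self):
            self.data = {}

        def set(self, key, value):
            # Store value under key and return the value it replaced (or None).
            old = self.data.get(key)
            self.data[key] = value
            return old

        def get(self, key):
            # Return the most recently set value for key, or None if unset.
            return self.data.get(key)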

Fault tolerance

The goal of this work is to get experience implementing consistent, fault-tolerant services. Hence, the primary criterion for your work is whether your service can continue to provide consistent results in the presence of failures.

To achieve this, you will need to implement some form of replication, which ensures that data remains available even when one of the machines goes down.
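
For instance, in a simple primary/backup scheme (one reasonable design among several; the send_to_backup helper below is hypothetical and stands in for whatever RPC or socket call you use), the primary applies each write locally and forwards it to every reachable backup:

    # Sketch of primary/backup write propagation, assuming send_to_backup()
    # raises ConnectionError when a backup is unreachable. A full design must
    # also handle failure of the primary itself (e.g. by electing a new one).
    def replicated_set(store, backups, key, value):
        store.set(key, value)                  # apply the write locally
        acked = 0
        for b in backups:
            try:
                send_to_backup(b, key, value)  # forward the write
                acked += 1
            except ConnectionError:
                pass                           # backup is down or partitioned
        return acked                           # caller decides if this is enough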

Here is a survey paper on replication, and a PowerPoint presentation on replication in databases.

You may also want to implement some kind of failure detector, so your service knows when its peers are unavailable. Here is a bibliography of failure detectors.
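
A simple heartbeat-based detector is one common choice (sketched below; the TIMEOUT value is an assumption to tune for your network). Each node records when it last heard from each peer and suspects peers that have gone quiet. Note that such a detector cannot distinguish a crashed peer from a partitioned one.

    import time

    TIMEOUT = 3.0  # seconds of silence before a peer is suspected (tune this)

    # last_heard maps peer -> time.time() of its most recent heartbeat;
    # the message-handling code updates it whenever a heartbeat arrives.
    def suspected_failed(last_heard, peers):
        # Return the set of peers not heard from within TIMEOUT seconds.
        now = time.time()
        return {p for p in peers if now - last_heard.get(p, 0.0) > TIMEOUT}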

You may find it helpful to read about Amazon's Dynamo system.

The four servers all run in distinct virtual machines. The client runs on a standard CSL workstation. The client and the four VMs are the only machines in the system.

Specification

The specification of the service is on the Project 1 Specification page. This will initially be a proposed specification, in that the class may edit it to converge on a better one.

You can assume that:

  • Keys are valid printable strings without special characters and are 128 characters or fewer in length. They cannot include the "[" or "]" characters.
  • Values are valid printable strings without special characters (or are UUencoded) and are 2048 characters or fewer in length. They cannot include the "[" or "]" characters.
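
A sketch of how a server or test harness might enforce these limits. What counts as a "special character" here is an assumption; check the final specification page before relying on it.

    import string

    # Assumed character set: printable ASCII minus whitespace and the
    # forbidden brackets. Adjust to match the agreed specification.
    ALLOWED = (set(string.ascii_letters) | set(string.digits) |
               set(string.punctuation)) - {'[', ']'}

    def valid_key(k):
        return 0 < len(k) <= 128 and all(c in ALLOWED for c in k)

    def valid_value(v):
        return 0 < len(v) <= 2048 and all(c in ALLOWED for c in v)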

Failures

Your service should tolerate:

  • System crash failures, simulated by stopping or unplugging the machine running your service
  • Process crash failures, simulated by killing the processes implementing your service
  • Network failures, where the network connecting two machines is cut (virtually or physically).

In cases where the servers do not fail, your service should provide strong consistency even if the client is unable to communicate with all the servers.
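
One standard way to get strong consistency without reaching every server is quorum replication: if each write is acknowledged by W of the N servers and each read consults R servers, then R + W > N guarantees that every read quorum overlaps every write quorum, so a read always sees the latest acknowledged write. A quick check of the arithmetic for this project's sizes:

    def quorums_intersect(n, r, w):
        # Any R-server read set must share a server with any W-server write set.
        return r + w > n

    # With 4 servers, writing to 3 and reading from 2 always overlaps...
    assert quorums_intersect(4, r=2, w=3)
    # ...but reading from only 1 server after writing to 3 may miss the write.
    assert not quorums_intersect(4, r=1, w=3)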

Implementation

You may use any implementation language for the server and client, as long as it implements the required command-line interface and protocol. For example, you can use regular sockets, RPC, or Google's protocol buffers for communication between your servers, or use an existing HTTP implementation. We will provide a simple implementation of a web server from which you may start, if you wish.
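
If you pick HTTP, for example, Python's standard library is enough for a bare-bones single-node server (a sketch only: using the URL path as the key is an arbitrary choice here, not the required interface, and there is no replication yet).

    from http.server import BaseHTTPRequestHandler, HTTPServer

    store = {}  # stands in for the replicated store

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):                 # read: GET /<key>
            value = store.get(self.path.lstrip('/'))
            self.send_response(200 if value is not None else 404)
            self.end_headers()
            if value is not None:
                self.wfile.write(value.encode())

        def do_PUT(self):                 # write: PUT /<key>, value in the body
            length = int(self.headers['Content-Length'])
            store[self.path.lstrip('/')] = self.rfile.read(length).decode()
            self.send_response(200)
            self.end_headers()

    HTTPServer(('', 8080), Handler).serve_forever()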

You should write a script to start/configure your server.
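
A sketch of such a script, again in Python; the node names, server binary, and flags are placeholders for your own setup.

    import subprocess

    NODES = ['node1', 'node2', 'node3', 'node4']  # placeholder VM hostnames

    # Start the server on every node, passing each one the full membership
    # list, since the complete set of nodes is known when the service starts.
    for node in NODES:
        subprocess.Popen(['ssh', node, './kvserver',
                          '--id', node, '--peers', ','.join(NODES)])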

Testing platform

You should test your code with VirtualBox. To simulate multiple physical machines, you can run multiple virtual machines simultaneously.

There are pre-configured 32-bit Ubuntu 11 images available in AFS: /p/course/cs739-swift/public/projects/p1/images. The source ISO is in the adjacent 'iso' directory. Instructions on how to use the image are in a README.

Hints

Standard distributed system techniques worth looking at include the replication schemes and failure detectors referenced in the Fault tolerance section above.

Testing

In addition to developing the service, you will also develop tests that should work with any service that implements the specification. Each group will be asked to test the services of a few other groups.

Your tests should be able to verify the consistency/fault tolerance properties of a service implementation. For example, if you test a project that claims to provide perfect consistency and availability in the presence of partition failures, you could try concurrently writing to both sides of the partition and look for inconsistencies.
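
A sketch of that test, assuming a hypothetical client library with kv_set(server, key, value) and kv_get(server, key), plus helpers that wrap the partition script described under Scripts below:

    # side_a and side_b are servers that end up on opposite sides of the
    # partition; make_partition()/heal_partition() are assumed wrappers
    # around the course's dopartition.sh script.
    def partition_test(side_a, side_b):
        kv_set(side_a, 'k', 'initial')
        make_partition()
        kv_set(side_a, 'k', 'from-a')    # conflicting writes on both sides
        kv_set(side_b, 'k', 'from-b')
        a, b = kv_get(side_a, 'k'), kv_get(side_b, 'k')
        heal_partition()
        # A service claiming full consistency and availability under
        # partition cannot let both writes succeed and then return
        # diverging reads.
        return a == b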

These are the tests that were used last year:

  • Throughput - Randomly read/write 100 keys to different servers 500-5000 times from 10 processes
  • Base consistency - Write 40 keys in order to different servers 500-1000 times. Results are consistent if there is a total order of requests: the chain of (old value, new value) pairs contains all values written (a checker for this property is sketched after this list).
  • Failure - Same as the base consistency test, but kill one node partway through. All requests are sent to the first two servers, and the fourth server is killed.
  • Manual partition - Insert a key and verify that it has the same value when read from two different servers provided on the command line. Then create a partition and update the value at servers on either side of it. Check whether reads to those servers return the same value or not. Unpartition the servers, and then read to see whether the values have converged.
  • Stress partition - The base consistency test, but partition the servers partway through and then unpartition them. Clients can see all nodes, but servers are partitioned.
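
A sketch of the total-order check used by the base consistency test, assuming every written value is distinct and each write returns the value it replaced:

    def total_order_holds(pairs, first_old, values_written):
        # pairs: (old value, new value) observed by each write. A consistent
        # history is a single chain starting at first_old that visits every
        # written value exactly once.
        nexts = dict(pairs)          # old value -> the value that replaced it
        if len(nexts) != len(pairs):
            return False             # two writes replaced the same value
        seen, cur = set(), first_old
        while cur in nexts:
            cur = nexts[cur]
            if cur in seen:
                return False         # a cycle, not a chain
            seen.add(cur)
        return seen == set(values_written)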

Scripts

  • A script for partitioning the network is available in ~cs739-1/public/projects/p1/scripts/dopartition.sh.
  • Scripts for copying files up to the virtual machines and for launching servers, named update.sh and launch.sh, are in the same directory.

What to turn in

On the first turn-in day, turn in your service for other people to test.

This should include:

  1. Your service, installed in a virtual machine
  2. Scripts for launching your service
  3. A short read-me describing salient properties of your service needed for testing

You should email the instructor a location from which other groups can download your service in a virtual machine.

On the second turn-in day, turn in:

  1. A short (few pages) description of your system, explaining your design and how you trade off consistency, availability, partition tolerance, and performance.
  2. A description of your tests and how they test for salient properties.
  3. The results of running your tests on your own service and on the services of two other groups.

It may be that your tests don't work on other services, or that other groups' services don't work with your tests. You should cooperate with the other groups to make both your services and your tests as useful as possible. When you turn in the results, please include a short description of any changes you had to make to your tests and any changes other groups had to make to their services.

Evaluation

Performance is not the primary concern of this project. We are instead interested in the ability to return consistent results under failure conditions. We will evaluate:

  1. the availability of the service: whether your service can continue to provide consistent results in the presence of failures
  2. the consistency of your service, measured by how often it returns incorrect (out-of-date) values
  3. the quality of your design (did you make good choices?) and your writeup (do you explain and motivate your design choices?).

In addition, we will consider the following secondary criteria:

  1. the scalability of the service, meaning the impact of having more copies on performance
  2. the efficiency of your service, measured by bytes/messages transmitted between servers

For the writeup, we will look for these things:

  1. Does the writeup adequately describe the project?
  2. Does the design make sense?
  3. Do the described tests adequately test the guarantees of the design?