Distributed Systems Programming

The goal for this project is to implement a reliable distributed system. You should do this project in groups of 3 or 4. The project is due on Thursday, March 25th.

Clarifying notes

March 24th:
March 23rd:
March 18th:
The client can read in the file "servers.txt" to find the addresses of servers. The client command line will take a comma-separated list of "-h hostname:port,hostname:port" parameters to indicate servers that are likely to be running. In testing, we will invoke your client program from a script. While your service should be accessible from an unintelligent client (that just sends requests to any node), you can assume that high performance or high consistency require running your client.
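As an illustration of the note above, here is a minimal sketch of how a client might collect server addresses from the `-h` parameter and from "servers.txt". The function names (`parse_host_list`, `load_servers_file`, `parse_cli`) are hypothetical, and the sketch assumes that "servers.txt" lists one `hostname:port` entry per line and that any arguments other than `-h` are handled elsewhere.

```python
import argparse


def parse_host_list(arg: str) -> list[tuple[str, int]]:
    """Turn 'host:port,host:port' into [(host, port), ...]."""
    servers = []
    for item in arg.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and stray commas
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected hostname:port, got {item!r}")
        servers.append((host, int(port)))
    return servers


def load_servers_file(path: str = "servers.txt") -> list[tuple[str, int]]:
    """Read whitespace-separated hostname:port entries (assumed layout)."""
    with open(path) as f:
        return parse_host_list(",".join(f.read().split()))


def parse_cli(argv: list[str]):
    """Return (servers from -h, remaining arguments)."""
    # argparse reserves -h for help by default; add_help=False frees it.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-h", dest="hosts", default="")
    args, rest = parser.parse_known_args(argv)
    return parse_host_list(args.hosts), rest
```

A client could fall back to `load_servers_file()` when `-h` yields no servers; which source takes priority is a design choice the project leaves to you.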
You can assume that:
Service

The service to provide is a simple key-value store, where keys and values are strings. You will use the HTTP protocol for access to the service. Your service should be as consistent as possible, so that requesting a value returns the most recently set value as often as possible. Furthermore, your service should tolerate failures of a process, node, or network. The service should run on between 1 and 4 nodes, and should continue to provide service when only some of the nodes are available. You can assume the complete set of nodes is provided when your service starts. A client should be able to request service from any machine running your service.

You can assume that at least one node will always be running, so you do not need to store data persistently (unless you want to). You can also assume that the total size of all the data will be small (less than 10 megabytes).

Fault tolerance

The goal of this work is to get experience implementing fault-tolerant services. Hence, the primary criterion for your work is whether your service can continue to provide consistent results in the presence of failures. To achieve this, you will need to implement some form of replication, which ensures that data is available even when one of the machines goes down. Here is a survey paper on replication, and a PowerPoint presentation on replication in databases. You may also want to implement some kind of failure detector, so your service knows when its peers are unavailable. Here is a bibliography of failure detectors. You may find it helpful to read about Amazon's Dynamo system.

Failures

Your service should tolerate:
Implementation

You may use any implementation language for the server and client, as long as it implements the required command line interface and protocol. For example, you can use regular sockets, RPC, or Google's protocol buffers for communication between your servers, or use the existing HTTP implementation. We will provide a simple implementation of a web server from which you may start, if you wish.

You should write a script to start/configure your server. Your server should read in a file in the current directory named "servers.txt" to learn of the other instances of the service. This file will contain the set of servers and the ports to listen on. The first line will refer to the server itself. For example:

    192.168.2.2:8081
    192.168.2.3:8081
    192.168.2.4:8081
    192.168.2.5:8081

Protocol

The protocol is extremely simple. Client messages are:

To retrieve a value, a client sends:

    GET [key] HTTP/1.1

To set a value, a client sends:

    GET [key]=[value] HTTP/1.1

The response contains HTTP headers followed by the response body:

    HTTP/1.0 200 OK
    Server: server type
    Content-Length: length
    Content-Type: text/plain

    [key]=[value]

with the old value of the key. If there is no value, [value] should be replaced with NIL:

    [key]=NIL

You can assume that:
If the server must return an error, it should return a different error code using HTTP, something like:

    HTTP/1.0 500 Internal Server Error

Clients

Standard web clients (as we learned from consistent hashing and LARD) are not good at choosing between multiple replicas. We will provide an initial code
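To make the message formats from the Protocol section concrete, here is a minimal sketch of helpers that build a request, format a success or error response, and parse a response. It is an illustration only and is not the course-provided starter code. All function names are hypothetical. The sketch writes the request target exactly as shown in the handout (without adding a leading slash or a Host header), assumes keys do not contain "=", and treats the literal body value NIL as "no value", so a stored value of "NIL" cannot be distinguished from a missing one.

```python
def build_request(key, value=None):
    """Request text for a read (value is None) or a write."""
    target = key if value is None else f"{key}={value}"
    return f"GET {target} HTTP/1.1\r\n\r\n"


def format_response(key, old_value, server="kvstore"):
    """200 response carrying '[key]=[value]', or NIL when there is no value."""
    body = f"{key}={'NIL' if old_value is None else old_value}"
    return (
        "HTTP/1.0 200 OK\r\n"
        f"Server: {server}\r\n"
        f"Content-Length: {len(body.encode())}\r\n"
        "Content-Type: text/plain\r\n"
        "\r\n"
        f"{body}"
    )


def format_error(reason="Internal Server Error"):
    """Error response with a non-200 status code and no body."""
    return f"HTTP/1.0 500 {reason}\r\n\r\n"


def parse_response(raw):
    """Return (status, key, value); key and value are None on errors."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line = head.split("\r\n", 1)[0]
    status = int(status_line.split(" ", 2)[1])
    if status != 200:
        return status, None, None
    # Split at the first '=' only, so values may themselves contain '='.
    key, _, value = body.strip().partition("=")
    return status, key, (None if value == "NIL" else value)
```

For example, `parse_response(format_response("color", None))` yields `(200, "color", None)`, which a client could then print in whatever form its output format requires.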
The client can also read in the servers.txt file from the current directory to learn about the addresses of all servers. If only a key is specified, you should return the current value for the key or NULL if there is no value. If a key and value are specified, then you should set the value and return the old value, if there is one, or NULL. If more than one host name is set, your client should choose which host to use. In testing, we will invoke your client program from a script. The output of the client should be just the data returned from the server, or an error message formatted like this:

    ERROR: error message

While your service should be accessible from an unintelligent client (that just sends requests to any node), you can assume that high performance or high consistency require running your client.

Testing platform

You should test your code using VMware or QEMU. To simulate multiple physical machines, you can run multiple virtual machines simultaneously.

Hints

Here are some standard distributed system techniques you may want to look at:
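As one illustration, the failure detector suggested under Fault tolerance can be built from heartbeats and timeouts: each server records when it last heard from each peer and suspects any peer that has been silent for too long. The sketch below is a minimal version of that idea, not a prescribed design. The class name, the 2-second default timeout, and the injectable `clock` parameter are all assumptions made for the example; how heartbeats travel between servers is left out.

```python
import time


class HeartbeatFailureDetector:
    """Suspects a peer once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, peers, timeout=2.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so tests can control time
        now = clock()
        # Give every peer a full timeout period before it can be suspected.
        self.last_seen = {peer: now for peer in peers}

    def heartbeat(self, peer):
        """Record that a message from `peer` just arrived."""
        self.last_seen[peer] = self.clock()

    def suspected(self):
        """Peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

    def alive(self):
        """Peers heard from within the timeout."""
        down = set(self.suspected())
        return sorted(peer for peer in self.last_seen if peer not in down)
```

A timeout-based detector like this can be wrong: a slow network makes a live peer look dead, so a suspected peer should be allowed to rejoin as soon as `heartbeat` is called for it again. A shorter timeout reduces downtime after a real failure but raises the chance of false suspicions.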
Evaluation

We will test the consistency of the system under heavy load to each server individually, and under the failure conditions described above. In addition, we will look at the downtime (period of unavailability) during failure.
In addition, we will consider (as secondary considerations):