
Locality-Aware Request Distribution

Locality-Aware Request Distribution, Vivek Pai, Gaurav Banga, et al., ASPLOS-VIII

Review for this or consistent hashing due Tuesday, 1/25

Comments

Summary: This paper proposes a content-based load distribution mechanism for cluster-based network servers which achieves high cache hit rates and even load distribution.

Problem: For scalability and throughput reasons, high-volume websites are designed as multi-node clusters in which a client-facing load-balancer node (the front-end) distributes requests to a set of back-end nodes. Traditionally, request distribution mechanisms have been load-based; such mechanisms have poor cache-hit rates and are heavily disk-bound.

Paper Contributions: This paper proposes content-based request distribution (as opposed to load-based). If we could forward requests to specific backends based on request attributes, we could have better cache hit rates, plus other advantages like easier specialization of back-end nodes and scalable secondary storage.

LARD: Specifically, the authors prototype and analyze a variant of this idea - locality-aware request distribution (LARD) - which aims to exploit the cache-hit advantage. Different request types are associated (using a hash table) with different backend nodes based on request attributes (say, URL or MIME type). The front-end dispatcher forwards requests to back-end nodes based on this association.

LARD Implementation challenges: This policy, however, could run into 2 problems: 1. Load imbalance when a few types of requests are requested much more frequently than others. 2. The front-end work of inspecting requests, handing off connections, and forwarding packets in the dispatch phase could become a bottleneck.

To counter the first problem, the dispatcher keeps track of the load on the backends (number of open connections), and if the disparity in load between nodes is too large, a less-loaded node is chosen to serve new requests of that type. Also, to keep the load from getting out of hand, the dispatcher restricts the total number of connections. It wasn't exactly clear to me how the choice of this upper limit affects the system and ensures even load distribution. As for the second problem, the authors argue that a low-overhead handoff solution is achievable and prove the point by briefly describing the low-overhead TCP handoff technique they used in their prototype.

The authors show how to extend this mechanism to support replication (LARD/R), i.e. a setup where more than one node can serve requests of the same type; the hash table then maps a target to a serverSet (an array of servers) rather than just one server.

Evaluation: The authors evaluate the strategy using a simulation model and a prototype. Access logs from Rice University and the IBM website were used as representative real-world workloads. LARD is compared to other request distribution mechanisms such as WRR (with and without GMS), LB, and their variants. Metrics such as effective throughput, overload delay, and cache-miss ratios are studied.

The simulation results show LARD outperforming every other mechanism on every metric. Even in cases highly favorable to WRR (very high demand for very few resources), LARD's performance comes very close to that of WRR. Thus, LARD combines the best of both worlds: high cache reuse and even load distribution. Results from the prototype indicate that the handoff overhead is, as the authors claim, small and does not become a bottleneck. Other prototype results are also in agreement with the simulation results.

Closing thoughts: We can see how far the idea has carried into the present day - the dichotomous storage of dynamic server content and static resources (images, CSS, Javascript etc.), CDNs, streaming servers (specialization). Thus, the problem can be seen as relevant even today to some extent. However, most of the Web is dynamic these days. LARD, although not applicable directly, might have motivated ideas such as deep-packet classification (and intelligent routing), request redirection, memcached (extensive caching), etc.

The authors of this paper develop a locality-aware request distribution (LARD) scheme for a server system in which a front-end server routes requests to multiple back-end servers for processing in a manner that minimizes the working set for each back-end and therefore maximizes the ratio of cache hits to misses.

A primary goal of any server administrator is to provide maximum availability with the minimum amount of hardware, and the LARD system was designed to be a lightweight but effective method of furthering this goal. The authors recognized that a content-agnostic request distribution scheme (such as the weighted round robin (WRR) algorithm that was a typical choice at the time) was missing an opportunity for greater cache utilization that could be achieved by routing duplicate requests to the same back-end server each time. They set out to take advantage of this possibility while retaining most of the benefits of a typical load-balancing scheme.

Pai et al. provide descriptions and justification for the LARD scheme and an extension, LARD with replication. Basic LARD assigns each request identifier (typically a URI) to one back-end server such that each time the object is requested, that specific server handles the processing. This has the effect of partitioning the working set for the system over the back-ends, whereas with a WRR scheme each back-end server is subject to the entire system's working set. They also develop a reasonable rebalancing formula that allows reassignment of requests in order to prevent one back-end from being overwhelmed in case it has been assigned a set of particularly popular requests. LARD with replication provides support for assigning multiple back-ends to a request that is frequent and/or resource intensive enough to warrant it.

The most glaring deficiency of this paper is the reliance on simulations with suspect assumptions rather than real-world testing. Specifically, they state that "the front-end is assumed to have no overhead and all networks have infinite capacity," which implies that the simulations could ignore bottlenecks that may limit the effectiveness of their scheme in practice. They limit the simulation inputs (requests with which to stress the server cluster) to the access logs of two particular websites, which may or may not be representative of a typical load. In the end, the experimenters do set up and test with a working prototype, but this is limited to one particular hardware arrangement, and is subjected to only one set of test requests.

Despite the lack of real testing, it is reasonable to speculate that the LARD system could provide great efficiency benefits for a web server backed by a cluster of processors. They seem to be primarily focused on the utility of their new methods with respect to a high-traffic web server cluster, and they present a very plausible case that LARD can make such a system work more smoothly with very little overhead or additional complexity.

Summary
In this paper the authors design a system for a front end server to distribute requests among back-end servers. The requests are partitioned by the name of the resource requested, with an additional mechanism to redistribute load away from heavily loaded back-end servers.

Problem
This paper looks at how to handle load balancing in a cluster of servers. It is assumed that:
- There are enough requests to warrant distributing the load across a cluster of back-end servers.
- Every back-end server is capable of individually handling a request for any resource.
- The back-end servers cache requested resources in local memory.
- The size of the working set exceeds the memory cache of an individual back-end server (the total size of the actively requested content would not fit in the memory of a single server).
Since a back-end server cannot store all the resources in memory, cache misses on these servers will result in longer delays as the resource is fetched from a slower storage. The paper assumes the resources are stored at each back-end server on a local disk, but the problem may be similar for resources stored on a network filesystem or database.

Contributions
Round robin style load balancing would result in high cache miss rates on the back-end servers. To reduce these misses, the authors propose a system where a front-end server distributes requests by the name of the resource requested. The front-end keeps a mapping of names to back-end servers. This strategy alone would result in hot pages overloading their respective back-end servers. To solve this problem, the authors have the front-end estimate load by tracking the number of outstanding requests, and if a server is overloaded, further requests are redirected to a new server. Finally, the front-end server needs to be able to handoff connections to back-end servers after it has determined the name of the requested resource. For this the authors implemented an efficient handoff extension to the TCP protocol, and used this modified TCP on the front-end and back-end servers.

Discussion
I think some of the techniques the authors used may be applicable today, but I am not sure that the specific problem the authors describe is very relevant. Large websites today probably would not have a cluster of equally capable back-end servers, but would rather distribute the data across many servers. Further, many websites today are very dynamic and have interactive user sessions which makes caching more complicated. Perhaps if the front-end took into account sessions and static/dynamic content it would be more applicable.
The authors’ TCP handoff mechanism is useful and I could see similar systems being used today. Sometimes proxies are placed in front of servers for security. The front-end is similar to a proxy, but handoff offers better performance since only the incoming packets are examined and forwarded; the outgoing packets go directly from the back-end server to the client.

This paper presents LARD (locality-aware request distribution), a strategy for distributing requests among a number of backend servers which aims to balance load while simultaneously achieving high hit rates for main memory caches on backend servers by ensuring a high degree of locality for requests for a particular target. The performance of LARD is compared with weighted round-robin (designed for load balancing) and locality-based request distribution (designed for high cache hit rates).

The environment considered by the authors is a cluster serving requests for static data. Such a cluster consists of a single front-end node which accepts requests from clients and hands them off to one of several back-end servers to be processed. For simplicity, the authors assume that the back-end servers are identical, and each is capable of servicing any incoming request (i.e. each server has the entire system's data on secondary storage, but not necessarily in main memory), although it is noted that content-based request distribution does enable scalability to larger volumes of data by partitioning the database amongst servers, as well as specialization of servers to serve particular types of content (multimedia streams, for example).

The focus of the paper is the process by which the front-end node decides which back-end should process a particular request. The strategy for making these decisions should, on the one hand, keep any back-end nodes from becoming overloaded, as an overloaded node will take longer to serve any new requests and drag down the overall throughput of the cluster. On the other hand, high cache-hit rates for the back-end servers are very desirable, since requests can be served from a cache in main memory much faster than if they require a disk read. Practically, achieving high cache hit rates requires that a given server handles requests only for a subset of targets which can fit in the server's main memory cache.

A weighted round-robin strategy, where requests are distributed to back-ends in round-robin fashion, weighted by the load on each back-end, achieves good load-balancing but has no consideration of locality. If the working set is larger than the cache size of an individual back-end, frequent cache misses will occur.
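
As a rough illustration (my own sketch, not code from the paper; the back-end names and weights are made up), a weighted round-robin dispatcher simply cycles through the back-ends in proportion to their weights and never looks at the request content:

    import itertools

    def weighted_round_robin(backends, weights):
        # Repeat each back-end in proportion to its weight (e.g. its capacity),
        # then cycle through the expanded list forever.
        expanded = [b for b, w in zip(backends, weights) for _ in range(w)]
        return itertools.cycle(expanded)

    # Hypothetical usage: "be1" has twice the capacity of the others.
    dispatch = weighted_round_robin(["be1", "be2", "be3"], [2, 1, 1])
    for _ in range(4):
        print(next(dispatch))   # be1, be1, be2, be3 -- the target is never consulted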

A locality-based strategy, where requests are distributed to backends based on some static (usually hash-based) mapping, partitions the name space relatively evenly among back-ends, leading to good cache use if the partitions are smaller than the back-end cache size. Such a partitioning strategy may be a poor choice for load balancing, however, as a small set of targets may account for a large fraction of requests, in which case the back-ends which serve those targets will be disproportionately loaded.
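
A purely locality-based front-end, by contrast, might look like the following sketch (again my own illustration with hypothetical names): the target alone determines the back-end, so a hot target pins all of its traffic to a single node.

    import hashlib

    def locality_based(target, backends):
        # Statically hash the target (e.g. the URL path) to one back-end.
        digest = hashlib.md5(target.encode()).digest()
        return backends[int.from_bytes(digest[:4], "big") % len(backends)]

    backends = ["be1", "be2", "be3", "be4"]
    print(locality_based("/index.html", backends))   # always the same node for this URL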

LARD aims to combine the advantages of both approaches. A basic LARD front-end maintains a one-to-one mapping of targets to back-end nodes. For each incoming request, the front-end first checks if a mapping of the request's target to a back-end server exists. If not, then that target is assigned to the least-loaded back-end (load is measured as the number of active connections). If a mapping already exists, then the front end checks whether the load on the assigned server exceeds a "high load" threshold while another server is below a "low load" threshold, or whether the assigned server exceeds twice the high load threshold regardless of the load on other nodes. If so, the target is reassigned to the least loaded node. The front-end limits the number of admitted connections to keep the load on all nodes from rising to twice the high threshold, in which case LARD would degenerate and begin behaving like weighted round-robin. At the same time, enough connections are admitted so that it is possible for all nodes to at least exceed the low load threshold, thus keeping all nodes utilized and allowing a limited amount of imbalance to prevent unnecessary target reassignments. It is shown that the maximum load imbalance between two nodes is (2*High - Low). The authors discuss tuning of the high and low threshold for good performance.
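
The following sketch captures the basic LARD decision described above (my own rendering in Python; the class name, threshold values, and bookkeeping are illustrative, and the front-end's limit on total admitted connections is omitted):

    T_LOW, T_HIGH = 25, 65   # illustrative thresholds, not the paper's tuned values

    class LardFrontEnd:
        def __init__(self, backends):
            self.load = {b: 0 for b in backends}   # active connections per back-end
            self.assigned = {}                     # target -> back-end

        def dispatch(self, target):
            least = min(self.load, key=self.load.get)
            node = self.assigned.get(target)
            if node is None:
                node = least                               # first request for this target
            elif (self.load[node] > T_HIGH and self.load[least] < T_LOW) \
                    or self.load[node] >= 2 * T_HIGH:
                node = least                               # relieve an overloaded node
            self.assigned[target] = node
            self.load[node] += 1                           # decremented when the connection closes
            return node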

Because it is possible that a single frequently-requested target can generate enough traffic to overload a single node, the authors propose an extended version of LARD with replication. In LARD with replication (LARD/R), the front end maintains a mapping from targets to sets of nodes (so that a single target may be served by multiple back-ends). Requests for a target are assigned to the least loaded node in the target's server set. If a load imbalance occurs, then the least loaded node in the cluster is added to the set. On the other hand, if a certain interval (20 seconds in the LARD prototype) has passed without the server set for the target changing, then the most loaded node in the server set is removed, thus reducing the degree of replication for a target when it is not being requested often.
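
LARD/R can be sketched in the same style (again my own illustrative code, reusing the hypothetical thresholds above; the 20-second interval is the one mentioned for the prototype):

    import time

    T_LOW, T_HIGH, STABLE_SECS = 25, 65, 20

    class LardRFrontEnd:
        def __init__(self, backends):
            self.load = {b: 0 for b in backends}
            self.server_set = {}    # target -> list of replica back-ends
            self.last_change = {}   # target -> time the server set last changed

        def dispatch(self, target):
            now = time.monotonic()
            cluster_least = min(self.load, key=self.load.get)
            sset = self.server_set.setdefault(target, [cluster_least])
            self.last_change.setdefault(target, now)

            # Shrink the replica set if it has been stable for a while.
            if len(sset) > 1 and now - self.last_change[target] > STABLE_SECS:
                sset.remove(max(sset, key=self.load.get))
                self.last_change[target] = now

            node = min(sset, key=self.load.get)            # least-loaded replica
            if (self.load[node] > T_HIGH and self.load[cluster_least] < T_LOW) \
                    or self.load[node] >= 2 * T_HIGH:
                if cluster_least not in sset:              # grow the set under imbalance
                    sset.append(cluster_least)
                    self.last_change[target] = now
                node = cluster_least

            self.load[node] += 1
            return node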

The authors constructed a simulation of a web server cluster with a single front-end and a configurable number of back-ends, each of which consists of a CPU, locally attached disks, and a memory cache with a greedy-dual-size replacement policy. The inputs to the simulated cluster were traces from multiple departmental web servers at Rice University, and from IBM's main public web server. It is worth noting that, while both the Rice and IBM traces had similar numbers of targets and a similar overall dataset size, the IBM trace has much higher locality. The simulator was used to calculate the overall request throughput, cache hit rate, and node underutilization time. The request distribution strategies evaluated in the simulator were weighted round-robin (+ WRR with a global memory system), locality-based distribution (+ an idealized LB with a global cache), basic LARD, and LARD with replication.
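
For reference, here is a toy version of the greedy-dual-size replacement policy mentioned above (my own sketch of the general GDS idea, not the simulator's code): each cached object gets a value H = L + cost/size, the lowest-valued object is evicted, and L is inflated to the victim's H so that old objects age out over time.

    class GreedyDualSizeCache:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.L = 0.0
            self.entries = {}   # key -> (H value, size)

        def access(self, key, size, cost=1.0):
            # Toy model: assumes any single object fits in the cache.
            hit = key in self.entries
            if hit:
                self.used -= self.entries.pop(key)[1]
            # Evict lowest-H objects until the (re)inserted object fits.
            while self.used + size > self.capacity and self.entries:
                victim = min(self.entries, key=lambda k: self.entries[k][0])
                self.L = self.entries[victim][0]           # inflate the "clock"
                self.used -= self.entries.pop(victim)[1]
            self.entries[key] = (self.L + cost / size, size)
            self.used += size
            return hit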

Results from the simulator show that LARD and LARD/R are nearly as good as WRR for load balancing while achieving cache miss ratios close to those obtained for LB with global cache. The two variants of LARD achieved higher throughput than the other strategies for all cluster sizes above one node, with the system becoming essentially CPU bound (with linear speedup vs. number of nodes) for 8 or more nodes. Increasing the CPU and memory of the simulated nodes produced large performance gains for LARD and LARD/R, while they produced very little change for WRR, which is primarily disk-bound.

To experimentally verify the results of the simulation, a real cluster with one front-end and six back-ends was constructed. Performance measurements for each strategy were similar to the predictions from the simulator, with LARD/R achieving the best overall throughput and scaleup. The load on the front-end led the authors to estimate that a front-end with the same CPU as the back-ends could serve about 10 back-ends. However, most of the processing time on the front-end was spent doing connection handling and forwarding, which is independent for each connection and can be easily parallelized amongst multiple CPUs in an SMP configuration, allowing an SMP frontend to serve even more back-ends.

As part of the implementation of a LARD front-end, the authors also developed a mechanism for transparently handing off a TCP connection from the front-end to a back-end which ultimately services a request. Packet forwarding is very fast, and is done at the level of a network interface interrupt handler on the front-end.

LARD seems to be an effective strategy for achieving the benefits of caching and load balancing simultaneously when serving static content. The authors only briefly mention dynamic content, however, which could pose a more significant problem for locality. When content is dynamically generated in response to a request, it is difficult to know what stored data will be accessed without actually processing the request.

Summary:
This paper describes a mechanism for distributing requests made to a cluster that centers on improving cache locality. By sending specific requests to specific nodes, it aims to minimize the amount of disk overhead encountered in processing these requests, improving overall throughput.

Problem:
At the time of the paper’s writing, most cluster-based network servers only considered overall load-balance when handling requests, generally assigning incoming traffic to the least loaded server in a weighted round robin fashion without regard for the type of request being made. This sort of strategy tends to result in a high number of cache misses in the back-end, forcing the machine to access its disk to service the request. Because disk speed evolves noticeably more slowly than processor speed, this places a strict bottleneck on potential performance.

Contributions:
The paper introduces LARD, locality aware request distribution, in an attempt to solve the aforementioned problem. Essentially, LARD attempts to assign requests of the same type to the same back-end machine; it generally only reassigns the request to a different node if the current node is overloaded and an underloaded node exists to receive the traffic. The paper elaborates upon this by introducing replication, where a request gets mapped to a set of servers that increases or decreases in size based on the load it experiences. LARD also employs a TCP handoff protocol that allows the front end to examine the nature of the request and forward the connection to the back-end, transparently to the client.

Given this algorithm, the authors then run a set of simulation and real-world experiments to verify their results, using traces from Rice University and IBM to generate their loads. Both the simulations and real-world tests show that LARD has a higher throughput than the other algorithms they examined, while maintaining a low cache miss rate and low CPU idle time.

Flaws:
As others have noted, the paper’s reliance on simulation does weaken it; the actual testing is fairly perfunctory compared to the simulation studies, and it is slightly unclear just how much the two agree. The paper also does not make it clear whether it accounts for the TCP hand-off in the simulation--despite their claims to the contrary, the 194 usec hand-off latency does seem significant compared to the 145 usec connection time--and it fails to substantiate its claims about scalability beyond ten back-end computers.

Applications:
Assuming that the qualms noted above do not pose problems, this seems like a useful means of refining load-balancing techniques among servers, both in the web and for general distributed computing purposes. The increasing prevalence of commodity servers makes data locality increasingly important, given relatively small amounts of memory and the potential need to request data from other servers, incurring even greater latencies than disk access would cause. Google’s MapReduce tries to maximize data locality when apportioning its tasks to minimize data transmitted over the network; in many ways, this is similar to LARD’s efforts to maintain cache locality.

Summary

This paper proposes a content-based request distribution strategy, namely, locality-aware request distribution (LARD), for cluster-based network servers, to provide two seemingly conflicting properties: good locality and good load balancing.

Problem

This paper considers the cluster-based network environment that has a front end to distribute requests to many backend servers. There are two problems to solve:
Problem 1: How to simultaneously achieve good locality and load balancing on back end servers? There are two baseline solutions. First, round robin distribution has good load balancing property, but poor locality. Second, locality-based distribution has good locality, but poor load balancing property.
Problem 2: How to come up with a protocol that transparently hands off established client connections between the front end and the back end? The hand-off should be transparent to clients.

Contribution

To solve Problem 1, this paper first proposes a basic locality-aware request distribution (LARD) strategy. LARD is a form of content-based distribution, so it achieves good locality. LARD balances load by using information about the load on backend servers, and shifting load from over-utilized servers to underutilized servers. This basic LARD does a good job of balancing requests for different targets, but it is not good at balancing a single target, because a target can become a hot spot and overwhelm the backend server holding it. Therefore, LARD is enhanced with replication.

To solve Problem 2, this paper proposes a TCP handoff protocol that is transparent not only to clients but also to the server applications on the backend. Once a connection is handed off to a backend server, the front end forwards incoming traffic on that connection via a forwarding module at the bottom of its protocol stack. This protocol incurs little overhead, because it creates or destroys a TCP connection at the back-end without going through the TCP three-way handshake.
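
As a very rough illustration of the control flow (a toy model with made-up names, not real TCP or the paper's kernel code): the front end accepts the connection and reads the request, picks a back end from the content, hands over the connection state without a new three-way handshake with the client, and afterwards only relays the client's ACKs while response data flows from the back end straight to the client.

    def backend_take_over(backend, conn_state, request):
        # The back end adopts the already-established connection state.
        print(f"{backend}: adopted connection from {conn_state['client']}, serving {request['target']}")

    def front_end_accept(request, choose_backend):
        conn_state = {"client": request["client"], "seq": 1000}   # stand-in for real TCP state
        backend = choose_backend(request["target"])               # content-based choice
        backend_take_over(backend, conn_state, request)           # handoff, no new handshake with the client
        return backend

    def forward_ack(backend, ack):
        # The front end's forwarding module passes client ACKs straight through.
        print(f"front-end -> {backend}: relayed ACK {ack}")

    backend = front_end_accept({"client": "10.0.0.7", "target": "/index.html"},
                               choose_backend=lambda target: "be2")
    forward_ack(backend, ack=1001)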

Flaws

1. I would like to see what's going on when removing the assumption that the "front-end and networks are fast enough not to limit the cluster’s performance".
2. What if the front-end fails?
3. I also want to see some discussion of scalability and fault tolerance on the backend -- maybe the cluster is small enough (6 in the experiment) that failure is an exception rather than the norm ...

Relevance

This work makes me think of MapReduce:
- MapReduce has a master, while LARD has a front end.
- MapReduce has some slave workers to run map tasks and reduce tasks, while LARD has many backend servers to respond to requests for targets.
- MapReduce uses hashing to partition the output of map workers, then distributes to reduce workers, while LARD could also use hashing for content-based partitioning.
- Given a large amount of map tasks and reduce tasks, MapReduce dynamically distributes them to available worker machines, while LARD dynamically reassigns targets from overloaded backend servers to underutilized servers.
- MapReduce essentially handles static content stored in files, and users have the chance to customize (transform/filter) that static content. LARD handles static content.

Summary:
This paper designs a content-based request distribution strategy, which distributes requests considering not only the workload of each back-end node but also the content of the requests. This strategy can improve the performance of cluster-based network servers, since it improves hit rates in the back-ends’ main memory caches.

Problem:
In traditional cluster-based network servers, the front-end distributes requests considering only the current load of each back-end node, so hit rates in the main memory of the back-end nodes will be very low. This requires more disk IO and lowers system performance. But by naively distributing incoming requests in a content-based manner, the load between different back-ends might become unbalanced, resulting in worse performance. This paper tries to find a strategy which can achieve both load balance and a high hit rate.

Another problem this paper tries to solve is to design a handoff protocol which is transparent to both clients and server applications, so no changes are needed on the clients or in the back-end server applications.

Contribution:
The contribution of this paper includes a strategy which combines the advantages of both content-based request distribution and load-based request distribution, and a TCP handoff protocol. The authors did a simulation evaluation of the LARD strategy and a performance evaluation on a prototype LARD server cluster.

Flaws:
1. The authors did not discuss the caching strategy of main memory on the back-end nodes. Memory size, replacement strategy, and other features are all very important to the performance of the LARD strategy, and the constants in LARD should arguably be set based on these features of main memory. Maybe a new version of the LARD strategy should include the design of the main-memory caching strategy on the back-end nodes.

2. The size of the requested file is not considered in LARD. When a requested file is larger than main memory, the LARD strategy will perform worse. I prefer what GFS does, where files are divided into 64MB blocks and requests are redirected to blocks. This idea could be used by LARD to handle large files.

3. The authors use the number of connections to estimate the workload of each back-end node. This is an easy way, but it may not be precise.

4. Each back-end node holds all the data the cluster serves, so the utilization of the disks on the back-end nodes is very low.

Why applicable in real systems
Some basic ideas of system design are reflected in this paper: exploiting request locality to make better use of caches, reusing the results of expensive operations like disk IO, keeping the middle layer transparent, and making trade-offs to fulfill the needs of each aspect.

Problem
Aggregating commodity workstations into clusters can be an effective way to obtain performance at a low cost. This is an approach that is often employed with network servers, where a front-end device accepts requests and distributes the processing to a number of back-end nodes. Obtaining peak performance on this type of system is highly dependent on the mapping of requests to back-end devices.

One reason this mapping is important is that request latency and throughput will diminish if certain nodes are overloaded. Existing techniques are already adept at tackling this problem. For instance, Weighted Round Robin (WRR) will distribute requests evenly to nodes weighted by some measure of their available resources. This technique, however, is ignorant of the tendency of related tasks to share common resources. The authors exploit this phenomenon in creating their strategy called Locality-Aware Request Distribution (LARD).

Solutions
The key insight behind LARD is that if multiple connections require similar files on disk, the data may be cached in main memory and accessed more quickly. The goal is to balance placing jobs of the same type (target) on the same node for locality against spreading jobs out for load balance. The basic proposed mechanism is that each target is assigned a back-end server where jobs of that type are processed, and this server is swapped for a lower-utilization node when the perceived load is too high. This idea is extended to handle sets of nodes for each target, by adding nodes to the set when perceived load is high, then removing them after some timeout. This prevents a highly prevalent target from repeatedly overloading servers one at a time.

Another issue is that front-ends must establish a connection with clients before being able to view the contents of the request. To enable a client-transparent redirection of the connection to a back-end server, the authors propose a TCP handoff protocol. This protocol is used in their prototype system.

Evaluation
The authors first evaluate their techniques using a custom simulation environment. The front end is modeled as a single wait queue, and the back end has a queue each for CPU and disk accesses, and also maintains a disk cache. Inputs to the simulation are derived from webserver traces from IBM and Rice University. The LARD policy is compared against the aforementioned WRR, and also against blind locality based (LB) techniques (which have no load balancing). As expected, the LARD techniques have the best throughput because LB leaves many nodes underutilized, and WRR can’t achieve as good cache behavior.

Also, the LARD techniques were implemented for an apache-based webserver. Here, LARD was found to be 2.5x faster than WRR at 6 nodes.

Thoughts/Criticism
There certainly seems to be room for improving the heuristics of the dispatcher. The measure of load is simply the number of active connections, which may be a weak assumption in the general case. Also, there appear to be no policies in place to favor disjoint server sets. This seems like it would be another way of enforcing locality and good cache behavior.

The simulation environment is somewhat ad hoc, and therefore should be taken with some reservations. In general, if a simulation model only takes into account the parts of the system which are being exploited, the results are going to be skewed in favor of the technique in question. One possibly problematic assumption is that the front-end is fast enough to handle the forwarding of all incoming data to every back-end. This might not be true, given a different workload.

Whatever reservations we might have about the evaluation, this paper has conveyed an important message: load balancing is not the only means of obtaining efficiency; effective use of resources must be considered as well.

Summary:
This paper proposed a content-based request distribution scheme, in which request distribution depends not only on the load information of the back-end nodes but also on the content of the target (requested object), to increase locality and thus cache hit rate. The authors also introduced a TCP handoff protocol to enable the front-end inspection of the request content required by this content-based request distribution scheme.

Problem trying to solve:
The problem this paper is trying to solve is that the traditional weighted round-robin request distribution doesn’t utilize the caches of the back-end nodes effectively. Being blind to the content of the target, traditional load-based request distribution would dispatch requests that ask for the same target to different, essentially random servers, thus requiring that the whole working set fit into each node’s memory to perform well. Obviously this doesn’t scale when we have larger and larger content sets being served by a single server, as the authors claim.

Contributions of the paper:
1. The authors proposed the LARD (locality-aware request distribution) and LARD/R (LARD with replication) strategies to address the problem of achieving both load balancing and locality awareness. The strategy is simple and doesn’t require any communication with, or modeling of, the backend nodes.
2. In the evaluation section, the authors showed the potential performance improvement of content-based request distribution using both trace-driven simulation and a prototype evaluation.
3. The authors also proposed a TCP handoff protocol that enables content-based request distribution by providing client-transparent connection handoff for TCP-based network services.

Flaws:
1. Most of the evaluation is based on trace-driven simulation, not real experimental results, which seems even more problematic when they assume that the front-end has no overhead and all networks have infinite capacity.
2. The paper doesn’t discuss how much overhead is added to the front-end, even though the front-end is likely to be the bottleneck in this case. Although the authors do mention that an SMP-based front-end could scale better, this is not backed by any evidence.
3. The authors’ assumption that the working set one server needs to serve will keep getting bigger and bigger seems a bit artificial to me.

Relevance:
1. Today’s web is dynamic rather than static. Though this content based distribution idea might still be useful, it’s not directly applicable.
2. I am not quite sure whether the architecture this paper is assuming about webserver (one front end directly connected to multiple backend) is still the norm today. I would suspect that today the contents would be geographically distributed, or even in the cloud, making the connection-handoff based mechanism not useful anymore. Of course, we could still consider content based request distribution, but maybe in a different scenario.


Summary:

The paper describes a request distribution mechanism for cluster-based network servers which distributes incoming requests such that they achieve high locality in the back-end servers' main memory, minimizing the instances, and thereby the associated overhead, of fetching requested data from persistent storage. This is termed Locality-Aware Request Distribution (LARD). While achieving a high hit rate, load balancing is shown to be not hard to achieve.

Problem:

Prior to this work, request distribution in network servers was primarily, if not solely, driven by load balancing goals, with the instantaneous load on back-end nodes being used to decide which server would process a request. This strategy totally disregarded the amount of work put in by a server in processing a request. It hampered significant improvements in response time as well as throughput of the system, since every server was an equally probable target for any request, and more often than not a request would cause the system to start processing that request from scratch rather than taking advantage of the cache that was built up in the memory of the server that recently served that request. Thus, a server's in-memory cache was a potential resource which was still to be exploited.

Contributions:

Since LARD only uses the number of outstanding requests on a server to estimate its load and requires no extra communication or elaborate state information at the front end, it shows that a distributed system does not have to be very complex to centralize its service distribution task (although it lacks a clear global view of the system).

LARD shows how load balancing can be combined with high locality. LARD depicts how in-memory caches at servers can be exploited to improve throughput by spending less time fetching the requested data from persistent storage. Also, since no static assignment of requests to back-end servers is enforced to achieve the desired results, the system remains flexible and optimally takes advantage of the available computational power. While assigning requests to back-end servers, LARD ensures that there is no load imbalance in the system. In case of load imbalance, requests get reassigned to other lightly loaded servers. This load balancing in the face of high locality is critical to achieve, since a single hot request can swamp a server and bring it down. The authors describe a model that uses the number of outstanding requests on the servers to trade off between the delay that would be experienced if the server is heavily loaded and the response time that would be seen if the request is reassigned to another server.

Holding a single server responsible for serving a particular request at any point in time can have conflicting effects on load balancing and high locality (e.g. when a single server can't serve all the clients of a particular request). To address this, LARD/R maintains a set of servers which can serve a particular request. Thus, in case of very high load, a particular request will be directed to a chosen set of servers. This continues to provide high locality while maintaining load balance in the system.

With Weighted Round Robin (WRR) request distribution, requests benefit from a server's in-memory cache only as long as the size of the working set does not exceed the memory available in each back-end server. But since LARD attempts to improve the cache hit rate at each back-end server, the effective cache size of the system is the collective memory of all back-end servers. LARD achieves the claimed high throughput and improved response time consistently as long as the size of the working set is under the aggregate memory of the back-end servers.
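
As a rough illustration (the numbers here are hypothetical, not from the paper): with eight back-ends holding 32 MB of cache each, LARD's effective cache approaches 8 x 32 = 256 MB because each node caches a different slice of the working set, whereas under WRR every node tends to cache the same popular targets, so the effective cache stays close to a single node's 32 MB.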

The paper also describes a TCP connection hand-off protocol to transparently forward client's requests to back end servers after using the request content to take request distribution decision.

Flaws:

One major flaw in this design is limited scalability. The front end is the bottleneck in making request distribution decisions efficiently. If the back-end servers are plentiful as well as powerful, and the potential request domain is large, then the state information that needs to be kept at the front end grows significantly. This places an upper limit on how big a system can grow or how large the request domain is allowed to be. Although this problem can be mitigated by adding another level of indirection and having another request distribution layer in front of the front end, the proposed model does not handle the aforementioned scalability issue.

The load estimation model, which uses the number of outstanding connections to estimate server load, assumes that all servers have equal capacity to handle load. While this can be enforced in a system, it fails to be generic. One more thing this model assumes is that all requests impart the same amount of load. This assumption is unrealistic and hence highly arguable.

Applications:
Optimal request distribution is certainly a very desirable property in systems where the penalty for suboptimal distribution is intolerable. Web services are a popular example of such systems, where optimal request distribution can achieve significant latency reduction. The need for Content Distribution Networks and memcached clearly shows that optimizing the request distribution strategy is very critical in web services. A distributed file system could also benefit from LARD.

Summary:
This paper describes a content-based request distribution scheme called LARD for cluster-based network servers that considers both the content requested and the current load on the back-end nodes while deciding which node should service a particular incoming request.

Problem:
A typical load-balancing system (front end) looks only at the current load on the back-end nodes and decides which node (the lightest-loaded one) will get to service a request. However, merely looking at this criterion can lead to cache misses, as even the entire working set may be new to the node. This causes additional requests to the disk, and hence the time to process requests increases many-fold, resulting in undesirable throughput.

The authors of the paper suggest that in addition to the current load on the back-end nodes, one should also look at the locality of the content when distributing requests, so that the system can limit disk accesses through increased hit rates in the back-ends' main memory. Thus, the overall throughput of the system is expected to go up drastically.

Contributions of the paper:
The paper presents a strategy called LARD (Locality-aware request distribution) that intends to improve cache hit rates in the back ends. There are 2 clear-cut goals: a) to improve load balance, b) to strive for better locality in main memory. If the system were to use a weighted round-robin distribution scheme, it would ensure good load balancing, but each back-end node would be equally likely to see more or less the identical working set, and if that is greater than the main memory available for caching documents, cache misses will occur. To improve locality in the back-end caches, a hash-based partitioning distributes the working set evenly among the nodes.

The authors describe a basic LARD strategy wherein the front end maintains an entry for each target and assigns a server to it. If the load on the assigned server (number of connections being serviced) crosses a threshold, the lightest-loaded node that is below the idle threshold is selected instead; and if the load is twice that threshold, where processing delays are likely to be seen, the lightest-loaded node is selected even if it is not below the idle threshold. Also, to prevent LARD from behaving like weighted round robin, the total number of active connections to the cluster is limited.

As a small extension to the LARD strategy, to prevent overload on a single back-end node for a particular target, a set of servers is assigned to each target, and depending on the need, servers are added to and removed from that set.


Deficiencies:
In my opinion, the largest negative aspect of the paper lies in its evaluation. They have demonstrated the efficiency of the algorithm through SIMULATION. They did not attempt to convince us of the correctness of the simulation model and how well it "simulates" the real-world scenario. Also, they have illustrated the performance for two specific input traces, which raises some serious doubts; the performance of the LARD strategy has to be evaluated on more diverse workloads. The numbers in the graphs don't really mean anything to us. A real experiment with a cluster of just 7 clients and 6 back ends seems just not enough. At the least they could have argued that this experiment is sufficient for quantifying the performance of LARD. There was definitely more scope for better evaluation. Also, with many requests being served over a single TCP connection (persistent HTTP) being a normal scenario, the TCP handoff described in the paper does not deal with it at all.

Lessons Learnt:
Load balancing is an integral part of any contemporary distributed system, from Facebook to Amazon. It is very evident that disk requests can typically stall today's scalable systems. So a load-balancer design that incorporates locality-aware distribution, so that there are more hits in main memory, can make the whole system CPU-bound rather than disk-bound. By limiting disk accesses and improving cache hits in main memory while balancing the load among back-end nodes, LARD is indeed a great lesson to the designers of modern distributed systems in improving the overall throughput of today's systems of great scale.

Summary:

This paper proposes a content-based, locality-aware request distribution (LARD) algorithm for cluster-based network servers to achieve both good load balancing and good performance (cache hit ratio, throughput, etc.). It also presents an efficient TCP handoff protocol to provide client transparency during request distribution.


Problems:

1. How to distribute requests to the back-end servers with both good load balancing and good performance?

2. How to implement the request re-distribution framework transparently and efficiently?


Contributions:

1. Clearly presents the problems to be solved, the cluster-based network server framework, and where the challenges lie.

2. Proposes new algorithms, LARD and LARD with replication, to provide both load balancing and better performance than WRR and other related algorithms.

3. Discusses the motivation for TCP handoff and a simple framework for how TCP handoff works.

4. Conducts both real-world trace simulations and prototype experiments. The experiments show that LARD is better than other state-of-the-art algorithms.


Flaws:

1. The LARD algorithm is a heuristic. However, the authors did not investigate the parameter space in their experiments. For example, they did not even try to vary T_high and T_low in any of their experiments. It is hard for me to believe that the values of T_high and T_low they chose are optimal for all workloads.

2. The LARD framework is a standard queuing system. Utilizing queuing and control theory might generate better heuristics or optimal algorithms. It is a pity that the authors did not explore or discuss this direction.

3. They assume that front-end processing has no overhead. Ironically, their experiments show that the front end needs to handle an extensive packet-forwarding workload (in Section 5, to fully utilize a 100Mb/s network, a single back-end node will generate 4128 acknowledgment packets per second for the front-end).


Applicable:

LARD is not very applicable to the current datacenter-style environment.

1. The network server framework is significantly different now. For example, Facebook's photo storage server (see Facebook's OSDI 2010 paper) uses a CDN as the first level of its service framework. If the CDN has a cache miss, the request is routed to the back-end. For this kind of multi-level service framework, the request pattern, cache hit/miss pattern, and load balance are more complex and harder to handle than in the simple framework of LARD.

2. The assumption that all data fits on one node is not true anymore. In the current "big-data" era, the data set may be partitioned across multiple nodes for performance and fault tolerance. Typical examples include GFS, Facebook's Haystack photo storage, etc.

Problem:

The paper talks about strategies for satisfying client requests, implemented by a cluster-based network server with a front end representing the cluster. The cluster consists of several back-end servers, each of which can satisfy a client request. The main techniques for this problem at the time of this paper were:

1. Load Balancing (Weighted Round Robin)
2. Locality-Based techniques

The two techniques are mutually exclusive. Load balancing distributes the load but there are more cache misses. Locality-based approaches have fewer cache misses but the load distribution is skewed. The authors of this paper came up with a method to emulate the effects of both in order to enjoy the benefits of each technique.

Therefore the paper has two main goals. The first is to have load balancing of client requests along with high cache hit rates (locality) at the back-end servers. The second is to have an efficient hand-off mechanism for moving a connection from a client, initially accepted at the front-end server, to a back-end server. The front-end server hands off the request to the back-end server.

Solution:

In order to combine load balancing and high locality, the paper proposes a solution called LARD (Locality-Aware Request Distribution). In this solution the front end maintains a one-to-one mapping of targets to back-end nodes. When a client request arrives for a particular target, it is forwarded to the respective back-end node. In this way we get high locality. The effective cache size is the sum of the cache sizes of all the back-end servers. This helps in more effective use of memory and fewer cache misses.

In the case where there is a load imbalance, the target is assigned a new back-end node with less load. In this way we get load balancing. Therefore we get a solution which has a foot in both worlds. The solution is further optimized by providing replication, in which requests for a given target are distributed over a set of back-end servers. In this way no single server is overloaded, which provides further load balancing. In order to implement hand-off they use a hand-off protocol that runs on top of TCP. It runs on the front-end and the back-end nodes and is transparent to the client. The front-end efficiently forwards ACKs from the client to the back-end server.

All of the above techniques provide good speedup and improve the throughput seen by the client.

My Views:

I feel this paper is of significant importance. The LARD solution is a well thought out solution which incorporates the benefits of two methodologies that conflict with each other. However, most of the experiments have been done using simulations; I would like to see more real-world experiments in support of their claims. I also think there were a few minor design loopholes, namely:

1. There is only one front-end server, so it is a single point of failure; having more than one front-end server could provide greater availability and reliability.

2. Also, all acknowledgments from a client pass through the front-end server. This could create a bottleneck at the front-end server. Since a back-end server can talk to the client once a hand-off is done from the front-end server, the client could send its ACKs to the back-end server.

This paper presents the “Locality Aware Request Distribution” algorithm, a combination of a content-based request distribution algorithm (which can improve locality in the back-ends’ main memory caches) and a load-balancing algorithm. It also presents a TCP Handoff protocol which aids in implementing the above algorithm in a real system.

As the Web Caching paper quotes - “The key performance of World Wide Web is the speed with which content is served to users.” Since web servers are major hubs for information dissemination and thousands of users request data at the same time, it becomes critically important to serve these requests (and use the system resources) in an efficient manner. Also, since there is no predefined request pattern, there is a need for a request distribution algorithm that dynamically adjusts to changing circumstances.

Previous approaches to request distribution in cluster servers mainly focused on load balancing. But there is scope for performance improvement by taking locality (content) into consideration, to maximize cache hits. Locality alone, however, will not give us an optimal solution, since it may lead to a highly unbalanced system. So the present paper takes both locality and load distribution into consideration to devise the algorithm “LARD”.

The major contributions of this paper are:
1) This algorithm takes into account both load balancing and content of the request for request distribution.
2) The idea of dynamically allocating and de-allocating nodes as and when needed in the LARD/R algorithm, and the setting of T(high) and T(low) for an optimal trade-off between delay and load balance.
3) Inclusion of a front-end server which has a valid global view of the system. It can implement a globally efficient request distribution algorithm.
4) The idea of aggregation of cache size by content based request distribution which makes it scalable.
5) Design of TCP Handoff protocol


Flaws:

1) The authors have three goals: (a) locality-aware request distribution, (b) secondary storage scalability (by partitioning data), and (c) special-purpose back-end nodes. The result does not seem to be a distributed system but rather a cluster of nodes clubbed together to achieve these three goals.
2) The front end can become a single point of failure. It breaks the norm of having decentralised control, and the front-end server’s total capacity may not be utilised.
3) Also, the TCP handoff doesn’t seem to scale. Even though the handoff overhead is small, it still incurs an extra cost.
4) Also, the authors talk about special-purpose back-end nodes in the introduction, but the same approach may not work for them. The present approach balances load by dynamically allocating and de-allocating nodes when and where necessary, but if we have specialized types of nodes, the load may not be balanced properly.
5) LARD seems to be good for a small-scale distributed system, e.g. a system for an organisation, but I don’t see it scaling to a large distributed system. Take the Google search engine for example: how many people issue similar queries in a day, can these similar queries be routed to the same server, and for how long does the data have to stay in cache?


Applications to Real World Systems:

1) During compilation, knowledge about the content of the program (such as type qualifiers, temporal locality, spatial locality...) can help in optimizing the generated code.
2) We can use a similar approach to schedule data requests on RAID. Requests regarding the same file can be grouped together, or a request can be routed to the particular disk that can service it from its disk cache.
3) Content-based page sharing: pages are shared across different virtual machines (done by calculating a hash of the page contents).
4) Content-aware page scrolling: these days scroll bars are designed depending upon the content and layout of the document.
