
Web caching with consistent hashing

Karger, D.; Sherman, A.; Berkheimer, A.; Bogstad, B.; Dhanidina, R.; Iwamoto, K.; Kim, B.; Matkins, L.; Yerushalmi, Y. (1999). "Web caching with consistent hashing". Computer Networks 31 (11): 1203–1213.

Reviews for this paper or for Locality-Aware Request Distribution are due Tuesday, 1/25.

Comments

This paper describes a caching system for Internet content that does not require central coordination, making greater scalability possible. The problem is that having clients fetch data directly from the original providers is often undesirable for a couple of reasons. First, clients may be physically located far from the provider, so data has to be sent great distances, leading to excessive network congestion and longer-than-necessary delays. Second, popularity (and hence server load) is unpredictable, so it is not practical for all data providers to prepare for peak loads.

The paper proposes using caching servers near the clients. Given the size of the Internet, a great number of servers is necessary, so coordinating interactions between servers won’t scale. Thus, the paper suggests rules (consistent hashing) that determine who is responsible for what content. Also, two ways for clients to find the appropriate caching server are described and implemented.

The “ideal” approach suggested is to have clients themselves run specially configured browsers that follow the rules for finding the correct data. This pushes the work out to the greatest number of machines. Unfortunately, not all browsers can be configured this way, and it would be nice for all systems to take advantage of the caches by default, so an alternate approach is described that pushes the functionality to the DNS servers. While this integrates well with existing technology, this second approach unfortunately causes the work to be performed more centrally. Additionally (if I correctly understand DNS), content from the same site would have to be cached by the same server. For example, www.youtube.com/video1 and www.youtube.com/video2 would need to be cached by the same cache server, since DNS deals with domains, not subdirectories on servers (please correct me if I’m wrong on this), and that would obviously be problematic.

The ideas in the Amazon Dynamo paper (such as the consistent hashing for distributing storage responsibility among a changing number of servers) seem to borrow heavily from this work.

The main ideas of this paper that can be applied to distributed system design in general are: (1) use pre-defined rules rather than coordinating effort later to reduce communication and increase scalability, and (2) be willing to compromise the idealistic design in order to integrate with existing technology.

Karger et al. utilize consistent hashing to distribute load amongst multiple web caches and provide a larger effective cache size.

Previous approaches to distributed caching of web documents potentially require communication with multiple caches to determine if an object resides in a cache. Objects were not limited to only a few caches, meaning an object might exist in every single cache. This redundancy and the need for each machine to potentially cache any object resulted in a system-wide effective cache size equivalent to the cache size on a single machine. Employing a hash function to assign each URL to a specific cache helps alleviate the issue of a single request requiring communication with multiple caches. However, a standard hash function needs to be changed when adding or removing a machine, resulting in a reassignment of URLs to different caches and effectively invalidating all existing cache entries.

Consistent hashing avoids the need to communicate with multiple caches by assigning specific URLs to specific caches. The system can be thought of as mapping each URL to a point on the unit circle. Caches are also randomly mapped to multiple points on the circle. A specific URL is expected to be cached on the machine that is next encountered when traveling clockwise around the circle from the point where the URL was mapped. Adding a cache requires mapping a new machine to the circle. The new machine "steals" some URLs from the next cache on the circle, but the remaining URLs all remain assigned to the same cache. In summary, consistent hashing provides the benefits of: 1) relying on only a few caches to be responsible for each URL, improving the effective cache size of the system, and 2) facilitating the easy addition and removal of caches without requiring all URLs to be reassigned to different caches.
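
To make the geometry concrete, here is a minimal Python sketch of the unit-circle scheme described above; the MD5-based hash, the cache names, and the number of copies per cache are illustrative choices, not details taken from the paper.

```python
import hashlib

def point(s):
    """Map a string to a pseudo-random point on the unit circle [0, 1)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 10**8 / 10**8

class ConsistentHashRing:
    """Each cache is hashed to several points on the circle; a URL is served by
    the first cache point reached moving clockwise from the URL's own point."""

    def __init__(self, caches, copies=5):
        # Multiple copies of each cache smooth out the load across the circle.
        self.points = {point(f"{name}#{i}"): name
                       for name in caches for i in range(copies)}

    def cache_for(self, url):
        p = point(url)
        clockwise = [q for q in self.points if q >= p]
        q = min(clockwise) if clockwise else min(self.points)  # wrap past 1.0 back to 0
        return self.points[q]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.cache_for("http://example.com/index.html"))
```

Adding a cache only inserts new points into the ring; URLs whose points do not fall just before one of those new points keep their old assignment, which is the "stealing" behavior described above.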

An actual implementation of the system uses the Domain Name System (DNS) to allow browsers to determine which cache they should contact for a given URL. A hash function written in JavaScript is used at the client browser to map the URL to one of 1000 virtual caches. The DNS server maps the virtual cache to an actual cache using consistent hashing and status information from all caches. Although using DNS works around limitations of the client's browser, it has a few shortcomings. First, DNS entries can be cached at clients or other DNS servers, allowing old mappings from virtual caches to physical caches to persist for a while even after the assignments have changed. Fortunately, consistent hashing allows multiple views of the caches to be held simultaneously. The point on the circle to which a given URL maps does not change, but the presence or absence of a cache on the circle (depending on the age of the DNS records) may mean traveling farther around the circle before finding the appropriate cache. Some redundancy among caches may result, but the system-wide effective cache size is still larger than with previous approaches. The second issue with DNS is the latency of performing a DNS lookup. Even if both the DNS server and the caches are in the local network, receiving a file from the cache requires at least two LAN RTTs: one for the DNS lookup and one for receiving the item from the cache. Depending on the RTT to the server where the original file resides, the added cost of the DNS lookup may result in longer latencies than contacting the original server directly.
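
As a rough sketch of this two-step lookup (browser hashes to a virtual cache name, then DNS resolves that name via consistent hashing), the Python fragment below may help; the virtual-name format, DNS zone, cache addresses, and hash functions are all hypothetical, and in the real system the first step runs in the browser while the second runs inside a modified DNS server.

```python
import hashlib

NUM_VIRTUAL = 1000                              # the implementation uses about 1000 virtual caches
CACHES = ["10.1.0.1", "10.1.0.2", "10.1.0.3"]   # hypothetical physical cache addresses

def circle_point(s):
    """Hash a string to a point on the unit circle [0, 1)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 10**8 / 10**8

def virtual_name(url):
    """Browser side: a fixed, standard hash picks one of the virtual cache names."""
    h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    return f"vcache{h % NUM_VIRTUAL}.cache.example.net"   # hypothetical DNS zone

def dns_resolve(vname):
    """DNS side: consistent hashing maps the virtual name to a physical cache.
    Only this mapping changes when caches join or leave; the browser step stays fixed."""
    p = circle_point(vname)
    ahead = [c for c in CACHES if circle_point(c) >= p]
    return min(ahead, key=circle_point) if ahead else min(CACHES, key=circle_point)

print(dns_resolve(virtual_name("http://example.com/index.html")))
```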

Consistent hashing has applications beyond web caches. Systems performing computations or transactions, e.g. online purchases, can load balance user requests amongst multiple servers using consistent hashing. Also, the approach Karger et al. propose is very similar to the distributed hash tables being employed in recently proposed content naming systems, e.g. DOT or CHORD.

This paper describes a distributed web caching system where clients determine the cache node that contains a given page by using hashing, which improves performance and reliability for a single query. This scheme maintains high performance in the presence of changing availability of cache nodes by using a consistent hash that always restricts a given page to a small group of nodes.

Web caches using a single server have limited effectiveness because limited storage leads to false misses for pages that were recently evicted from the cache, while separate caches for small groups of users are of limited use because the pool of requests with which to build the cache is smaller. Some existing distributed web caches have high request latency because the client must either query a central directory to find the correct cache for a given page, or send a multicast request and wait for responses from every cache to see if any has the desired page. Finally, a client could determine which cache node to use by hashing the URL, but if not implemented properly, a hash could map all URLs to different caches when just one cache node becomes unavailable.

This paper applies a consistent hashing scheme to create a distributed web cache. Even when individual cache nodes become unavailable, a consistent hash will still map most URLs to the same cache nodes, so most of the distributed cache will remain valid and balanced. The hash works by applying an ordinary hash to a URL, then using this hash to map the URL to a point on a unit circle. Cache nodes are likewise hashed to points spread around the circle. A client picks a cache by finding a URL's point on the circle and then choosing the next cache that appears clockwise from that point. While current browsers cannot implement a full consistent hashing scheme, browsers can map URLs to a large, fixed set of virtual cache nodes. The authors implemented a DNS-based system that maps virtual caches to real caches using consistent hashing and the current availability of caches. Comparisons against three existing cache systems verified that a hash-based approach incurs lower latency because there is no inter-cache communication on a miss.

An important property of this distributed cache system that would be valuable in a real-world setting is the ability to dynamically add new cache nodes, so that the system's popularity does not lead to its own demise. One point that is not clear is how the number of virtual caches was selected. This number may eventually limit the number of real caches that can be used because one virtual cache name may be shared between too many caches. Also, the load balancing proposal for popular pages only works at the granularity of virtual cache names, which could be problematic if two highly popular pages are hashed to the same virtual cache, where effectively distributing the load between multiple caches would be difficult. To prevent malicious users from manually requesting pages from arbitrary nodes, physical caches would need to determine that they could never cache a particular page under any view, and it is not clear that this guarantee is possible.

Summary:
This paper presents Cache Resolver, a distributed Web caching system that uses consistent hashing to cache pages in one or a few caches. Consistent hashing distributes pages to caches in a load-balanced manner, increasing cache hit rates and decreasing latencies as compared to other Web caching approaches.

Problem they are solving:
This paper addresses the issue of where Web pages should be cached. Caching is desirable to increase efficiency and reliability of data delivery over the Internet. Using a single cache location isn’t sufficient because it is unlikely to be able to service many users effectively and is likely to experience high miss rates. Previous work has used a “cooperating caches” approach, which functions like a two-level cache system. On each request, the primary (local) cache is checked. On a miss, all secondary (cooperating) caches are checked. This approach can be problematic because the client may have to wait for all cooperating caches to respond before either returning a page or going to the content server and because there may be a high level of data duplication. Cache Resolver alleviates these issues by removing the inter-cache communication and coordinating caching to prevent unnecessary duplication (though there is intentional replication in this system), allowing for increased overall storage.

Summary of paper contributions:
Cache Resolver provides an efficient way to decide where pages should be cached and to retrieve them efficiently. Consistent hashing is used to map a URL to a particular cache. Using a random universal hash function with certain properties, the URL string is mapped to [0,...,M], which is then scaled to the range [0,1]. Each cache is mapped to a small number of locations on the unit circle, and pages are then mapped to the closest cache clockwise from their location on the circle. Even assuming that clients do not have a complete view of the system, so that each client knows of only a subset of the available caches, each URL will be mapped to only a small number of caches. As caches enter and exit the system, only a small number of pages are remapped--those for which that cache was the closest clockwise on the circle--making the system flexible in a dynamic environment. Consistent hashing also aids in load balancing, since a good hash function will distribute pages randomly around the circle. Each cache is also copied to several different points on the circle to aid in load balancing.

On a client request, the client browser computes a standard hash of the URL to a range of virtual caches. DNS then maps this virtual cache name onto an actual cache IP address using consistent hashing. Their results suggest that this approach is superior to a standard primary/secondary cache approach, both in cache miss rate and also average request latency. The authors suggest using geographically close caches and hot pages to further decrease latency and provide better load balancing.

Flaws:
The largest flaw in this paper is Cache Resolver’s reliance on DNS for virtual cache name translation. Because browsers are not flexible enough to support consistent hashing, this functionality is outsourced to DNS. Although Section 3.3.1 attempts to justify this decision, it seems problematic to alter DNS in this manner; instead of isolating the burden at the client node, work has been passed to DNS.

How the ideas in the paper are applicable to real systems:
Using consistent hashing allows the caching approach to remain robust as the system changes. In any Web caching environment, servers will enter and exit the system; an approach that must remap all pages in such an event will likely have very high miss rates each time the system is augmented. Consistent hashing also provides an important level of availability--web caching will continue to work even if the system changes drastically in size. As long as at least one cache exists, some amount of Web caching can be done. Consistent hashing is used in other distributed environments where availability is of paramount importance, including Dynamo, Amazon’s key-value store.

Summary:

This paper proposes a novel design for a web cache network that uses a hashing technique, consistent hashing, to provide users a means to efficiently locate the correct cache for a URL while also evenly utilizing each cache in the network.

Problem Description:

Their web cache design attempts to solve the problem of delay and failure in data transmission on the Internet. This problem is important to solve because degradation in performance in both of these aspects greatly affects the usability of the Web. The authors discuss that previous work performs well in the optimal case, but fails to scale when conditions are suboptimal. One solution, a cooperative cache, introduces significant overheads when a cache needs to consult other caches due to a cache miss. Another solution, a centralized directory cache, also introduces extra overhead with directory queries and can also be a single point of failure for a web cache network.

Contributions Summary:

The proposed solution improves on previous work by eliminating cache-to-cache communication in the web cache network while still mitigating the delay and failure problems. There are a few different pieces of the solution that combine to make this happen. The first is that the client uses a hash function to map URLs to the caches that contain them. The invariant here is that if the mapped cache doesn’t have the data for the URL, no other cache is consulted for it. The second piece is a consistent hash function, which is essentially a hash function that is robust to differences in the set of caches known to each client. The result is that regardless of the caches known to the client, the hash will always map a URL to a bounded set of caches. The authors contribute a set of experiments that demonstrate improvements over cooperative caches and directories for a small experimental setup.

Flaw:

I think the largest flaw appears to be in their experimental setup: the test driver runs the clients on a single machine. When they say a single machine, I assume it has one interface to the network. If this is actually the case, the multiplexing of the clients on this single interface introduces an artificial bottleneck into the experiment that could mask any simulation of a load generated by multiple clients.

Application to real systems:

In the presentation of their implementation, the authors discuss that they had to place the consistent hashing functionality in DNS servers instead of in the users’ browsers. I think the authors understated the benefits of placing this functionality at the DNS servers.

First, placing this functionality in DNS allows the network and server administrators to manage the consistent hashing functionality with little, if any, user involvement. Second, I think placing functionality in DNS would allow for more transparent fault-tolerance because the user’s browser would not be responsible for updating the list of virtual names.

With its use of DNS, it seems to me that the proposed web cache system is a predecessor of content-delivery networks, and content-delivery networks carry this technique of using DNS for load balancing and improved locality to its logical completion.

Web caching is an idea used in computer networks to reduce the load on the network by storing a piece of information requested by a node and providing it locally on subsequent requests by other nodes. This paper describes the use of consistent hashing in the development of a web caching system that is more balanced, scalable, and fault-tolerant, and has lower latency, than other caching systems.

A distributed web caching system performs two important functions: it decides which cache node(s) should store a piece of data, and it locates the cache node(s) that contain the data being requested. Earlier work described in the paper used a ‘cooperating’ cache hierarchy that fixed a client to one or more primary cache nodes. If the requested data was not present in the primary node, that node could either broadcast to all the other nodes, contact a fixed set of secondary cache nodes, or use a central directory service to find out whether the requested object was available anywhere in the caching system. Broadcasts put pressure on the available network bandwidth, and a directory service can easily become a central point of failure. Data duplication also uses up a lot of the available storage, since the same object can be cached at many different nodes. This leads to the idea of using hashing as the means of storing and retrieving objects across the set of caching nodes.

Clients hash the URL of an object, and based on the mapping of hash values to caching nodes, the responsible node can be asked with a single unicast request to return the object or fetch it from the web. This naive use of hashing quickly leads to problems in two scenarios. First, the mapping assumes a known, fixed number of caching nodes, which can easily change in a real system; adding or removing nodes invalidates all previous mappings, and the whole system has to rebuild its cache from scratch. Second, in a real deployment it is not guaranteed that all clients have the same view of the caching nodes available in the system, and these differences in views can lead to an unbalanced distribution of objects, overloading some caching nodes more than others. This leads to the idea of using consistent hashing.

The idea of consistent hashing is to locate the closest bucket rather than the exact bucket. This is done by mapping both the objects and the caching nodes onto a unit circle. The caching node that is closest in the clockwise direction from an object's point is used to store it. To provide a more uniform distribution, copies of each node are mapped to several different random points on the circle.
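
A short, self-contained demonstration of this property (all cache and URL names here are synthetic) shows how few assignments change when a caching node is added, in contrast to the full invalidation that naive hashing suffers:

```python
import hashlib

def point(s):
    """Hash a string to a point on the unit circle [0, 1)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 10**8 / 10**8

def assign(url, caches):
    """A URL belongs to the first cache point reached moving clockwise from its own point."""
    p = point(url)
    ahead = [c for c in caches if point(c) >= p]
    return min(ahead, key=point) if ahead else min(caches, key=point)

urls = [f"http://example.com/page{i}" for i in range(1000)]          # synthetic URLs
before = {u: assign(u, ["cache-a", "cache-b", "cache-c"]) for u in urls}
after = {u: assign(u, ["cache-a", "cache-b", "cache-c", "cache-d"]) for u in urls}

# Only the arc of the circle claimed by cache-d changes hands; every other URL keeps its cache.
moved = sum(before[u] != after[u] for u in urls)
print(f"{moved} of {len(urls)} URLs were reassigned")
```
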
To implement their scheme, the authors had to push the consistent hashing feature into DNS resolution rather than provide native support in the browser, which was difficult at the time. The URL of an object is mapped to a virtual name, which is in turn mapped to the real IP address of the caching node that likely holds the object. This extra level of indirection allowed them to accommodate the dynamic nature of a real system inside the DNS resolution server. Their measurements include hit rate and average request latency for different web caching schemes, and consistent hashing performs well on both metrics across different cache sizes compared to the other systems.

There are a few things that bother me. Figure 2 seems slightly unjustified, since they did not really use the two-level cache hierarchy for the common mode. Also, the miss rate and latency measurements were not done on real data; rather, they were done using artificially generated requests over a set of 1500 files.

I like the fact that they went ahead and made another use of the DNS scheme to show that locality can also be taken care of easily, although it does not fit so smoothly with the whole idea of consistent hashing. It was also nice to see that they recognized that some content is invariably more in demand than the rest and can overload even the best hashing scheme. Overall, the paper is a nice demonstration of how hashing lets distributed systems achieve load balancing and fault tolerance.

The paper proposes a Chord-like consistent hashing technique for web caches. The paper argues that its solution has significant advantages over existing web caching solutions, which suffer from message broadcasts and long wait times in cooperative caching, and from single points of failure and bandwidth requirements in the case of directory services. These advantages come from a hash function that can determine the cache location directly, reducing lookup times and the amount of message passing.

The paper further proposes to use consistent hashing so that only a small amount of cached data moves when new cache locations are added. Consistent hashing also copes with the problem of multiple views. The scheme maps the caches and the URLs to points on a unit circle, and a URL is cached at the next cache location encountered when moving clockwise from its position on the circle. In this way, the addition of a new cache location causes only a few handoffs of URLs from existing cache locations to the new one. Also, under multiple views, only a few caches can end up responsible for a given URL, since only a few nodes can be near the URL's point on the circle.

The authors found it hard to build consistent hashing directly into the web browser. This is due to the dynamic nature of the caches in the ring: they might enter and leave the network all the time, whereas the browser's code is static once downloaded. To keep up with the dynamic set of caches and their positions on the ring, the Domain Name Service is used to figure out the appropriate cache.

Flaws: As the system intends to keep the caches in diverse geographical locations, I wonder whether it would cause a lot of network traffic when nodes enter and leave the system, due to the transfer of responsibility for a few URLs from an existing node to a new node, or from a leaving node to an existing node. Also, modifying the functionality of DNS makes the solution hard to integrate with the existing system, since it isn't backwards compatible, and such changes are difficult and can affect other parts of the Web.

Application to real-world systems: Consistent hashing supports fault tolerance, since it allows caches to enter and depart the ring while causing only a small amount of data movement. This makes the system highly available.

Summary

Content delivery on the Web is vulnerable to failures and delays if we rely solely on originating content providers to serve requests. Caching systems can improve the performance and robustness of request delivery but introduce the problem of distributing data and request load among the caches. This paper introduces a consistent hashing technique that achieves load balance across a dynamic group of caches and minimizes cache misses.

Problem

The central problem addressed in this work is that content delivery on the web is unreliable if clients can only retrieve objects from their originating content providers. Faster and more reliable delivery can instead be achieved by first contacting a group of caching servers that are closer to the client. Previous Web caching systems using cooperating caches suffer from inter-cache communication overhead. They can also result in excessive data duplication among the caches because there is no global method of locating the correct cache for a given object.

Contributions

The system proposed here pushes the responsibility for deciding which cache should hold a resource to the client via a simple hash calculation. Each cache is mapped to a point on a unit circle, and likewise any URL is mapped to a point on the same circle. The 'correct' cache for the URL is simply the nearest cache clockwise from the URL. No inter-cache communication or congesting multicast messages are necessary to locate a cached object. Because the hash function distributes caches and URLs uniformly over the circle, it ensures that the assignment of objects to caches is load balanced. Adding and removing caches is also simple and only affects the assignment of a few objects; the remaining caches and their contents remain untouched. Further, since any client will locate a URL at the same cache, those contents will not be duplicated in any other cache.

Although the consistent hashing responsibility could be handled entirely in a client browser, at the time of writing the authors found current browsers insufficient. Instead, in the browser they map URLs to a range of 1000 'virtual servers' and forward the virtual server name to a DNS service. The DNS service in turn handles the actual consistent hashing and maps to an actual cache IP address. This use of a DNS service has the nice result that the client does not need to know anything about this proxy caching business. It can simply operate as usual with the caching system transparently operating in the background.

Flaws

Although the consistent hashing scheme laid out here achieves load balance in the sense of evenly spreading resource objects over the pool of caches, some resources on the web are much more popular than others. A single resource or a few resources may in fact be orders of magnitude more popular than all others, and only for short periods of time. The authors do present some ideas for responding to hot pages, but they are not elaborated upon and no performance evaluation is offered for these cases.

Also, web caching in general is very useful for static resources that are mostly read-only and infrequently updated. Today's web, however, is increasingly dominated by dynamically generated content that may depend on session context and SSL connections. How caching can be used for this type of content is not clear from this particular paper.

Applicability

The work is highly applicable and relevant to today's Web where improving response time translates directly into dollars for many large enterprises. Implementing a system like the one laid out here is still plausible for large networks with many users like a university or company or even an ISP. But the techniques also seem to have presaged the Content Delivery Network industry which also uses DNS techniques for transparency, localization and load-balancing.

Summary:
The authors of this paper present a method of caching web-based content through consistent hashing. Consistent hashing allows clients to intelligently decide where to look for data on the network, which improves performance by increasing cache hits and decreasing delays.

Problem:
As the amount of data on the Web and the number of people accessing it grow, it becomes more and more difficult to deliver data efficiently. One technique that has been used to improve data delivery is caching, wherein machines that are close to a client machine cache data from a server machine. In this manner, even if the server becomes overwhelmed, the client can still retrieve the cached copy. However, simply caching data on a single machine has several limitations: data flow is still limited by the number of users the caching machine can serve, storage space on a single machine is limited, and machines entering or leaving the network can cause problems. One solution to (some of) these problems is to use multiple caches, but this presents further problems: keeping the various caches consistent, duplication of their data, and long cache miss times.

Contributions:
The main contribution of this paper is the implementation of a web caching system that uses consistent hashing. Their approach involves mapping URLs and caches to points on a unit circle. URLs are stored in the cache that they are closest to on the circle (“closest” being defined as the first cache you encounter when moving clockwise from the position of a URL). In order to avoid clumping of nodes on the circle, they make copies of each cache and spread them out around the circle. This solution has several nice properties. When a cache enters or leaves the system, only a small number of URLs change caches, as only the URLs near that cache on the unit circle will be affected. Furthermore, since a relatively small number of caches will ever hold an item, there will not be a large number of caches responsible for one item.
Another interesting contribution of the paper was their novel use of DNS for finding caches. Since browsers can't handle the task of determining the location of a particular piece of data, they have the browser perform a standard hash of the URL to a list of virtual caches. They then use DNS to go from these virtual addresses to physical IP addresses through consistent hashing. They spent a lot of time defending this decision, but I thought it was a perfectly legitimate choice, especially given the computing environment they were working in (i.e. browsers unable to handle the task alone).

Flaws:
I thought one flaw in this paper was their test setup. The internet is an extremely complicated thing to simulate, and though they made a good effort, I'm not sure it captures the diversity and vastness of the internet. Another aspect of the paper I thought could have been improved was their discussion of advanced load balancing through hot pages. The idea of finding which servers are experiencing heavy load and dealing with that accordingly seems important, but I felt like they didn't do it justice. I would have liked to see some more analysis of their approach here. Why is it such a difficult problem? Did their solution work well? What else did they try, or could they have tried?

Application to real systems:
I was somewhat torn on this. On the one hand, I thought there were some cool, interesting ideas here. However, this paper was published in 1999, and many of their assumptions are relevant to the computing environment of 1999. Browsers were small and couldn't support much (thus browsers couldn't support the task of finding nodes), and the internet was much smaller (thus the amount of data and traffic on the internet were of a much smaller magnitude). I think things have changed enough since then that some of these techniques might not be as relevant as they used to be. Browsers today have way more capabilities than they used to, so might be able to handle the process of finding where data should be cached. Even more importantly, the amount of data stored on the internet has increased exponentially. Given the number of requests and amount of data stored on a large network, like Facebook or Google, I find it hard to believe that the techniques presented in this paper could handle loads of that magnitude.

Summary
The paper introduces a distributed web caching system that makes use of a consistent hashing scheme to reduce cache misses and avoid inter-cache communication overheads. Moreover, the system performs well on the issues of locality, load balancing and fault tolerance.

Description of the problem
Existing work on web caching systems (cooperating caches) has the following problems:
1. Broadcast queries and transmissions of cache data consume bandwidth, since queries and duplicate data need to be transferred between the primary cache and the cooperating caches.
2. Communication between the primary cache and cooperating caches slows down performance on second-level cache misses. The primary cache must wait for all cooperating caches to respond before contacting the content server.
3. Data might be copied from one cache to another, which duplicates data among caches. These copies evict existing pages from the cache and increase cache misses.

Summary of contributions
1. Consistent hashing is used to enable clients to determine by themselves which cache to fetch the required data from directly. The main idea is that, using a standard base hash function, both URLs and caches are mapped to a unit circle, and a URL is assigned to the first cache encountered when moving clockwise from the URL's point. Under this scheme, misses are discovered faster and inter-cache communication is avoided. With the help of consistent hashing, load is distributed uniformly among the caches, and the failure of a cache affects only a few URLs.
2. Virtual cache names are introduced to hide changes in the set of physical proxy caches. Instead of the script in the browser naming an actual proxy cache, a virtual cache name is generated, and it is translated to an actual cache IP address by DNS servers that implement consistent hashing.
3. A two-layer DNS system is provided to improve locality, and the load on each cache is tracked so that more caches can be used for 'hot' items.

How the ideas are applicable
The idea of consistent hashing has the good property of decentralization, meaning that there is no central node in the system; all caches are peers. In this way, no central coordination is needed, and the failure of any single cache is far less serious than the failure of the central node in a centralized system.
In addition, consistent hashing is scalable. The joining or leaving of a cache only affects the URLs whose hash values are closest to that cache's hash values; the contents of most other caches won't change. Moreover, mapping a URL to a cache takes only O(log n) time, where n is the number of caches, which is very efficient even with a large number of caches.
Decentralization and scalability make consistent hashing applicable in distributed storage systems like Dynamo as a technique to partition and replicate data.

This paper presents a web caching solution, the Cache Resolver, that uses consistent hashing to improve performance when delivering content on the World Wide Web. Using consistent hashing to resolve a page to a given cache not only reduces inter-cache communication and cache miss rates but also provides good load balancing, even in the presence of changing user views.

The basic problem addressed by the paper is that of resolving a request to the cache node that contains the required page with the least overhead, while also distributing pages uniformly among the caches to achieve better load balance. A single cache with fixed storage not only leads to false misses, but is also limited in the number of requests it can serve. The use of “cooperating caches” still has the overhead of higher bandwidth due to broadcast queries and secondary cache misses. Further problems in cooperating caches arise when data gets duplicated across the caches and these copies evict other pages, decreasing cache hits. Hashing can be used to compute the cache containing the required page, improving response time due to the reduced communication among caches. The inherent randomness of a hash function balances requests across the caches, removing redundancy. The use of consistent hashing further improves load balance, even when caches are added to the system or become unavailable.

The system presented, the “Cache Resolver”, consists of the following: caches, a browser mapping to virtual caches, and DNS. For each request, the browser maps the URL to a range of 1000 virtual caches using a standard hash function. Though the authors initially propose placing cache resolution inside the browser, they separate it out and implement it in DNS. These virtual caches are then resolved by the DNS to actual cache machine IP addresses using consistent hashing. Consistent hashing uses a standard hash function that maps every URL to a point on the unit circle. Each cache is also assigned to some points on the circle, and each URL is assigned to the first cache point reached from it while going clockwise. An implementation of consistent hashing using binary trees has a very low overhead of O(log n) per lookup for n caches. The placement of URL and cache points on the circle is random due to the use of a hash function. Hence, when new cache points are added, only a few URLs change and are mapped to the new cache. For a similar reason, only a few caches become attached to a given URL point across multiple user views.
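
The O(log n) lookup comes from keeping the cache points in a sorted structure and doing a successor search. Here is a small Python sketch of that idea; the hash, cache names, and number of points per cache are illustrative, and a sorted array with binary search is used here in place of the binary tree mentioned above, with the same asymptotic cost.

```python
import bisect
import hashlib

def point(s):
    """Hash a string to a point on the unit circle [0, 1)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 10**8 / 10**8

class Ring:
    """Cache points are kept sorted; the clockwise successor of a URL's point is
    found with a binary search, giving O(log n) time per lookup for n cache points."""

    def __init__(self, caches, copies=10):
        entries = sorted((point(f"{c}#{i}"), c) for c in caches for i in range(copies))
        self.keys = [k for k, _ in entries]
        self.names = [name for _, name in entries]

    def lookup(self, url):
        i = bisect.bisect_right(self.keys, point(url))
        return self.names[i % len(self.names)]   # wrap around past the last point

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.lookup("http://example.com/index.html"))
```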

The test results comparing the Cache Resolver to other web caching techniques show that it performs better in terms of both cache miss rate and average request latency. Some extensions, such as using geographic location to reduce latency and handling hot pages for load balancing, as well as the inherent fault tolerance of the Cache Resolver, are also discussed.

Since the nodes in a distributed system may keep changing due to network problems or other issues, a strategy like consistent hashing that does not increase the load on any one node can be very useful in general. Consistent hashing has also been used in distributed hash tables to locate nodes in a distributed setup efficiently.

The authors select 1000 virtual caches to be mapped and this may become a bottleneck later while scaling to larger systems as each of these virtual caches may be mapped to many real cache machines. Choosing an appropriate number of virtual caches is still an open question.

Summary:
The paper is about a web caching technique that uses consistent hashing to determine which web cache to contact. Some of the authors previously wrote a theoretical paper about consistent hashing, so this paper is the implementation of web caching with consistent hashing based on that theoretical work.

Description of Problem:
Network congestion and server swamping are two main causes of delays and failures for web requests. Web caching techniques have been used to improve the efficiency and reliability of these requests by reducing the load on the server and the network traffic. Web caching can be implemented in many different ways. The simplest approach is to have a single caching machine, which has many flaws: it is not scalable and becomes a bottleneck once it reaches its maximum user or storage capacity. Some previously developed web caching techniques involve cooperating caches, but the approaches discussed in the paper rely on sending many broadcast messages or using centralized directories. As you can see, implementing an efficient and reliable web caching technique is not an easy task.

Summary of contributions:
Web caching with consistent hashing uses a hash function that maps into a range which can be thought of as points on a unit circle. When a new cache is added, only the points closest to the new cache move to it, so most of the other points are not affected. The hash function provides a random distribution, so it automatically balances the load across the different caches. Once the consistent hash is computed, a client only needs to contact a single web cache, so this hashing technique removes the inter-cache communication bandwidth used by some previous approaches. The authors' original idea was to have browsers compute the consistent hash, but they found that browsers were not flexible enough. So instead, they created a web caching system using DNS, which they called Cache Resolver. For comparison, they tested their system against two other systems and observed lower latency and a lower miss rate.

Flaws:
Their implementation involved modifying DNS to compute the consistent hash. It seems possible that this added computation could cause the DNS machines to become overloaded with requests and become a bottleneck. It is also not clear to me how a DNS machine propagates a change, such as the addition of a new cache, to all the other DNS machines; it seems like duplication could occur if the DNS machines are not in sync. I also don't really understand whether this is meant to cache the web pages of the entire Internet (it seems like there are so many pages that there would be a lot of misses) or to load balance for a specific list of sites.

Application to real systems:
The idea of computing the cache location seems like a good one, since it removes the inter-cache communication of some previous web caching approaches and tries to avoid duplication. The consistent view is also a good idea, since it allows the system to be modified dynamically without affecting all the caches. The idea of consistent hashing seems sound, but there seems to be more work to be done on the implementation in order to make it useful in a real-world application.
