
Web caching with consistent hashing

Karger, D.; Sherman, A.; Berkheimer, A.; Bogstad, B.; Dhanidina, R.; Iwamoto, K.; Kim, B.; Matkins, L.; Yerushalmi, Y. (1999). "Web caching with consistent hashing." Computer Networks 31 (11): 1203–1213.

Review this or LARD for Tuesday, January 31.

Comments

This paper introduces a new web caching strategy based on consistent hashing that outperforms the multicast queries and directory schemes used in many cooperating caches in terms of response time, hit/miss rate, load balancing, and fault tolerance.
The paper is logically well organized. The introduction discusses the importance of web caching and then points out the drawbacks of a single shared caching machine. To overcome the shortcomings of a single cache in fault tolerance, scalability, and the ability to absorb a large number of requests, cooperating caches are discussed. Typical cooperating caches use a primary cache backed by second-level caches, but this design has drawbacks: the information exchange consumes a lot of network bandwidth, waiting for all the second-level caches to respond adds delay, and duplication of data among caches reduces the hit rate.
Targeting these drawbacks of cooperating caches, the paper proposes a caching scheme based on consistent hashing. Consistent hashing has two aspects: hashing, and consistency. The key idea is to use a hash function to map resources (URLs) to caches, giving clients (browsers) the knowledge of which cache should hold the data for a given URL. As a result there is little or no inter-cache communication and thus less network bandwidth usage. In addition, because requests are unicast, response delay is reduced: there is no primary cache, and there is no need to wait for responses from all second-level caches. Other merits of hashing in this setting include avoiding the maintenance and query overhead of directory-based schemes, introducing no new failure points, and reducing the miss rate.
Using hashing to map resources to cache addresses, instead of a primary/second-level scheme, is a brilliant idea in this paper. But a solution based merely on standard hashing brings new problems: when machines go up or down, resources may be relocated across the cache system almost completely, and with multiple views in the distributed system a given URL may be mapped to many caches and the global load may not stay balanced. These problems require that items be mapped consistently, so consistent hashing rather than standard hashing is needed. The most impressive idea in consistent hashing is mapping both URLs and caches to the unit circle. This design is excellent because a new cache simply "steals" certain URLs from other caches, so when a new cache is added, few URLs change owners. In addition, only a few caches (those close to an item on the unit circle) are ever responsible for that item.
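To make the unit-circle idea concrete, here is a minimal Python sketch of a consistent-hash ring (my own illustration, not code from the paper; the class and hash choices are assumptions). Caches and URLs are hashed to points in [0, 1), each URL is served by the first cache clockwise of its point, and adding a cache moves only the URLs that the new cache "steals".

    import hashlib
    from bisect import bisect_left, insort

    def point(key):
        """Hash a string to a point on the unit circle [0, 1)."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) / 16**32

    class ConsistentHashRing:
        def __init__(self, caches=()):
            self.ring = []                      # sorted list of (point, cache)
            for cache in caches:
                self.add(cache)

        def add(self, cache):
            insort(self.ring, (point(cache), cache))

        def lookup(self, url):
            """Return the first cache clockwise of the URL's point."""
            i = bisect_left(self.ring, (point(url),))
            return self.ring[i % len(self.ring)][1]

    ring = ConsistentHashRing([f"cache{i}.example.com" for i in range(5)])
    urls = [f"http://site.example/page{i}" for i in range(1000)]
    before = {u: ring.lookup(u) for u in urls}
    ring.add("cache5.example.com")              # a new cache joins
    moved = sum(before[u] != ring.lookup(u) for u in urls)
    print(f"{moved} of {len(urls)} URLs changed owner")   # only what the new cache steals
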
In the design and implementation of the whole system, some ideas are also impressive. The first is the cooperation of browsers and DNS resolvers: the browser auto-configuration function performs a standard hash of the URL to one of 1000 virtual cache names, and DNS maps the virtual cache name onto an actual cache IP address via consistent hashing.
The extensions in this paper also contain good ideas for improving performance. For locality, the DNS system is split into a two-layer hierarchy: the top layer directs a user's DNS resolver to DNS servers for his geographical region, and the bottom layer performs the virtual-name-to-IP mapping. For hot pages, a "hot" virtual name is mapped to a list of IPs and DNS round-robins through the list; an adaptive way to decide which virtual names are hot is also introduced.
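As a rough sketch of the browser auto-configuration step just described (the naming scheme and region label below are my assumptions, not the paper's exact format), the browser only needs a standard hash from the URL to one of about 1000 virtual cache names; all the consistent hashing happens later, on the DNS side.

    import hashlib

    NUM_VIRTUAL_CACHES = 1000   # figure taken from the review above

    def virtual_cache_name(url, region="midwest"):
        """Standard-hash a URL to a virtual cache name; the region prefix
        mimics the two-layer DNS split used for locality."""
        bucket = int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_VIRTUAL_CACHES
        return f"vcache{bucket}.{region}.cache-resolver.example"

    # The browser would then issue a DNS query for this name; the DNS servers
    # apply consistent hashing to map it onto a live physical cache.
    print(virtual_cache_name("http://www.cnn.com/index.html"))
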
Beyond the technical details, I also learned some general ideas from this paper. One is that in a distributed system we can shift a responsibility (such as finding the target cache) from the system as a whole to the individual client (the browser). This shift can improve properties of the system such as scalability and performance. Another idea is that a mathematical tool may have desirable properties we can exploit in an application: when doing research or building a system, we can first analyze the properties we want and then look for mathematical tools that provide them.
One limitation of this paper may be the practical feasibility of deploying the idea. Since it requires changes to the browser, it may be hard to get all browsers to adopt, which could prevent the idea from spreading in industry. In addition, I am a little doubtful about the adaptive way of deciding which virtual names are hot; I wonder whether this method is accurate and reliable and whether the result converges.

The paper discusses a new web caching technique using consistent hashing. A single cache for web caching can be unreliable and inefficient; on the other hand, having multiple caches work together can add delay from inter-cache communication. Web caching with consistent hashing aims to let multiple caches collaborate while eliminating inter-cache communication, by letting clients decide for themselves which cache has the required data. The authors ran an experiment in which this technique improved cache miss rate and request latency by more than a factor of two compared to previous techniques.

One good idea used in this technique is to externalize to the client the decision of locating the cache with the desired data. This makes the technique more scalable by eliminating a computation that would grow linearly with the number of requests. However, this probably would not have been a good decision without the main contribution of the paper, consistent hashing. With it, only a small fraction of the cache entries need to be relocated when caches are added or removed, so the clients' local views remain valid longer. In addition, consistent hashing is useful for applications of hashing beyond web caching.

Although the technique was originally intended to have browsers do all the work of locating caches, DNS was used in the implementation to overcome the limitations of browsers. The authors justify the use of DNS, but one possible weak point is that all the DNS servers need to be modified for the technique to work effectively: they need to run dnshelper to perform consistent hashing and map virtual names to live cache machines. Moreover, the authors argue that per-request DNS lookups can be avoided by maintaining a DNS cache in the browser. In my opinion, this is good for exploiting temporal locality, but it would not perform well if requests to the same page are separated by a long time. In the meantime, the cache entry could have been replaced and reloaded into a different cache because of requests from other clients. I assume clients do not request the same page repeatedly in a short period of time, so it would be interesting to learn how effective the DNS cache really is in this situation.

As I understand it, the web caching presented in the paper is done by intermediate servers rather than by the servers that actually host the websites. Today, with increasingly dynamic content specialized for each user or geographic location, this style of caching can be less efficient; I believe caching has shifted more toward the servers that host the websites. However, I think consistent hashing can be applied in that case too, although I am not sure whether the job of locating a cache can still be delegated to clients.

Cache Resolver is a distributed web caching solution that uses DNS and consistent hashing to shard the cached data across servers efficiently, without multiple cache lookups or communication overhead, even in the face of slowly converging views. It manages this by using consistent hashing: hashing the requested URL and then finding the closest available cache clockwise on the unit circle, instead of simply modding by the number of buckets (or caches). Because of this, adding or removing servers only requires moving a small amount of data instead of the entire distributed cache's content.
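A small self-contained demonstration of the contrast drawn above, under assumed hash and cache names: going from 7 to 8 caches remaps roughly 7/8 of the URLs under mod-N hashing, but only the slice claimed by the new cache under consistent hashing.

    import hashlib
    from bisect import bisect_left

    def h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    urls = [f"http://example.com/page{i}" for i in range(10000)]

    # mod-N hashing: adding an 8th cache remaps most URLs
    mod_moved = sum(h(u) % 7 != h(u) % 8 for u in urls)

    # consistent hashing: each URL goes to the nearest cache point clockwise
    def assign(urls, caches):
        points = sorted((h(c), c) for c in caches)
        def owner(u):
            i = bisect_left(points, (h(u),))
            return points[i % len(points)][1]
        return {u: owner(u) for u in urls}

    caches = [f"cache{i}" for i in range(7)]
    before = assign(urls, caches)
    after = assign(urls, caches + ["cache7"])
    ch_moved = sum(before[u] != after[u] for u in urls)

    print(f"mod-N:      {mod_moved}/{len(urls)} URLs moved")   # roughly 7/8 of them
    print(f"consistent: {ch_moved}/{len(urls)} URLs moved")    # a small fraction
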
This paper is attempting to speed up web access by caching the data of overloaded servers. Other caching systems accomplish that, but once the cache grows beyond the capacity of a single server, it must be distributed somehow. Existing solutions are directory-based resolution and two-level cache resolution that uses multicast or broadcast. These suffer from latency, fragility, poor utilization of resources, and noisiness. Cache Resolver solves all of these problems: only one (or a small number) of caches may have a given item, so latency and noisiness are minimal -- one round trip to one server. And because the data can only be cached in a single place, there is more net space available for data, increasing hit rate. Finally, it is robust in that if a server dies, the DNS system can return another related server and continue serving data with no timeouts required on the client side.
Although consistent hashing had already been written about, this paper applies it to the problem of finding which server to talk to while minimizing latency. Their conceptual mapping of hash distance onto the unit circle appears to be an additional novel contribution. Furthermore, the paper interestingly, though not novelly, reuses the highly reliable DNS infrastructure to increase discoverability and reliability. Especially interesting is their exploitation of BIND's round-robin scheduling to provide load balancing and reliability. Performance tests indicate that this solution has a lower miss rate (reflecting better cache utilization) and lower latency (reflecting the quietness of the protocol and the directness of the hashing) than competing distributed caching solutions.
The paper has two flaws, both reflecting web architecture: the usefulness and the importance of this style of caching. Though more relevant in the modern web, dynamic content was quite popular even when this paper was written, and it cannot be cached in such a naive way. Their example site, CNN.com, was most likely dynamically generated even in 2000. This limitation can be worked around by caching only static content (e.g., images), but if that was the intended solution, it should have been stated more directly. Secondly, the paper overstates the importance of static local caching, especially under realistic web conditions: making a single request to the website will likely be faster on average, since most sites are neither overloaded nor in the cache. For such sites, the browser will make a pointless round trip and pollute the cache with little-used data. That said, within a specific niche, this technology could be useful.
Although there are limitations on the stated goals of the paper, the techniques of consistent hashing to find the appropriate shard of a resource and using the DNS system to provide a reliable (to the point that your system is not the biggest concern if DNS goes down) information distribution mechanism can both be useful in modern large systems.

SUMMARY
This paper introduces the Cache Resolver system, which applies the consistent hashing approach of previous work to caching URL content. It also contains an experimental comparison of Cache Resolver with existing approaches.

PROBLEM & SOLUTION
The problem of developing a truly effective distributed web caching system is obviously significant, since it is directly related to improving the performance of the World Wide Web; it is also studied in the other papers mentioned in the related work. The main goal of this study is to develop a system that outperforms the available alternatives in terms of cache miss rate, latency, and reliability. The authors adopt the consistent hashing approach of the previous work to accomplish this goal: consistent hashing determines where URL contents are cached, how those caches are reached from client browsers, and how a dynamically changing set of caches is handled.

CONTRIBUTIONS
Although it is not a direct contribution of this work, the idea of using consistent hashing in web caching is itself a novel and very significant contribution, since it provides a scalable hashing approach while preserving load balance. This paper is one of the pioneers in showing the applicability of consistent hashing. I realized that consistent hashing is also used in other caching applications, mainly, I believe, because of the guaranteed bound on the number of items that need to be remapped when a new cache arrives, in comparison with standard hashing.

I like the idea of placing the cache-location functionality inside the browser, since it spreads the computation of hash functions and location decisions among individual clients. This means less work and more flexibility for the DNS servers. Although they did not implement this approach, I believe the proposed solution is quite feasible and could easily be implemented today.

FLAWS
Overall, I like the paper and find it very significant for the reasons above. One issue that is not clear to me is how the system scales to heterogeneous caches (how is load balance handled?). If we want to add a new, faster cache, we may want to cache more URLs in it. It might be interesting to study this further: for example, one could change the strategy of assigning URL points to cache points (assigning each URL point to the clockwise-next cache point) so that faster or higher-capacity caches hold more pages than others.
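One way to realize the reviewer's suggestion (not something the paper does) is to give a faster or higher-capacity cache proportionally more points on the circle, so it ends up owning proportionally more URLs; the capacities below are made up for illustration.

    import hashlib
    from bisect import bisect_left
    from collections import Counter

    def point(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) / 16**32

    # Hypothetical capacities: cacheB is twice as fast, so it gets twice as
    # many points ("copies") on the unit circle.
    capacities = {"cacheA": 100, "cacheB": 200, "cacheC": 100}

    ring = sorted((point(f"{cache}#{i}"), cache)
                  for cache, weight in capacities.items()
                  for i in range(weight))

    def owner(url):
        i = bisect_left(ring, (point(url),))
        return ring[i % len(ring)][1]

    load = Counter(owner(f"http://example.com/{i}") for i in range(30000))
    print(load)   # cacheB should receive roughly twice the share of the others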

The testing is not very convincing because the experiments are run on only seven machines. Seven machines tell us very little about how well the system would perform on a network of thousands of machines with thousands of users, which I believe is what the WWW was experiencing in those days.

APPLICABILITY

Consistent hashing has proved to be an important contribution to caching, as it is used in real-world systems, especially distributed databases. Web caching, meanwhile, is still an important problem with much room for improvement. For example, large websites such as Facebook have been developing efficient caching systems on top of their underlying databases to load user profiles, photos, and so on in a reasonable amount of time. Although this work is not directly applicable to such settings, I believe the idea of hash-based caching to preserve load balance is still valid.

Web caching with consistent hashing

This paper describes a mechanism whereby web clients compute a hash function on URLs and use it to map each URL to a particular cache server. It is unique in that it is a completely distributed system that does not rely on inter-cache communication. Another positive aspect of this scheme is that it does not require significant configuration of clients, a feature any web caching scheme would need in order to be realistically deployed in a larger network.

There are some features of the hashing scheme that I believe could use more exploration. In general, a single web page at one URL refers to many other URLs (JavaScript, images, etc.). If the hash function maps a page and its contents to multiple machines, I believe an important performance enhancement made standard in HTTP/1.1 becomes useless. The paper and RFC 2616 (Hypertext Transfer Protocol - HTTP/1.1) were both published in 1999; chances are the two passed in the ether, otherwise I would have expected some discussion of persistent connections.

HTTP/1.1 made persistent connections the default behavior of any HTTP connection (section 8.1.2 of RFC 2616). With a persistent connection, more than one HTTP request can be pipelined through one TCP connection. That pipelining saves not only the overhead of the 3-way handshake to open the connection but also allows TCP to fully probe the congestion state of the network: TCP starts with a small congestion window, doubling its size for every acknowledgement, so a pipelined connection has the opportunity to open its congestion window to full size and use the network more effectively. If a page and the URLs it references are all mapped to different caches, persistent connections lose this advantage.

When operating in 'common mode' as discussed in section 4.2, a cache that misses will fetch the data from the original content server. A potential extension to this mode is for the cache to not only fetch the original document but also start prefetching the URLs required to render it. Adding this prefetching functionality, however, would require some of the very inter-cache communication this design avoids.

Poorly designed web pages could also break the consistent caching scheme. The URLs used to fetch identical data are not always unique: when posting form data or making an AJAX call, for instance, parameters may be passed in the URL. These parameters may not affect the content of the returned page, but they may cause the caching algorithm to eventually map the page to every cache server.
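One possible mitigation, not discussed in the paper, is to canonicalize URLs before hashing so that requests for the same content map to the same cache; the ignored parameters below are assumptions for illustration.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Parameters assumed (for this example) not to affect the returned content.
    IGNORED_PARAMS = {"utm_source", "utm_campaign", "sessionid"}

    def canonical_url(url):
        """Normalize a URL before hashing so equivalent requests hit one cache."""
        scheme, netloc, path, query, _fragment = urlsplit(url)
        netloc = netloc.lower().removesuffix(":80")   # drop the default port
        params = [(k, v) for k, v in parse_qsl(query) if k not in IGNORED_PARAMS]
        return urlunsplit((scheme.lower(), netloc, path or "/",
                           urlencode(sorted(params)), ""))

    print(canonical_url("HTTP://CNN.com:80/index.html?sessionid=42"))
    # -> http://cnn.com/index.html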

Section 5.1 (Locality) discusses how user latency is greatly influenced by the proximity of the cache servers; their solution is to select cache servers that are physically close to the client. That can be insufficient, because it is not geographic distance that matters but network 'distance'. In an admittedly contrived example, you could be sitting right next to a cache server but on a separate network, with the only connection between your network and the cache network on the other side of the country. Furthermore, it is not discussed how large these physical regions might be; if you have a number of cache servers in a region, they may well not be close to one another. Since a single URL could be on any of O(log C) cache servers, one might tweak the DNS resolver to map requests not only onto cache servers that are not busy but also onto cache servers that are close to the client in the network-distance sense, not the geographic one.

As a further complication, simply using the network distance between a potential set of cache servers and the client would not be enough in some cases. Say, for instance, the client is 10 ms away from cache server set A and 20 ms away from cache server set B; selecting by client-to-cache network distance would have us choose set A. But if A is 30 ms from B and we want to retrieve content from a server in the same location as B, selecting set A would increase latency rather than decrease it.

It is also unclear how caching would inter-operate with secure web transactions. Can you cache data transferred over a secure connection? Would you want to?

This paper is about a new low-overhead web caching technique that utilizes a hashing technique called consistent hashing.


The paper aims to design a web caching system with the following properties:
1. Eliminates server-to-server communication on a miss,
2. Balances the load on the cache servers,
3. Increases the hit rate of any incoming request (data locality),
4. Provides reliability by eliminating any single point of failure (central repository), and
5. Adapts to a dynamically changing set of active servers with minimal performance penalty.


The major contribution of the paper is that the authors achieve all of the above goals with the very simple idea of consistent hashing. Hashing the requested URL with a consistent, standard hash function on the client to decide which server to contact has many good consequences: 1. the responsibility is completely off-loaded to the client, 2. consistent hashing ensures load balancing, and 3. it incurs little or no overhead in a dynamically changing environment. It also ensures data locality of cached information on the servers, which is important for a high hit rate. Although this solution requires keeping the list of cache servers consistent on the client, the authors solved this problem (perhaps unintentionally) by abstracting away the exact cache servers behind virtual cache names.

On the other hand, though the authors explain the theory behind consistent hashing, they overlook some of the requirements for it to be effective at load balancing. In one of the extensions, supporting spatial locality of data, the authors claim to solve locality by routing requests to a subset of servers in the client's region. This seems to break some of the rules of consistent hashing by not obeying the choice of the hash function (a point on the unit circle). This might have been avoided if they had suggested applying consistent hashing recursively in a hierarchical fashion (state -> district -> city; following the unit-circle analogy, this would be a set of concentric circles). It is also unclear how the web caching system is bootstrapped with the cached data from the source servers.

The idea is partially applicable to the current scenario, where most of the load on web servers is off-loaded to client machines that are far more capable of handling it than at the time the paper was written. On the other hand, the solution is not scalable enough to handle the rapidly growing Internet: a few thousand (flat) virtual cache names would be insufficient to cache the WWW effectively.

This paper describes the design and implementation of a web caching system using consistent hashing. The proposed system differs from other web caching systems in that only a small fraction of resources need to move when new machines are added or existing machines go down.

The proposed system aims to solve the performance and scalability problems caused by the dramatic growth in the number of requests. Web caching is an important way to achieve good performance and scalability, but previous strategies involve too much inter-cache communication or suffer from a single point of failure.

Consistent hashing is a brilliant idea, and here it is introduced into a real web cache system. The key contribution of the paper is that it describes a feasible way to use consistent hashing and demonstrates its validity with tests. The authors implement consistent hashing at special DNS servers, which overcomes the problem of browsers not being able to load scripts that find cache servers dynamically, so that caches coming up and going down can be hidden from users. The paper also summarizes the consistent hashing theory paper; even though this paper is not the origin of consistent hashing, it presents consistent hashing in a way that is easy to understand.

One of the flaws of this paper is that it omits some important implementation issues. For example, it mentions that all the cache points can be stored in a binary tree so that finding the first clockwise cache machine takes O(log n) time, but it does not discuss how to store and maintain this tree. Is it stored on a centralized server or decentralized across many servers? If it is decentralized, how is it kept consistent across servers? According to the implementation section, the tree appears to be stored in a decentralized way on the DNS servers, but its maintenance needs further discussion. Another flaw is that in theory we generally assume the hash function spreads all keys evenly and randomly, whereas in practice the choice of hash function greatly impacts performance; this matters twice here, because both the resources (URLs) and the cache machines (and their copies) must be hashed onto the ring. How to choose a good hash function deserves more explanation. Also, consistent hashing does not seem able to handle range queries, since successive resources may be hashed to machines far apart.
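On the binary-tree question raised above, my reading (an assumption, not stated explicitly in the paper) is that each DNS server can simply rebuild the sorted set of cache points from its own current list of live caches, so no shared tree has to be maintained; a sorted array with binary search gives the same O(log n) clockwise lookup:

    import hashlib
    from bisect import bisect_left

    def point(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) / 16**32

    def build_ring(live_caches, copies_per_cache=10):
        """Rebuilt locally from this server's own view of the live caches."""
        return sorted((point(f"{c}#{k}"), c)
                      for c in live_caches
                      for k in range(copies_per_cache))

    def first_clockwise(ring, url):
        """Binary search for the first cache point clockwise of the URL."""
        i = bisect_left(ring, (point(url),))
        return ring[i % len(ring)][1]

    ring = build_ring(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
    print(first_clockwise(ring, "http://www.cnn.com/"))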

In all, consistent hashing is a very useful method for achieving load balancing, and several real distributed systems, such as Cassandra and Dynamo, use it today. But due to its limitations with range queries, other methods are also used to provide better load balancing.

This paper takes theoretical hashing work and applies it to web caching. The authors give good insight into how to handle many of the practical concerns that come up when dealing with a distributed cache. While the theory paper laid the groundwork for minimizing cache misses and duplicated data, this paper's primary focus is leveraging the advantages of a consistent hashing cache without breaking backwards compatibility with existing network protocols.

While the paper focuses on the use of consistent hashing in the World Wide Web, the techniques here are more generally applicable. Usually, if a client has a resource locator for some piece of data, the standard approach is to have the client contact a primary server that knows where to find the data. This paper's primary contribution is allowing a client with a resource locator to locate the data itself, without the need for an intermediate server.

While consistent hashing has very nice properties for performance and scaling, the authors struggled with how to leverage it using existing browsers. The paper is effectively split into two parts: an insightful discussion of using distributed hashing, and an explanation of a creative use of DNS and browser behavior. I am not sure how good it is for the paper to have this two-sided aspect. On one hand, it would be more coherent if consistent hashing were applied to a specialized system where the authors controlled the protocol and could fully leverage all of its advantages. On the other hand, because they did make creative use of the network to make this work on the Internet, the paper becomes more broadly interesting and directly applicable.

One particularly interesting result was the behavior of this cache with non-uniformly distributed access patterns (i.e., their CNN example). This is where the paper becomes somewhat unclear. If you take the 'specialized systems' perspective, you could argue that this behavior is not a flaw: their goal was to use consistent hashing in that system and they made it work, and in a specialized system without such spikes their design would do very well. If you take the other side of the paper, which sells this as a solution to all Internet caching, you could argue that the inability to handle non-uniform access is a flaw. They build mechanisms to handle it, but it seems to come down to simply using a different caching strategy for the popular data, which suggests they are overselling consistent hashing as a general solution to Internet caching.

I think that consistent hashing applied to a specific niche system would have yielded beautiful (though more limited) results. While it was interesting to see how this hashing can be applied to web content, one of the findings seems to be that two different caching mechanisms may together be the correct solution, only one of which is consistent hashing.

This paper discusses how web caching with consistent hashing can speed up web requests by letting alternate caching servers handle some HTTP client requests. Their caching procedure differs from existing techniques in that it places the client in charge of determining which cache to use, using tricks with DNS and client-side JavaScript. The scheme works by mapping each URL to a hash location in the [0,1] range and mapping servers into that same range. When the user requests a page, the client hashes the URL, maps it into this space, and contacts the closest server whose point is greater than the hash. The authors saw some pretty decent performance improvements using this style of caching in comparison to the regular caching of the time.

This paper overall was pretty good. It seems to be a precursor to the systems in place on the web today (CDNs such as Akamai, which was founded by MIT graduates around the time this paper was published). However, the paper does have some faults. The original idea seems somewhat flawed in that it is probably not the best setup to let the client select its own cache server (as originally suggested). If it is up to the client which cache server to select, what happens if malicious clients hit cache servers that were never intended to hold the hashed file request (or bypass the cache entirely and hit the main server)? This issue was not really addressed in the paper, and it seems that a small number of malicious clients could kill cache server performance by requesting files from cache servers that are not supposed to host that specific hashed file. Another interesting point they bring up is the inability of the client to actually select the proper cache server due to limitations of the browsers at the time. It is a bit unfortunate that this paper came out when it did, since technologies such as Asynchronous JavaScript and XML (known as Microsoft XMLHTTP at the time), which could have overcome this issue, were released around the time of publication. I do not fault the paper for not using this technology, but it would have been interesting to see how their original idea of having the cache selected entirely by the client compared to the DNS "hack" version they showed in this paper.

The paper discusses how web caching can be made scalable using the technique of consistent hashing. The authors observe the difficulties in other caching mechanisms such as directory-based, centralized, and cooperative caching systems. The main problems arising from such systems are: 1. they perform worse on a cache miss, when they try to figure out which cache to contact next; 2. they consume a lot of bandwidth and sometimes cause unwanted replication. When cache sizes are small, unwanted replication can cause false misses and lead to worse performance. Through consistent hashing, the load on each cache is balanced, the number of caches responsible for each URL is bounded, and when a new cache is added to the system there is little invalidation or relocation of the existing cached copies.

To adopt consistent hashing, the authors use a combination of client-side JavaScript and the DNS resolver that a web browser usually talks to. The JavaScript first maps the URL to a virtual cache name, taking into account the geographical location of the user. When the browser makes the DNS request, a top-level DNS resolver first resolves the location part of the virtual cache name and directs it to a bottom-level DNS resolver, which then resolves the URL part using consistent hashing over the set of caches responsible for that URL in that particular region. As an extension, they also propose that a single virtual cache name can be resolved to multiple caches in round-robin fashion by the bottom-level DNS resolver to serve 'hot' pages. The paper also identifies that different clients can have different views of the caches and argues bounds on the replication factor and the load on each cache. For browsers whose DNS resolution can occasionally go stale because recently made DNS resolutions are cached, they provide a simple workaround.

The paper relies primarily on the DNS resolver to enforce the consistent hashing. Though this makes the DNS resolver a single point of failure, the authors rationalize it by arguing that the DNS resolver is already part of the system and their mechanism only piggybacks on its existing use. This sounds reasonable, and the idea seems to have caught on and is still actively used. But the user would be forced to use a particular DNS resolver in that case, whereas today a user can configure the browser to contact a DNS server of their choice (OpenDNS, Google DNS, the DNS provided by the ISP). In such cases, the DNS can redirect the request to a local load balancer, based on the user's location, which can then perform the function of the bottom-level DNS resolver in their system.

The main contribution of this paper is using a simple concept like consistent hashing to solve one of the critical problems of the Internet boom. Due to its simple yet effective nature, it is still actively used in various forms wherever load balancing needs to be achieved across several resources that can potentially provide the same service.

This paper talks about a distributed web page caching system where a pseudo-distributed naming service using DNS is used to map web pages to proxy caches. The system is aimed at reducing the duplication of content so as to reduce capacity misses.

The state-of-the-art distributed web caching systems at the time of this paper were either hierarchical or directory-based. These caching schemes introduced one or more of the following problems:
- There were multiple copies of the content, increasing capacity misses
- The multicast and broadcast mechanisms required by the caching systems consumed bandwidth
- The NACKs required by certain schemes increased the latency of a cache miss
- Directory-based schemes formed a central point of failure

Contributions and key features:
- The paper aims to distribute content as evenly as possible among cache servers. It accomplishes this using consistent hashing: every cache and every web page is a point on a unit circle, and every web page gets mapped to the cache nearest to it in clockwise order. In the absence of churn in the number of caches, each page gets mapped to a unique cache.
- Advantages of this scheme are:
- There is only one miss per page - no latency increase while waiting for cooperating caches to reply.
- There is no multicast - saves bandwidth
- There is no directory consultation and invalidation - this means that there is no maintenance overhead and no single point of failure.
- The distribution is content based rather than need based - a cache's contents cannot be purged by other clients' request patterns.
- The key trick of the paper is that, instead of using a centralized directory to do the mapping, it delegates the job of mapping to the client browser and the DNS, creating a pseudo-distributed registry service.
- Another nice thing about the approach is how gracefully it handles churn in the number of caches. If caches are added to (or removed from) the system, an ordinary mod-based hash function may completely change the existing mapping, redistributing possibly all of the content, whereas consistent hashing redistributes only a small portion.
- The idea of consistent hashing has two implications:
1) pages are mapped to caches uniformly, and each page gets mapped to only a few caches (ideally one);
2) minimal relocation of content on cache churn.
This has led to probably the biggest win of this paper - reducing capacity misses by having as few copies of an item as possible, while still conserving bandwidth, reducing miss latency, and keeping the system distributed.
- The client's browser statically maps a URL to a set of virtual caches. The browser then selects one of these and sends out a DNS query. The DNS server maps the virtual cache to a physical cache selected, via consistent hashing, from a dynamically changing set of physical caches.
- They propose extensions to their work.
- Duplicating the servers based on geography and having the DNS resolve to local servers.
- They load balance the content through consistent hashing and load balance requests by mapping hot pages to a set of caches (instead of one) and letting the dnshelper round-robin among those caches during cache resolution (see the sketch after this list).
- They achieve fault tolerance by returning a list of virtual caches for a URL.
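As a sketch of the hot-page round-robin mentioned above (the names and addresses are hypothetical; the real system relies on BIND's built-in round-robin inside DNS rather than application code), a hot virtual cache name simply resolves to a rotating list of physical caches:

    from itertools import cycle

    # Hypothetical mapping from a "hot" virtual cache name to several caches.
    hot_virtual_names = {
        "vcache417.midwest.cache-resolver.example":
            cycle(["10.0.0.11", "10.0.0.12", "10.0.0.13"]),
    }

    def resolve(virtual_name):
        """Return the next cache IP for a hot name, spreading its load."""
        return next(hot_virtual_names[virtual_name])

    for _ in range(4):
        print(resolve("vcache417.midwest.cache-resolver.example"))
    # 10.0.0.11, 10.0.0.12, 10.0.0.13, 10.0.0.11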

Flaws:
The paper, as such, doesn't have major flaws, though I was uncomfortable with a few things:
- The paper claims to have eliminated a single point of failure by using DNS for cache resolution. Although DNS is a distributed name service, their design is conceptually not very different from CRISP, which has dedicated name servers. Though the DNS will not technically act as a single point of failure, the design is still not fully distributed; a perfectly distributed design would place the name service in the clients (browsers). However, the paper does, in a way, admit that this would be the ideal design and that they did not go for it due to technical difficulties.
- They could've incorporated the extensions into their experimental evaluation. Without the extensions, the key tradeoff for having few copies of contents is the increasing load on the server. Certain characteristics like server load and miss latency, in the presence of the extensions, could have been studied.
- The experimental setup was rather small - it would have been nice to explore if consistent hashing scaled well with the number of cache servers (monitoring them and actually doing the hashing).

This paper is very much applicable to today's proxy caching scenario: web caching is still done by proxy servers and DNS is still the same, so the unmodified idea remains valid today. The system could also be made purely distributed by having the browsers do the cache resolution themselves. I suspect consistent hashing could be (and probably already is) used in the majority of web-scale key-value stores in use today.

The paper proposes a novel web caching method based on consistent hashing that improves the latency of content delivery for web users.

Motivation:
The steep increase in the number of web users leads to failures and delays in content delivery because of network congestion and server swamping. Web caching is a strategy often used to address this problem. Existing web caching techniques, such as cooperating caches with multicast queries and directory queries, have several shortcomings: additional bandwidth consumption for broadcast queries, time delays waiting for cooperating caches, a single point of failure with directories, and duplication of data.

Summary:
The paper gives a clever solution to this problem by proposing a simple yet effective strategy in which the browser chooses the appropriate cache for a given URL directly, based on consistent hashing. This saves the time that would otherwise be spent contacting two layers of caches and also reduces network congestion. Though this could be achieved with any standard hashing technique, the need to reduce the number of redundant copies maintained due to the asynchronous behavior of the network motivates consistent hashing, where only a few URLs get remapped whenever a cache is added to the system. Because browsers lack the support needed to perform consistent hashing in a dynamic caching environment, the system enlists the help of DNS to do the mapping.

Contributions:
1) The major contribution of the paper is the decision to push the responsibility of choosing the appropriate cache to the client instead of depending on a primary cache, which not only improves latency but also avoids much of the bandwidth consumption and complexity associated with multicasts and directory queries.
2) Using consistent hashing instead of standard hashing to reduce redundancy.
3) The clever idea of generating virtual names and having the DNS take care of mapping them to actual cache IP addresses.
4) Embedding location information in the virtual names and having a two layer DNS hierarchy to enable locality.

Merits and Limitations:
The biggest merit of the paper is its simplicity. There are a few limitations to the basic design of mapping every URL to just one cache server to avoid duplication. One obvious drawback is that a hot page with a large number of hits will swamp the cache server holding that page and also cause network congestion. Another is that when a cache fails, all of its requests get directed back to the original server, again causing swamping and congestion. The paper does address these issues with techniques such as mapping hot pages to more than one cache server (by the DNS) and generating more than one virtual name (by the browser script). However, the proposed way of identifying hot pages seems a little complicated and may cause more duplication than desired, as it replicates all the pages on a heavily loaded cache server to all other cache servers. There are no major flaws as such, in my opinion.

Overall, the paper is well written focusing on a simple yet effective strategy that improves hit rate and reduces network congestion and duplication. The authors have done a good job at identifying the possible limitations and addressing them.
