The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors
Thomas Anderson, Edward Lazowska, and Henry Levy. The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors. IEEE Trans. on Computers 38(12), December 1989, pp. 1631-1644.
Reviews due Thursday, 10/2.
Comments
Summary:
This paper discusses methods for synchronizing large numbers of threads on shared-memory multiprocessor machines. It also presents benchmarks for those methods and compares them to a new spin-lock algorithm modeled on Ethernet's back-off scheme.
Problem:
Synchronization overhead eats into the possible performance gains of employing multiple processors. In choosing a synchronization method, the choice is often between latency and throughput.
The thought in the late 1980's and early 90's was that the road toward greater processing power would be more processors rather than faster ones. While that didn't hold at the time, twenty years later it is now true.
Work Summary / Contributions:
This paper describes the pros and cons of various locking schemes at various levels of granularity, ranging from global data structures protected by a single lock all the way to independent data structures for each processor.
The paper notes that while spin-locks waste CPU cycles waiting to acquire a lock, this can be cheaper than the context switches induced by blocking. The paper then goes on to analyze various methods of spin-locking.
The simplest method is test-and-set, but this induces bus congestion. To lessen the congestion, the authors implement an Ethernet-style back-off algorithm as an alternative (where back-off times are randomized). They discover that, of all the algorithms surveyed, this scales best.
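To make the idea concrete, here is a minimal C sketch of a test-and-set spin lock with randomized exponential back-off. This is an illustration rather than the paper's code: the C11 atomics stand in for the Symmetry's xchgb instruction, and the delay constants are arbitrary assumptions.

    #include <stdatomic.h>
    #include <stdlib.h>

    #define MAX_DELAY 1024            /* cap so back-off cannot grow without bound */

    typedef struct { atomic_flag held; } spinlock_t;

    static void spin_lock_backoff(spinlock_t *l)
    {
        unsigned delay = 2;
        /* atomic_flag_test_and_set plays the role of the Symmetry's xchgb. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
            /* Failed: back off for a random time proportional to the number of
             * past failures, so colliding processors do not retry in lock step. */
            unsigned pause = (unsigned)rand() % delay;
            for (volatile unsigned i = 0; i < pause * 64; i++)
                ;                     /* idle spin, generates no bus traffic */
            if (delay < MAX_DELAY)
                delay *= 2;
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }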
Disadvantages:
As with Ethernet, the back-off algorithm is not fair. Newly arriving threads are favored over those already waiting: after each failed attempt a waiter increases its back-off interval, so long-waiting threads retry less and less often and can be starved by fresh arrivals. Additionally, the algorithm provides no mechanism for priority.
Flaws:
The benchmarks run for this paper are purely controlled tests using null threads, which may not represent real-world (non-null thread) scenarios. Further, the authors test their algorithm on only a single shared-memory multiprocessor (the Sequent Symmetry), so some results may be artifacts of that machine. Additionally, while this is not a flaw in the paper, it should be noted that a large component of some of the benchmarked speeds involves bus overhead, and bus-to-processor speed ratios have changed since the paper was written. It would be interesting to see these benchmarks run again on modern hardware.
Posted by: Dale Emmons | October 7, 2008 07:48 AM
Summary
In The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors, Anderson et al. describe five different approaches to organizing thread data structures. They then proceed to highlight trade-offs made by these techniques by measuring their performance under a controlled workload. In addition, three different methods for acquiring locks were weighed against each other, including a new technique, exponential backoff.
The Problem
In 1989, threading was still a relatively new feature with many competing, fundamentally different implementations. This paper painstakingly enumerates these differences and then proceeds to weigh the pros and cons of each. In particular, an effort is made to characterize the latency/throughput trade-space that different approaches to threading fall within. Scalability concerns for the different techniques are also highlighted.
Organization Discussion
To intelligently discuss the differences between different threading architectures, the authors first highlighted a number of thread properties: the program counter, stack, and control block. They then noted the following “central data structures” used to control threads: the ready queue, stack free list, and control block free list.
The authors also explain the different options on the table with respect to instantiating these structures. One can place a single lock on the central data structures, or use separate locks for each structure. The central data structures can be migrated to individual processors. An idle processor queue can be maintained.
The authors also discuss how the Sequent Symmetry Model A's XCHGB instruction can be used to obtain and release locks. One can issue this instruction repeatedly, waiting for it to succeed; issue it once and take advantage of coherent caches to notify a processor of its success; or issue it repeatedly and randomly back off upon failure.
Results
Single-lock methods limit throughput but keep latency low. Finer-granularity locking on the core data structures improves throughput at the cost of added latency. Maintaining a copy of the core data structures on each processor improves both throughput and latency, but at increased cost in terms of total memory used and software complexity. Idle processor queues turn the “processors searching for work” paradigm on its head, decreasing latency when processors are not overburdened but increasing it when they are.
The amount of user computation per thread timeslice changes the speedup seen by different approaches to storing the central data structures. Not surprisingly, for large timeslices, the speedups are roughly the same.
Exponential backoff is shown to scale better than simple spin-locking (in Figs. 6 and 8) for a simple counter-incrementing workload. Unfortunately, it is difficult to see exactly how this technique scales past 11 processors, when it starts to take a significant performance hit. This may be due to the limited hardware available to the authors, or possibly because the trend is linear and will eventually degrade more than spin-on-read locks. It would also have been interesting to see the backoff tested on a more sophisticated problem.
Posted by: James Jolly | October 7, 2008 07:46 AM
The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors
Summary
The paper presents, analyzes, and evaluates (both experimentally and analytically) various design alternatives for thread management and spin-lock management, assuming threads are used to take advantage of fine-grained parallelism. It also brings in innovative optimizations in some of the alternatives, such as exponential backoff while trying to acquire a lock and using an idle-processor list so that work goes in search of a processor.
Description of the problem being solved
The problem is to analyze various design alternatives for thread management and spin-lock management, and thereby to implement a fast thread package that exploits fine-grained parallelism.
Contributions of the paper
1) Observing that Spin-waiting can delay not only the processor waiting for a lock but also other processors doing work because of bus contention.
2) Reducing costs of spin-waiting by ethernet style back-off.
3) Using per-processor data structures to avoid locking and improve latency.
4) Using queuing models to analyze the effect of several factors on the performance of Thread management.
Flaws in the paper
1) The paper does not discuss how thread priorities would be handled in the various alternatives. But I think thread priorities are not critical in the context of fine-grained parallelism.
2) In the exponential-backoff alternative for spin-lock management, there is a possibility of indefinite starvation, which is not handled.
3) Fairness is not discussed at all (it is not even mentioned why it is not considered) in the local-ready-queue alternative for thread management.
What is made easier?
Taking advantage of fine grained parallelism in applications using Threads is made easy and efficient with the fast Thread library.
What is made harder?
Possibilities of indefinite starvation during exponential backoff, and the lack of fairness during scheduling with local ready queues, might keep some specific applications from taking advantage of fine-grained parallelism (I cannot think of any specific application though).
NOTE: I think the paper is sufficiently detailed and thorough in the context of fine-grained parallelism. For example, priorities between threads might not make sense in the context of fine-grained parallelism, and maybe that is the reason why this was not considered.
Posted by: Leo Prasath Arulraj | October 7, 2008 07:29 AM
Summary
The paper proposes several thread management alternatives for shared-memory multiprocessors and measures their performance. Based on these measurements, it concludes that per-processor data structures can be used to improve throughput significantly. Additionally, the paper proposes the Ethernet-style backoff algorithm to reduce bus traffic caused by spin-waiting.
Problem attempted
The objective of the paper is to assess the performance of various thread management alternatives for shared-memory multiprocessors. In addition, the paper measures the performance of various methods used by processors to queue for locks.
Contributions
1. When all the data structures for thread management are protected by a single lock, throughput is reduced significantly because most of the thread management path is serialized. In contrast, having a separate lock for each central data structure increases throughput, as thread activity is split across several locks. But having separate locks increases the number of locks to be acquired and hence increases latency.
2. Avoiding locks for free lists, by having local free lists and maintaining just a central ready queue, improves both latency and throughput compared to the previous alternative. Latency is brought down because fewer lock acquisitions are required per thread, and throughput is increased because only accesses to the ready queue are serialized. The trade-off here is that one processor could have an empty free list and request additional space from the heap while another has many entries in its free list. To ensure reuse of deallocated structures, the paper proposes maintaining a global pool of free entries and a threshold T on the maximum size of a free list. Free lists exceeding the threshold return half their entries to the global pool, and when a free list empties it gets T/2 entries from the pool (a sketch of this balancing rule appears after this list).
3. Having a central queue of idle processors brings down latency only if there are idle processors, and increases latency if all processors are busy. Throughput remains the same, since there are in any case two lock acquisitions involved in handling the idle queue: one by the idle processor to enqueue itself, and one by the creating processor to dequeue it.
4. The paper observes that, even with local free lists, the lock on the ready queue limits throughput to a few concurrent thread operations. So the only way out is to have per-processor ready queues. But it is important to avoid an idle processor sitting with an empty ready queue even when there is just one thread in some other ready queue. Locking each ready queue and making idle processors scan the other queues is one way to avoid this.
5. Processors that are busy-waiting (spinning) for a lock can slow down processors doing useful work due to bus contention. One way to reduce this impact is to spin reading the lock's memory location instead of spinning on test-and-set instructions. Unlike test-and-set instructions, reading the lock's memory location is done in the cache and does not increase bus traffic. Though there is an initial burst of overhead when all processors invalidate their cache copies after failing on a test-and-set instruction, there is no overhead from then on, and hence longer critical sections result in better performance.
6. Instead of processors continuously trying to acquire a lock, they can assess the probability of getting a lock based on experience and try only when the probability is high. Apart from being unfair, it takes more time for a spinning processor to acquire a newly released lock in this mechanism. Besides, when there are a number of spinning processors, bus could still be populated with cache misses of a spinning processor.
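As a rough illustration of the balancing rule in point 2, a per-processor free list might be managed as in the following C sketch. The names, list representation, and locking helpers are my own assumptions for illustration, not the paper's code.

    #include <stddef.h>

    #define T 16   /* threshold on the local free-list size */

    struct block { struct block *next; };

    struct freelist {
        struct block *head;
        size_t        count;
    };

    /* The global pool is protected by a spin lock (details omitted). */
    extern struct freelist global_pool;
    extern void pool_lock(void);
    extern void pool_unlock(void);

    static void local_free(struct freelist *local, struct block *b)
    {
        b->next = local->head;
        local->head = b;
        if (++local->count > T) {
            /* Over threshold: return half the local entries to the global pool. */
            pool_lock();
            for (size_t i = 0; i < T / 2; i++) {
                struct block *x = local->head;
                local->head = x->next;
                local->count--;
                x->next = global_pool.head;
                global_pool.head = x;
                global_pool.count++;
            }
            pool_unlock();
        }
    }

    static struct block *local_alloc(struct freelist *local)
    {
        if (local->head == NULL) {
            /* Empty: refill with up to T/2 entries from the global pool. */
            pool_lock();
            for (size_t i = 0; i < T / 2 && global_pool.head; i++) {
                struct block *x = global_pool.head;
                global_pool.head = x->next;
                global_pool.count--;
                x->next = local->head;
                local->head = x;
                local->count++;
            }
            pool_unlock();
        }
        struct block *b = local->head;
        if (b) { local->head = b->next; local->count--; }
        return b;   /* NULL means the caller must fall back to the heap */
    }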
Flaws
The paper doesn't clearly explain the significantly better performance of the Ethernet-style back-off mechanism compared to the spin-on-read mechanism, in spite of the drawbacks it mentions for the back-off mechanism. Though the spin-on-read mechanism comes very close to the back-off mechanism when the critical section grows bigger, its processing rate is about half that of the back-off mechanism for small critical sections.
What it makes easier/harder
Since the purpose of the paper is to propose various alternatives and measure their performance, there is nothing specific that is made easier/harder.
The authors point out that thread management routines must be kept simple. Since only about 100 instructions are present in the thread management path, adding even a few instructions impacts performance. Thus, for example, even assuming the idle queue is typically empty, the complication involved in maintaining an idle queue of processors markedly increases overhead. Hence the point driven home is that any added complexity in the thread management routines is not going to pay off unless there are enough processors.
Posted by: Balasubramanian Sivan | October 7, 2008 07:09 AM
The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors
This paper discusses the latency and throughput of different thread management policies for shared-memory multiprocessors. Five different thread management options are compared: A) single lock, B) multiple locks, C) local free list, D) idle queue, and E) local ready queue. The tradeoffs between these alternatives are examined and then measurement results are shown for synthetic benchmarks. An Ethernet-based algorithm to queue for locks is also presented as an alternative to spin locks.
The purpose is to compare the performance of five different thread management policies in multiprocessor systems with shared memory; the focus is latency and throughput. Different alternatives to spin-lock are studied to improve performance when processors are queuing for locks.
Thread management on multiprocessors is still an important issue nowadays. Multiprocessors can have from two cores to hundreds of cores; the alternatives proposed exhibit different performance depending on the number of processors available. Depending on the system, one or the other will be a better choice.
This paper proposes to allocate stack space to a thread only when it is started; until then, only a control block is assigned to the thread. The authors also propose keeping lists of free control blocks and stacks; this simplifies the allocation and deallocation of space for threads.
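A hedged C sketch of that idea (the structure and function names are invented for illustration, not taken from the paper): thread creation takes only a control block from its free list, and the stack is attached when the thread is first dispatched.

    #include <stddef.h>

    struct thread {
        void  (*entry)(void *);
        void   *arg;
        void   *stack;        /* NULL until the thread is actually started */
        struct thread *next;  /* link for the ready queue */
    };

    extern struct thread *cb_alloc(void);     /* from the control-block free list */
    extern void          *stack_alloc(void);  /* from the stack free list */
    extern void           ready_enqueue(struct thread *t);

    struct thread *thread_create(void (*entry)(void *), void *arg)
    {
        struct thread *t = cb_alloc();  /* cheap: no stack allocated yet */
        t->entry = entry;
        t->arg   = arg;
        t->stack = NULL;
        ready_enqueue(t);
        return t;
    }

    void thread_start(struct thread *t)
    {
        if (t->stack == NULL)
            t->stack = stack_alloc();   /* deferred until first dispatch */
        /* ... switch to t->stack and call t->entry(t->arg) ... */
    }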
As a way of reducing the number of locks that need to be acquired, more state can be kept in each individual processor: the processor then does not need a lock to access these local data structures, because no other processor is going to access them. Local free lists and local ready queues are alternatives that exploit this idea. Some mechanism for balancing needs to be used with these thread management alternatives.
A single lock seems the most reasonable alternative for very few processors; multiple locks do not perform very well on few processors. Local ready queues are the best alternative when the number of processors increases. Multiple locks, local free lists, and the idle queue are better alternatives than a single lock, but they reach a limit when the number of processors increases. Using an idle queue is not useful when there are more runnable threads than processors.
Spin-waiting on locks consumes bus resources and slows down the processor that owns the lock. Ethernet backoff is a mechanism based on probabilities; it has proved to be a reasonable solution for other problems. Performance is improved with respect to plain spinning, but at the cost of fairness. This may not be suitable for some applications.
The number of processors considered for the experiments goes up to 18. This is a rather small number and does not show how these management alternatives would scale to some of the massively parallel computers available nowadays: local free lists and the idle processor queue give reasonable performance for around 13-15 processors, but it seems that performance decreases when the number of processors grows above that. The benchmarks used to measure performance are very simplistic and unrealistic. The Ethernet backoff mechanism proposed for queuing for locks is not fair; starvation can occur for a processor trying to get a lock, and no solution is presented for this problem.
This paper presents different thread management alternatives. For all of them it is shown that the more complex the mechanism, the higher the latency, but more complex mechanisms can provide higher throughput; this is why they are only appropriate for larger numbers of processors.
Posted by: Paula Aguilera | October 7, 2008 04:06 AM
Summary
The paper discusses the implications of alternatives for thread and lock management, specifically with regards to a multiprocessor system.
Problem
At the time this paper was written, threads had become a common feature of languages and operating systems. Differences in methods of thread management that seem small can have a large impact on performance (throughput and latency), and the authors wish to determine the nature of this impact. This is especially true for multiprocessors.
Summary/Contributions
The authors explain several alternatives for thread management, several alternatives for lock management, give experimental results for each, and provide analytical explanations that appear to do a good job of explaining the results.
Five thread management alternatives are presented. Single lock is as it sounds – one lock protecting the central data structures. The multiple locks approach provides a separate lock for each data structure. The remaining three are multiprocessor approaches. Each provides per-processor free lists of control blocks and stacks. The local free list approach provides a central locked ready queue. The idle queue approach provides a central queue of idle processors. The local ready queue approach provides a per-processor ready queue. The results show that the local ready queue achieves relatively greater throughput, and that this becomes more pronounced as the number of processors increases. However, as is to be expected, the more complex methods have greater latency.
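For the local ready queue alternative, the essential shape might look like the following C sketch. The queue and lock primitives are assumed rather than taken from the paper; the point is that the common case touches only the processor's own queue, while an idle processor scans the others so runnable work is never stranded.

    #include <stddef.h>

    #define NPROC 16

    struct thread;

    struct ready_queue {
        /* each queue has its own lock; locking details omitted */
        struct thread *head;
    };

    static struct ready_queue ready[NPROC];

    /* Dequeue under the queue's lock; returns NULL if the queue is empty. */
    extern struct thread *try_dequeue(struct ready_queue *q);

    static struct thread *find_work(int my_cpu)
    {
        /* Fast path: my own queue, no contention with other processors
         * in the common case. */
        struct thread *t = try_dequeue(&ready[my_cpu]);
        if (t)
            return t;

        /* Slow path: scan the other processors' queues so a runnable
         * thread cannot be stranded while this processor sits idle. */
        for (int i = 1; i < NPROC; i++) {
            t = try_dequeue(&ready[(my_cpu + i) % NPROC]);
            if (t)
                return t;
        }
        return NULL;   /* nothing runnable anywhere */
    }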
Three lock management alternatives are also presented. Spin on xchgb (test-and-set) has the drawback of consuming bus resources on each xchgb, so this is addressed with the spin on read method. In this method spinning is done in the cache; the cached copy is invalidated after the lock is released. The Ethernet-style backoff method tries to estimate the likelihood of success in acquiring a lock and tries only when likely to succeed. This brings good performance but at the cost of fairness: if you’ve been failing, you’re presumed likely to keep failing, and the less often you will try for the lock.
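A minimal C sketch of the spin-on-read idea (my illustration, with C11 atomics standing in for xchgb): waiters spin on an ordinary load that hits in their cache, and retry the atomic exchange only after the release invalidates those cached copies.

    #include <stdatomic.h>

    typedef struct { atomic_int held; } rlock_t;

    static void rlock_acquire(rlock_t *l)
    {
        for (;;) {
            /* Attempt the atomic exchange only when the lock looks free. */
            if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
                return;
            /* Spin on a plain read: this loops in the cache and generates no
             * bus traffic until the holder writes 0 and invalidates the line. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
                ;
        }
    }

    static void rlock_release(rlock_t *l)
    {
        atomic_store_explicit(&l->held, 0, memory_order_release);
    }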
One question is how the experimental results relate to actual usage. For example, in Fig. 7, we see the performance of the spin-on-read method approach that of the backoff method as the size of the critical section increases. Is it then worth it to trade fairness for a potentially small increase in performance? This seems to merit further consideration. In many ways this paper is simply a confirmation that choices with regards to thread management have performance implications, and that, generally, this takes the form of trading better latency (and simplicity) for better throughput. The authors, however, do a good job throughout the paper of being upfront with the potential downsides of the methods presented – even their favored methods.
Posted by: Sam Javner | October 7, 2008 03:51 AM
Anderson, Lazowska, and Levy make two valuable contributions in "Performance Implications..." The first of these is a characterization of thread-management techniques employed in multiprocessor systems, where the distinguishing feature is how the queues are distributed relative to the processors in the system. Having this, they are able to reason intelligently about the comparative performance effects between them. The second contribution made is an optimization: the use of binary exponential back-off in the bodies of spin locks theoretically improves some of the worst-case effects of other, simpler spinning methods. The measured results confirm this effect and also confirm the expected effects of caching hardware on such performance measurements.
A characterization of thread management techniques empowers the system designer to make a better system: With an improved understanding of architecture-dependent effects in the presence of locking, pathological performance can be explained and its improvement becomes a tractable endeavor. The authors recognized this connection, and were themselves able to imagine the positive effects of exponential back-off from an understanding of cache coherence, in particular. The discussion given here also makes concepts like, "Processors seeking work," or, "work seeking processors," available to system implementors with solid documentation of their implications on a system. Just as design patterns expand the vocabulary of the application programmer, such a characterization promises to improve the system programmer's capacity to express the features he implements.
The authors' evaluation of the back-off technique is lacking, as it does not quantify the effect of the optimization on "real" code, but only on controlled experiments. It's also unclear whether their expectations of threads' behavior match what is common in modern software. The concern over latency, in particular, is something that I find strange: in my own experience, threads are only very infrequently created or destroyed in large numbers. This, together with very fast processors, makes setup and teardown cost entirely negligible compared with the overhead of scheduling and cache effects during the thread's execution. Figure 7 demonstrates, furthermore, that spinning over a cached read is virtually just as good as back-off for non-trivial critical sections. All this does not deny that back-off has good properties, but it calls into question whether those properties are actually important to the world of concurrent software.
There is a downside to binary back-off which presents an opportunity for related improvements. In the (possible, but perhaps unlikely) event that many threads are terminated simultaneously, back-off will cause idle time on processors, and a spin-on-read lock would be more efficient. Since the system has knowledge of threads enqueued on and dequeued from run queues, it would be possible to detect the onset of this condition and make a policy switch to agree with it. Any negative performance effects of such a switch would be worth exploring. The broadest impact from this paper, however, is to be had by the system programmer seeking to explain and improve upon the performance of his system.
Posted by: Tack | October 7, 2008 02:34 AM
Summary:
The paper presents an evaluative view of some options on how
1. thread-related data structures can be organized
2. spinning in wait for locks can be accomplished
efficiently on shared-memory multiprocessors. It argues, based on the analysis, that decentralized maintenance of data structures and Ethernet-style backoff for spinning on locks are the best choices.
Problem the paper addresses:
The paper addresses the overheads associated with multi-threading, particularly in the context of shared-memory multiprocessors. There are two major overheads involved: one is the creation and maintenance of thread-related data structures - stacks, control blocks, and queues - and the other is bus contention arising from waiting for locks. For both cases, the paper analyzes the optimality of the various approaches available.
Contributions:
- Awesome analysis of how various design choices affect the performance of threads. The case-by-case analysis gives a good explanation of how throughput and latency are affected. The tradeoff between throughput and latency in design choices is highlighted clearly.
- Initializing the stack only when the thread is actually started is a cool idea, as it reduces the memory requirements.
- The paper pushes the Ethernet-style back-off algorithm for spinning on locks. This works out nicely when the number of processors increases, as bus degradation due to contention is minimized (however, please refer to flaw no. 2).
Flaws:
- As many of my classmates have pointed out, the evaluations presented in the paper are not based on real workloads. Contention could be much lower with real workloads, as the "work" that the threads do would normally require a variable amount of time.
- The paper highlights the Ethernet-style backoff algorithm a lot, but I felt it's no big deal. One reason is the flaw mentioned above. Another is that Ethernet-style backoff makes more sense when there are many processors; but with many processors, a shared-memory multiprocessor would make less sense, as memory quickly becomes a bottleneck.
- Again, the Ethernet-style backoff has a wrong notion of preference in that it favors a newly arrived job over one that has been waiting a while (potential for starvation).
- with the recent trend towards multiple cores on a single chip, I wonder if cache invalidation etc will be so expensive as to be a major design concern anymore.
Organization:
Although the paper mentions many approaches available for maintaining data structures and for spin locking, the bias is towards decentralized maintenance of data structures, and towards the exponential backoff algorithm for spinning on locks.
what it makes easier:
- Depending on the number of processors in your system, you can choose the approach to data-structure maintenance and spin-waiting on locks that best suits you. Hence, you can more easily achieve high throughput and low latency.
what it makes harder:
- enforcing thread priorities might be tough with greater decentralized maintenance of lists. When I find a thread in another processor's queue, how do I know that is the pending thread of highest priority?
Posted by: Varghese Mathew | October 7, 2008 02:25 AM
Summary:
In this paper the authors describe algorithm alternatives for thread management in shared-memory multiprocessors. They compare the performance of several locking techniques and spin-lock management alternatives and present their advantages and drawbacks.
Problem:
Threads have been widely used as they are cheaper to create and maintain than processes. On uniprocessors, no locking is needed since only one routine can be executing at a time. In multiprocessor systems where threads access shared data structures, locking is necessary. The authors’ purpose is to analyze several locking techniques and compare their performance. More specifically, they are most interested in the impact on throughput and latency.
Proposed organization:
The data structures used by each thread are a program counter, a stack, and a control block. The control block contains the state information needed for thread management. Deallocated control blocks and stacks are stored in free lists. Finally, there are three central data structures: the ready queue, the stack free list, and the control block free list. To lock these structures we can use the following techniques: a single lock; multiple locks; per-processor free lists and a central locked ready queue; a central queue of idle processors plus free lists; or per-processor ready queues and free lists. As far as spin-lock management is concerned, we can use spin on xchgb, spin on read, or Ethernet-style backoff.
Contributions:
In my opinion, the main contribution of the paper is that it shows that small differences in management can have a significant performance impact. For example, by using one lock for all the shared data structures, throughput is limited. By using multiple locks, one for each shared data structure, throughput is increased, but as more lock accesses are needed, latency is also increased. As we can see from the results of the experiments, there is often a tradeoff between throughput and latency. In the paper five alternatives are discussed. Each alternative gives different results on different workloads. An important contribution of the paper is that it showed that the use of per-processor data structures leads to better throughput. Another contribution is the Ethernet-style backoff approach. Instead of using spin on xchgb or spin on read, we avoid the problem of bus contention by having each processor estimate its likelihood of success before trying the lock. If the probability is low, the processor doesn’t try the lock. When a lock is released, each processor waits a period of time dependent on its number of past failures. That’s very important, as a processor spinning on a lock affects the other processors too. Finally, the paper presents a nice way of modeling a system for an experiment.
Flaws
In my opinion, a flaw of the methodology is that null threads are used to evaluate the locking techniques. I believe that the authors should have used a real workload. As far as the Ethernet-style backoff approach is concerned, the main problem is that it is not fair. A processor that has just arrived is more likely to acquire the lock than one that has been waiting and failing for some time.
What it makes easier
By using the locking techniques that use per-processor data structures, we can improve throughput without increasing latency dramatically. Moreover, the use of the Ethernet-style backoff approach can lead to better performance, as processors that do useful work won’t be affected by a processor that spins on a lock.
What it makes harder
By using the backoff approach, there is the possibility of indefinite starvation, as the algorithm is not fair.
Posted by: Avrilia Floratou | October 7, 2008 01:50 AM
Summary: This paper explores the design space for thread management in shared-memory systems, as well as various influencing factors. For fine-grained threads they show that data structures and simple algorithms make a big difference in overall performance. They point to the tradeoff between latency and throughput and offer a backoff algorithm for resource contention.
Problem: Processes are too expensive for exploiting fine-grained parallelism. Threads have much cheaper context switches, but naive management still pays a high overhead price. There are many factors that have to be considered, from high-level issues such as preallocating data or scheduling down to hardware details such as processor count and the influence of cache coherency on locking primitives.
Contributions: There are three main contributions of this paper:
(1) shed light on the multitude of dimensions affecting thread management: on sources of bottleneck and necessary tradeoffs.
(2) offer a few solutions for thread management, including the interesting concept that "work searches for processors".
(3) show an analogy between cache invalidation and Ethernet collision and obtain a backoff algorithm for resource contention.
Organization: A process' threads share the address space. A thread is defined by a control block (including a PC) and a thread-local stack. Preallocating some of these saves time, hence the use of a free list for control blocks, a stack free list, and a queue of threads ready to run. These are shared resources that need to be protected by locks. Using only one lock for all resources would be simple, but would serialize thread management and hence be very inefficient. Using distinct locks for distinct data alleviates this and increases throughput, however at the cost of latency: more lock accesses would be required per thread. This can be improved by maintaining lists of local resources (free lists and, eventually, ready queues) that require no locks, plus a global pool that does require a lock but is used infrequently, based on a threshold for the local count. Furthermore, for parallelism, they use a global list of idle processors that can be used at thread creation.
They use 'xchgb' as a locking primitive coupled with an Ethernet-Style Backoff for spin waiting.
Using per-processor resource lists makes it easy to exploit locality, but it makes the code more complex. It is also more difficult to ensure proper balancing.
Weaknesses/Flaws: It is unclear what the interface to this thread package is and how many of the details are exposed to the user. Can the user see the spinning algorithm? Getting a random number for the backoff can be expensive. Also, this backoff seems to be a fixed policy that the user may not like (maybe because of the lack of fairness guarantees, or because they have only 2 processors).
Posted by: Daniel Luchaup | October 7, 2008 01:27 AM
Summary
The purpose of this paper is to study the performance implications of thread management alternatives for shared-memory multiprocessors, and to present an efficient thread management package along with an Ethernet-style backoff algorithm that improves the performance of waiting processors.
Description of problem they are trying to solve
Programs on multiprocessors use threads to exploit parallelism. The thread package plays an important role in overall performance; thus a small change in the organization of data structures and locks can have a significant impact. This paper evaluates the alternatives in order to identify those that avoid unwanted performance degradation.
Another issue they are looking at is the algorithm used to queue for locks. In general, a thread that tries to acquire a lock either spins until the lock is released or relinquishes the processor. Spinning is the only option within the thread management routines, and it costs not only the spinning processor but other processors doing work as well (due to bus contention). They present an Ethernet-style backoff algorithm that mitigates those effects.
Contributions of the Paper:
1. Keeping all the data under a single lock limits throughput due to high lock contention. Per-processor data structures can be used to improve throughput; if a resource is not scarce, localizing data can avoid locking, improving latency as well.
2. Multiple locks for the centrally placed data structures (ready queue, stack free list, and control block free list) involve separate locking for each enqueue and dequeue operation. Throughput is increased, but so is latency.
3. Adding complexity results in a noticeable increase in latency; adding even a few instructions impacts performance. The idle queue strategy checks for idle processors on thread creation. If the idle queue is always empty, it defaults to a normal ready queue, yet even this simple check markedly increases per-thread costs.
4. Throughput can be further increased by dividing the load under a single lock across several locks. The idea is to keep a ready queue per processor. In this way, enqueueing and dequeueing can occur in parallel, with each processor using a different queue. This can be inefficient, though, since each idle processor needs to scan the other processors’ queues.
5. Spin on xchgb and spin on read slow processors down, since both let the spinning processors’ cache copies be invalidated when the lock is freed, causing bus contention that also affects working processors.
6. Spin-waiting can not only delay the processor waiting for a lock, but other processors doing useful work. This appears to be independent of cache coherency protocol.
7. The cost of spin-waiting can be reduced by using an Ethernet-style backoff or a queue-based algorithm.
Flaws:
The evaluation is limited to a test environment of a million null threads; real-world results might not be the same.
What it makes Easier:
It makes it easier to identify the thread management alternative that’s best for your requirements. It makes it possible to implement a fast thread package.
Bus contention is reduced by the introduction of the Ethernet-style backoff algorithm, as each processor maintains an estimate of its likelihood of acquiring the lock and tries only when the probability is high.
What it makes Harder:
The Ethernet-style algorithm isn’t fair. A processor that has just arrived is more likely to acquire the lock than one that has been waiting. Thus, there’s a possibility of indefinite starvation. It also takes longer for a spinning processor to acquire a newly freed lock: the processor must check the value, delay, and check it again before trying the lock.
Posted by: Rachita Dhawan | October 7, 2008 12:46 AM
Summary:
This paper describes various alternatives for thread management on shared-memory multiprocessors, along with the various alternatives for spin-lock management. It also proposes an Ethernet-style backoff algorithm to reduce the cost of spinning for locks.
The problem the paper was trying to deal with:
The main goal of this paper is to develop a thread package that obtains a good balance between low latency and high throughput while serializing access to shared data structures on shared-memory multiprocessors.
Contributions
1. This paper makes some contributions on the analysis and measurements of thread management approaches on shared-memory multiprocessors, which include single lock, multiple locks, local free-list, idle queue and local ready queue. It also describes the potential advantages and disadvantages for each approach of these alternatives.
2. The paper describes some common problems related with spinning locks for some existing techniques like Spin on Xchgb and Spin on Read, and then proposes an Ethernet-style backoff algorithm which greatly improves performance with moderate numbers of waiting processors.
3. The paper presents a simple queuing network model for the thread package, showing that the measurements can be explained by the combination of processor degradation caused by bus contention and the effect of lock contention; the model is then validated against the measurements.
Flaws:
One of the flaws of the Ethernet-style backoff algorithm is that it takes longer for a spinning processor to acquire a newly freed lock. Processor degradation also remains a problem for the backoff algorithm: bus contention and lock contention can still degrade processors doing useful work.
What it makes easier:
(1) The global pool and a threshold T on the maximum size of each list make it easier to balance the free lists, and by using free lists, both allocation and deallocation can be done more easily. (2) The Ethernet-style backoff or a queue-based algorithm makes it easier to reduce the cost of spin-waiting.
What it makes harder:
The technique of fine-grained parallelism on shared-memory multiprocessors and the complexity of the Ethernet-style backoff algorithm make the coding job much harder.
Posted by: Tao Wu | October 7, 2008 12:43 AM
Summary:
Prior to 1989, thread packages had become an integral component of new languages and operating systems (Mach, Topaz, Presto, Mesa, etc.). This paper examines the performance implications of thread management both empirically and analytically. The two take-away messages of this paper are that per-processor data structures can improve throughput significantly, and that an intelligent algorithm for resolving lock contention can significantly reduce the cost of spin-waiting.
Problem:
This paper studies the problem of understanding the performance implications of thread management and spin-lock management alternatives for shared-memory multiprocessors. The performance metrics studied are the throughput and latency of thread management for fine-grained parallelism. Another important goodness metric studied in this paper, for the spin-lock management alternatives, is relative processor speed. The results reveal that per-processor data structures and an Ethernet-style backoff algorithm can improve performance significantly.
Contributions:
This paper shows that small differences in thread management schemes can lead to significant performance gains, particularly in terms of throughput and latency. This is true especially for applications requiring fine-grained parallelism. The authors support this claim by comparing five different thread management alternatives - central data structures (single lock and multiple locks), local free lists (per processor), an idle queue, and local ready queues. For different scenarios (based on number of threads, number of processors, etc.) different alternatives work best. However, one of the important findings of the paper is that per-processor data structures can definitely improve throughput, by using local ready queues.
Another major contribution of the paper is an Ethernet-style backoff algorithm for resolving lock contention (with queue-based algorithms mentioned as a further alternative). The paper highlights the problem that a processor spinning on a lock can degrade the performance of the other processors, particularly due to bus contention. In the backoff algorithm, each processor uses its history (the number of past failures to acquire the lock) to estimate its likelihood of success before retrying.
Flaws:
The synthetic benchmark used by the authors for evaluating thread management alternatives is overly simplistic. All their measurements use the same underlying Sequent multiprocessor and standard compiler. Some of their results may be artifacts of this multiprocessor, though they are successful in showing independence from the cache coherency protocol. As also mentioned, the authors didn't extend their results to real application programs running on multiprocessors and requiring fine-grained parallelism.
The Ethernet-backoff algorithm is known to be unfair and may lead to potential starvation. This problem is noticed by the authors, but not well taken care of. Another thing I observe is the use of only primitive thread management alternatives. The authors only mention examining hybrid schemes as part of their future work (which had started getting popular at that time).
What it makes easier?
The paper presents the performance implications of thread management alternatives for multiprocessors in a very clean manner. Analytical results based on queuing models elevate the significance of the empirical results. This paper (SIGMETRICS '89 - best paper) and one of its successors on scheduler activations (SOSP '91 - best paper) laid the foundation for thread management on shared-memory multiprocessors (Ph.D. thesis of Tom Anderson). The performance of parallel applications has benefited from this work to a large extent, due to efficient thread management.
What it makes harder?
The Ethernet-backoff algorithm needs to maintain history to predict the future. Moreover, such algorithms are known to be unfair and may lead to starvation. Lock fairness is much more critical to applications in a real-time system than in the networking context these algorithms come from. I believe these spin-lock management alternatives need to be revisited to satisfy the hard requirements of the system being used.
Posted by: Mohit Saxena | October 7, 2008 12:43 AM
Summary:
A fine-grained parallel program creates a large number of threads, or uses threads that frequently block and unblock, or both. Thread management cost is a major obstacle to fine-grained parallelism. This paper studies five strategies, with their advantages and disadvantages in terms of latency and throughput. In the second half of the paper, three different approaches to spin-waiting are introduced. Measurement graphs are also provided to compare the performance of each approach.
Problem:
On uniprocessors, the thread management module need not worry about concurrent access, so no locking inside the thread management module is required and its overhead is limited. In the multiprocessor case, access to the shared data structures of the thread management module (ready queue, stack free list, and control block free list) must be serialized. This serialized access increases latency and reduces the throughput of the thread management module. In this paper the authors analyze different approaches to managing/protecting these shared data structures.
Contribution / Work Summary:
1. Locking each data structure separately increases latency compared to locking all data structures under a global lock. A global lock, however, reduces throughput, as most of the thread management path is serialized. Using per-processor data structures reduces latency without compromising throughput.
2. With local free lists for stacks and control blocks, the global ready queue limits throughput. Local ready queues support higher throughput, but they might cause indefinite idling of a processor. Indefinite idling can be avoided if a processor scans the ready lists of other processors, but this again causes contention in case a single processor does all the enqueueing.
3. Thread management routines must be kept simple, as additional complexity increases latency.
4. Using an idle queue (a queue of idle processors - in this approach work searches for a processor instead of processors searching for work) is faster if there are idle processors but slower when there are more runnable threads than processors.
5. One more advantage of local ready queues is that a thread is more likely to be executed on the same processor once it resumes. This results in fewer cache misses, as there is a greater chance the thread's data will still be in that processor's cache.
6. The alternative to spin-locking is to block the thread and perform some other useful work while the lock is busy. This looks like a nice approach, but in real scenarios a context switch is more expensive than spin-locking, as context switching involves more overhead.
7. The cost of spin-waiting can be reduced by using the Ethernet-style back-off algorithm, which is somewhat unfair in that a new processor might acquire the lock before a processor already waiting for it. A solution to this is a queue-based spin-waiting algorithm (one simple queue-style lock is sketched below).
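One simple queue-style lock is a ticket lock, sketched below in C as an illustration. This is a generic construction, not necessarily the specific queueing algorithm the paper alludes to; the point is that the lock is granted in arrival order, so a late arrival can never jump ahead of a waiting processor.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* taken by arriving processors */
        atomic_uint now_serving;   /* advanced by the releasing processor */
    } ticket_lock_t;

    static void ticket_acquire(ticket_lock_t *l)
    {
        unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                memory_order_relaxed);
        /* Spin (in cache) until it is this processor's turn; FIFO order
         * means no waiter can be starved by later arrivals. */
        while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
            ;
    }

    static void ticket_release(ticket_lock_t *l)
    {
        /* Only the holder writes now_serving, so a load/store pair is safe. */
        unsigned next = atomic_load_explicit(&l->now_serving,
                                             memory_order_relaxed) + 1;
        atomic_store_explicit(&l->now_serving, next, memory_order_release);
    }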
Flaws:
1. The paper shows measurements of each strategy on one million null threads, which is nowhere near a real fine-grained parallel application.
2. The paper does show that if each thread does 300µs of user work, the speedup pattern is different from the case when the thread does no work. More information about the dependency of speedup on user work is missing from the paper.
What It Makes Easier/Harder:
As we make data structures local to each processor, the overhead of locking global lists is reduced, making thread creation/resumption cheap. But handling special cases like indefinite idling requires scanning other processors' queues, which increases the complexity of the code and thus the latency.
Posted by: Nikhil Teletia | October 7, 2008 12:13 AM
Introduction
This paper discusses the multiprocessor performance implications of threads, which are lightweight processes with minimal structure. Performance features of the thread data-management list implementations and locking mechanisms are examined, showing the tradeoffs.
Problem trying to solve
With the evolution of multiprocessor systems, it has become more important to reduce the expense of processes and reduce the overhead of sharing data among processors. However, there are trade-offs to consider when using different locking and thread queuing mechanisms making it difficult to maximize throughput and minimize latency.
Contributions
This paper contributes the analysis and measurement of thread execution techniques on multiprocessors. The measurements of single lock, multiple locks, local free lists, and idle and ready queues, in terms of latency and speedup, show optimal behaviors for different critical section sizes and numbers of processors. In addition, the authors port and test the Ethernet-style back-off algorithm on a threaded multiprocessor to compare its performance.
Flaws
All measurements using the Ethernet-style back-off algorithm ignore the fact that lock fairness is in play. It may be possible for some of the threads to be starved, yet this aspect is not measured. For example, the longer a thread goes without getting access to a lock, the longer its wait time becomes, reducing its likelihood of executing. In this respect, more real-world test sets should have been used to show the real performance differences. Also, in stating that they have validated their analytic model, no results are shown besides speedup vs. number of processors; more tests should be performed to ensure the analytic model is really representative.
Organization
Threads are organized as lightweight processes that each have a program counter, a stack, and a control block. For thread management, there are three data structures: the ready queue, the stack free list, and the control block free list. These data structures can be organized with a single lock, which limits throughput; multiple locks, which increase throughput but trade off latency; local free lists, which lower latency and improve throughput; an idle queue, which reduces latency when idle processors exist and increases it when all are busy; or local ready queues, which reduce cache misses but require idle processors to search for work, increasing latency. The locking styles examined are spin on xchgb, spin on read, and Ethernet-style back-off. Back-off trades lock fairness and a longer time to acquire a free lock for less bus saturation from lock misses, allowing higher throughput.
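As a rough C sketch of the idle-queue ("work seeks a processor") organization, with all names and structures being my own assumptions rather than the paper's code: an idle processor enqueues itself and spins on a private flag, and thread creation hands new work directly to it when the queue is non-empty.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    struct thread;

    struct idle_entry {
        atomic_uintptr_t   handoff;   /* set to the new thread by the creator */
        struct idle_entry *next;
    };

    extern struct idle_entry *idle_pop(void);              /* locked central idle queue */
    extern void               idle_push(struct idle_entry *e);
    extern void               ready_enqueue(struct thread *t);

    void dispatch_new_thread(struct thread *t)
    {
        struct idle_entry *idle = idle_pop();
        if (idle) {
            /* Fast path: hand the work straight to an idle processor that is
             * spinning locally on its handoff flag. */
            atomic_store_explicit(&idle->handoff, (uintptr_t)t,
                                  memory_order_release);
        } else {
            /* No idle processor: fall back to the ordinary ready queue. */
            ready_enqueue(t);
        }
    }

    struct thread *wait_for_work(struct idle_entry *me)
    {
        atomic_store_explicit(&me->handoff, 0, memory_order_relaxed);
        idle_push(me);
        /* Spin on a per-processor flag, so waiting generates no bus traffic. */
        uintptr_t t;
        while ((t = atomic_load_explicit(&me->handoff, memory_order_acquire)) == 0)
            ;
        return (struct thread *)t;
    }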
Posted by: Cory Casper | October 6, 2008 11:22 PM
Summary
In this paper, the authors concentrate on various techniques for organizing thread management data structures and implementing the locks that protect them on a shared-memory multiprocessor. They also compare and contrast the various methods to show which techniques are suitable for specific kinds of loads and what their tradeoffs are.
Problem
The authors want to build a package that allows fine-grained parallelism to be exploited on shared-memory multiprocessors. Unlike on uniprocessors, locking and bus contention become bottlenecks on a multiprocessor. Also, providing coarse-grained parallelism at the process level (DYNIX) is inefficient because of the heavy initialization overheads.
Contributions
- They provide multiple possible implementations of thread synchronization across multiple processors and perform a thorough analysis of the tradeoffs (latency/throughput) for each implementation.
- They progressively decentralize each shared data structure (control block, stack, ready queue) to improve throughput at the cost of memory overhead. Their local ready queue implementation finally moves the ready queue and the control blocks to each processor, which reduces contention between the processors but increases memory overhead and latency (more locks per thread creation).
- They have reduced the memory footprint of their package by initializing the stack only at thread startup, so all unstarted threads use very little memory.
- They also discuss various techniques to implement spin locks and propose their own solution based on Ethernet-style back-off. This method backs off exponentially after every failed attempt, which reduces bus contention and repeated cache invalidations.
Flaws
- Having 2/4/8 processors on a single board is more common than having 20+ processors. None of their methods seem to provide any major improvements for smaller numbers of processors compared to the naive single-lock technique.
- The Ethernet-based exponential back-off is biased toward processors that have just arrived rather than those already waiting. Also, even after most of the processors have acquired the lock and finished processing, the remaining ones still have a long wait period, which is shown in the "barrier synchronization" experiment.
Design
Easier
- Queues of idle processors are a very useful construct for systems with fewer threads than processors, since assigning jobs to processors via local spin-wait flags is easier than having each processor search for jobs.
Harder
- As the amount of decentralization in each method increases, the complexity of the code also increases. Also, getting the current state of the system (e.g., the number of queued threads) would require checking the state of each processor.
Posted by: Tushar Khot | October 6, 2008 11:20 PM
Introduction:
This paper discusses various issues related to thread management on multiprocessor systems. The paper tackles two main design decisions which have a huge impact on the performance of the system: the thread management data structure organization and the method of acquiring locks.
What were they trying to solve:
On multiprocessor systems, processes are unsuitable for achieving fine-grained parallelism. This leads to the idea of lightweight processes or "threads" which encapsulates the notion of an executable unit. For any thread-based system to be efficient, the time taken for thread operations (creating a thread, switching between threads etc) should not be a large overhead on the system. This is especially hard to do in multiprocessor systems, since true concurrent access to shared data is possible. The authors aim to design a thread package which has low latency and high throughput.
Contributions:
The paper identifies two major factors which affect performance of the thread package:
Granularity of locking on common thread management data structures: The authors give a detailed analysis of five different organizations, starting from a global data structure with a single lock and ending at per-processor lists for each data structure. The main theme is that having a decentralized per-processor structure with small locking granularity increases the throughput of the system, but at the cost of additional overhead. The authors specify the cases in which each organization would be suitable.
How the locks are obtained has a significant impact on performance:
Spinlocks (busy-wait locks) waste CPU cycles while waiting, but they are preferred over blocking and unblocking since the wait times for locking thread data structures are usually low, and blocking would cause cache misses and other costs due to context switches.
The simplest way to implement a spinlock is to use test-and-set instructions. This has the disadvantage of bus congestion.
Instead of reading the lock bit from memory, the processor can read from the cache. When some other processor releases the lock, the cached content is invalidated, and then the normal locking procedure can be used.
To prevent multiple processors from synchronously retrying for a lock, when obtaining a lock fails, the processor waits for a random amount of time before trying again, as in the Ethernet backoff protocol.
The paper also has numerous optimizations to decrease latency:
When processors are idle, do some of the work of thread creation and allocation beforehand.
Keeping the control block minimal has a significant impact on performance.
Flaws:
Bus contention is the bottleneck they face in the Symmetry system, which has 20 processors sharing a central memory. Multiprocessor systems nowadays tend to have private memory for each processor, so that the bus load isn't as significant. Some of their analysis would change significantly if bus contention were not a major issue.
The measurements are not based on realistic workloads. It would have been better if they had run some real-world benchmark to test the performance of various designs.
OS Organization:
The paper advocates decentralized per-processor thread management structures: Each processor maintains a list of free control blocks, a list of ready threads and list of free stacks. Any locking is done via spinlocks on read, with random backoffs to minimize repeated conflicts.
Advantages:
There is increased throughput compared to centralized locking, since concurrency is increased by not having to wait for other processors to release locks on global data structures.
Disadvantages:
There is more overhead because of balancing the resources and threads between different processors.
The back-off algorithm is unfair and favors new processes. Also there is no way to implement priority or other policies into this process.
Posted by: priyananda | October 6, 2008 10:37 PM
Summary: The authors of this paper evaluate several thread management mechanisms based on different performance metrics in order to find the most effective thread management scheme for multiprocessors. They also describe an Ethernet-style backoff algorithm to prevent wasted processor cycles due to spinning locks.
Problem to Solve: Threads have introduced the ability to process work in parallel on multiprocessors, but successful parallel processing requires efficient locking and scheduling mechanisms. As the number of threads and processors grows, the importance of these mechanisms increases, and thus the authors are trying to evaluate and present the best mechanisms for use on multiprocessors.
Contributions: The authors detail several current data structures and algorithms for thread management, and evaluate each of them based on 1,000,000 null thread instances. General weaknesses and strengths are discussed as part of this evaluation.
The paper presents some of the problems associated with spinning locks and presents the Ethernet-Style Backoff algorithm as an alternative for spin-lock management. They implement the algorithm and evaluate it against spin xchgb and spin read.
The paper shows the effects of spin locking on processors doing useful work as well as on the ones that are spinning.
Flaws: One flaw of the paper is that most of the evaluation is done on 1,000,000 null threads. Although this benchmark makes their evaluation consistent, null threads are not necessarily representative of how non-null threads could perform in certain cases.
Advantages of Proposed OS: One advantage of this research is that it evaluates thread management schemes on several different performance metrics, making their conclusions about the best thread management scheme for multiprocessors credible and comprehensively evaluated with regard to many of the challenges each of these mechanisms faces. Another advantage is that they show how small changes to data structures can drastically impact performance and how to reduce the effects of spin-waiting. Using localized data structures provides a way for processors to handle many threads in organized queues, while ensuring balancing via a scheduler. This helps reduce latency and increase throughput by reducing the critical sections held by individual locks and reducing the time required for a thread to find an open processor to run on.
Disadvantages: One disadvantage is that the Ethernet-style back-off algorithm suffers from some of the same problems as Ethernet back-off does in the networking context. Threads which gain the lock can almost starve others, because each time a collision on a lock occurs the losing threads increase their back-off counters. Also, once a lock is released, it may sit idle until a waiter's back-off timer expires, and the lock must be re-checked before a processor actually acquires it.
Posted by: Holly Esquivel | October 6, 2008 10:32 PM
Summary
In “The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors”, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy describe a variety of methods for accessing shared resources between threads through the use of locks and present the performance implications of these varying schemes.
Problem
Using additional hardware to run a program written to solve a computational problem in order to speed up that program’s execution is a logical action. However, due to the need for synchronization of data, these extra processors must collaborate and this overhead of collaboration can sometimes lead to poorer performance than we would hope for. This is shown to be quite a problem when dealing with various locking schemes to facilitate resource sharing.
Contributions
• Applying the Ethernet-style backoff algorithm to lock acquisition and showing experimental results that indicate this method scales better than other lock acquisition approaches.
• Presenting an overview of a variety of locking schemes in use and describing their potential pitfalls.
• Showing through experimental observation that the locking approaches described do not scale very well and that the Ethernet-style backoff algorithm performs better.
Flaws
• The experiments were run purely on synthetic benchmarks, which are not always indicative of real workloads; it would have been useful to see how the various schemes performed under real workloads.
Organization
First, a variety of locking schemes are presented, along with how they are used and where their bottlenecks lie. After this, a new method for improving the performance of synchronization code is introduced. This method alleviates the load on a lock that occurs when multiple processors attempt to access the same lock, by decreasing the frequency of access attempts when the lock cannot be obtained. By scaling back requests, the problem is not intensified as more and more processors also attempt to obtain the lock.
Posted by: Mark Sieklucki | October 6, 2008 08:52 PM