TreadMarks: Shared Memory Computing on Networks of Workstations
C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations," IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996.
Review due Tuesday, 3/20.
Comments
The authors want to develop a shared memory system for processes running at different physical locations, without the authors of the process code having to be too aware of the distributed nature of the memory. TreadMarks is a simple system in which nodes request the pages they are interested in and push notice of their writes out to the rest of the network.
The authors offer two primitives: locks and barriers. Locks directly protect critical sections and are already familiar to most programmers. Barriers are slightly different. They are one way to make distributed computation easier to reason about. By demanding that processes periodically pass through a barrier, computation is broken up into rounds, as all processes must meet at the barrier before any can continue. Thus any one process can have a pretty good idea of the state of another without directly querying it, simply because the timing is now similar across nodes and the piece of an algorithm implemented at each node is generally similar within a round. However, this comes at the cost of slowing overall performance, as each node can only move at the pace of the slowest. Without a failure detector to decide whether a node is running slowly or has died, delays could be considerable. That's why interest in the distributed algorithms community has focused more on wait-free asynchronous computation and not so much on synchronous algorithms.
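To make the round structure concrete, here is a minimal sketch in the spirit of the paper's Jacobi example; Tmk_barrier, Tmk_proc_id, and Tmk_nprocs stand for the barrier call and the process id/count the paper's examples use, with exact declarations paraphrased, and the array sizes and averaging step purely illustrative:

    /* Sketch of barrier-structured rounds, Jacobi-style. */
    #define M 1024
    extern void Tmk_barrier(unsigned id);
    extern unsigned Tmk_proc_id, Tmk_nprocs;

    float grid[M + 2];       /* in a real program, allocated shared via Tmk_malloc */
    float scratch[M + 2];    /* private per-process scratch space */

    void relax(int rounds)
    {
        int chunk = M / Tmk_nprocs;
        int begin = 1 + Tmk_proc_id * chunk;
        int end   = begin + chunk;

        for (int r = 0; r < rounds; r++) {
            for (int i = begin; i < end; i++)
                scratch[i] = 0.5f * (grid[i - 1] + grid[i + 1]);
            Tmk_barrier(0);      /* all nodes finish reading old values */
            for (int i = begin; i < end; i++)
                grid[i] = scratch[i];
            Tmk_barrier(1);      /* all nodes finish writing before the next round */
        }
    }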
The authors offer an optimization on performing writes. Instead of pushing a write to all nodes immediately, the writing node simply informs the other nodes that a write has occurred on that page so they can invalidate the old copy in their local cache. When a node later wants to read or write the page, it asks the writer to send a diff containing the changes made. This cuts communication down to only the messages that are actually needed. In the case where all nodes are interested in the same page, it is less efficient, since multiple messages must be sent and an extra round trip is added to the delay. However, we expect that in the common case there will be significant savings. Since programmers should already expect multiple processes running on the same machine, and all that's being hidden is their spread over multiple physical locations, the code should be constructed so that multiple writes to the same location are guarded against. As such, this system doesn't defend against them or offer a mechanism for resolving conflicting writes.
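As a rough illustration of what the writer ships, here is a sketch of building a diff against a saved pre-write copy of the page, assuming an illustrative (offset, length, changed bytes) run encoding; the paper does not specify the actual format:

    #include <stddef.h>
    #include <string.h>

    /* A diff is a sequence of runs, each followed by its changed bytes
     * (illustrative encoding, not the paper's). */
    struct run { size_t off, len; };

    size_t make_diff(const unsigned char *twin, const unsigned char *page,
                     size_t page_size, unsigned char *out)
    {
        size_t n = 0, i = 0;
        while (i < page_size) {
            if (twin[i] == page[i]) { i++; continue; }
            size_t start = i;
            while (i < page_size && twin[i] != page[i]) i++;   /* extent of the change */
            struct run r = { start, i - start };
            memcpy(out + n, &r, sizeof r);        n += sizeof r;
            memcpy(out + n, page + start, r.len); n += r.len;
        }
        return n;    /* number of diff bytes to ship, usually far less than a page */
    }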
I think this is a decent system. Not too ambitious in its scope and fairly simple, but offering a decent platform for moving multi-threaded computation to multiple machines. There are a few questions that I didn't get resolved. Is the system assuming a static number of nodes known initially? If a single-node computation spawns a new thread, can it be moved to another location in the network? There's no membership protocol. Second, as an abstraction of single-machine memory, do we assume that the address space is small enough for one machine to hold it all in memory? There doesn't seem to be a mechanism to offload changes in the event of a cache overload. If we're not using the additional memory offered by multiple machines, does our gain simply come down to the speedup offered by multiple CPUs (offset by network latency)?
Posted by: Brian Nixon | March 20, 2012 08:41 AM
The paper is about implementing a distributed shared memory (DSM) system on a network of workstations that uses lazy release consistency to execute any data-race-free parallel program efficiently.
A DSM system must extend the virtual memory abstraction beyond a single compute node. That is, it should provide a shared global address space that can be used as if it belonged to a single system. The concurrent execution of multiple processes on this global address space must adhere to some memory consistency model, ideally one simple enough for ease of programming (sequential consistency). But it turns out that such simple (but stronger) memory consistency models buy ease of programming at the cost of performance. This paper investigates the possibility of improvement with weaker consistency models like release consistency.
The authors implemented a DSM system called TreadMarks which provides a global address space at the user level by utilizing the existing paging mechanism. They use the same approach as the Ivy DSM system, which used page faults on a globally shared memory to fetch the latest copy of a page from elsewhere in the system, effectively making local main memory a cache of the huge global address space spanning the network of workstations. TreadMarks guarantees a weaker consistency called release consistency, since the authors identified that guaranteeing stronger consistency is detrimental to application performance. Release-consistent memory systems rely on synchronization around shared memory accesses as the hint for triggering memory consistency management (making sure the next read returns the latest write). Here the synchronization primitives considered are barriers and lock acquire/release. The major contribution of this paper is the idea of "lazy" release consistency. Lazy release consistency exploits the fact that the memory subsystem needs to assure consistency only at each lock acquire, rather than at every lock release (eager release consistency) or on every read and write (sequential consistency). It reduces the number of invalidations sent to other nodes by observing that only the next processor that accesses the shared page needs to be updated, rather than the rest of the world (though a barrier requires all processors' memory to be made consistent, unlike acquire/release). To implement lazy release consistency efficiently they propose other novel ideas: (1) maintaining two versions of a page on a write (to circumvent false sharing and ping-ponging) and (2) maintaining just the modifications (diffs) and communicating them across processors. These ideas are the other major contributions of the paper, since they solve critical performance problems such as false sharing, which is very likely to cause the ping-pong effect (slowing down the system) when the unit of allocation is as large as a page and concurrency is high. The illustrations in the paper with real benchmarks show how easy it would be to use TreadMarks and the overall speedup realized on those benchmarks.
The paper seems incomplete without any discussion of the "naming" problem in a DSM system; that is, it doesn't answer these questions. How is the huge global (virtual) memory addressed in the program? How is the wide address requirement handled in the OS? How would two concurrent processes identify whether the page they are accessing is actually the same page? To be precise, what does Tmk_malloc() return? Naming is a huge challenge in any DSM system, and it is not addressed in this paper. Also, the weak consistency model strictly requires the programmer to synchronize every shared memory access (even reads). This is too restrictive, and the system probably needs to support reader-writer locks or other forms of synchronization that may come in the future (!).
This paper provides many optimizations and novel ideas for implementing a release-consistent DSM system which are still valid and useful in today's systems, all the more so because of the advent of multi- and many-core processors with banked memory modules.
Posted by: Venkatanathan Varadarajan | March 20, 2012 07:48 AM
This paper is about TreadMarks, a system for distributed shared memory.
TreadMarks works by providing an API that allows programmers to allocate and share memory pages between processes running on potentially different machines. It also provides two types of synchronization: barriers and mutexes. The core insight on which TreadMarks is built is that correctly written concurrent programs usually surround access to shared memory with calls to one synchronization primitive or another. For example, if multiple processes are modifying the same bytes, they will use a mutex to prevent race conditions and keep from overwriting each other's changes. If a program does not do this, it is usually indicative of a bug.
TreadMarks uses this insight to implement lazy release consistency. When a process modifies a shared page, it faults and TreadMarks creates a twin of the page. After the process calls a barrier or mutex release, TreadMarks creates a diff of any modified pages and notifies other processes that those pages have been modified. When a process reaches a barrier or mutex acquire, it notifies other processes and, if it has previously been notified of a page modification, requests diffs of those pages and applies them to its own memory.
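A minimal sketch of the write-fault step just described, with invented names; the real handler also records a write notice for the current interval, which is omitted here:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    struct page_state {
        void *addr;    /* start of the shared page */
        void *twin;    /* snapshot taken before local modification, NULL if clean */
    };

    /* Called on the first write fault for a page: make a twin, then allow writes. */
    void handle_write_fault(struct page_state *ps)
    {
        if (ps->twin == NULL) {
            ps->twin = malloc(PAGE_SIZE);
            memcpy(ps->twin, ps->addr, PAGE_SIZE);   /* preserve pre-write contents */
        }
        mprotect(ps->addr, PAGE_SIZE, PROT_READ | PROT_WRITE);
        /* the diff (twin vs. current page) is produced later, at release time */
    }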
One flaw in this paper is that the system, while it improves on eager release consistency, still potentially involves every process communicating with every other process. It would be interesting to explore ways in which this could be alleviated, such as by using an auxiliary service or overlay network.
It also seems to me that TreadMarks will not eliminate all false sharing. Suppose you have several processes modifying different portions of a page, each of which is protected by its own mutex. It may well be that, for several releases, there are many pairs of processes that never modified each other's portions. However, under TreadMarks, they would still have to send diffs to each other.
Another flaw (perhaps more of a limitation) is that TreadMarks would only work for algorithms that use only locks and barriers for synchronization. Some modern parallel algorithms use hardware, ad hoc, or other forms of synchronization, depending on the application.
I also don't quite understand the definition of the partial order relation at the bottom of page 11. It seems to me that this definition is circular. Under the second bullet point, they define happens-before-1 for a release and 'corresponding' acquire, but when defining 'corresponding', the authors say that "an acquire corresponds to a release if there are no other acquires or releases on that lock in between." But I think "in between" is itself a temporal notion here. So either they define happens-before-1 in terms of real time, or they have a circular definition. Neither case seems right to me.
Posted by: James Paton | March 20, 2012 07:48 AM
This paper discusses TreadMarks, a distributed shared memory system for networks of workstations. The system alleviates the difficulty of developing parallel programs and improves the performance of programs running on it.
The goal of the paper is to design a system that allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. This system faces several challenges. The consistency model plays a central role if the implementation replicates data. Sequential consistency is generally viewed as a natural consistency model, but it introduces extensive communication, and as a result, exchanging large amounts of data is not practical. Sequential consistency also raises several implementation problems. One is that, in some situations, it is not necessary to invalidate all the copies distributed on other workstations. The second is false sharing, which occurs when two or more unrelated data objects are located in the same page and are written concurrently by separate processors. This leads to the “ping-pong effect”.
Thus, the main contribution of this paper is the solution to the above challenges. The paper proposes a new consistency model, the release consistency model, together with a new invalidation protocol called the multiple-writer protocol. The key idea behind the release consistency model is to delay communicating modifications until it is necessary. Synchronization is introduced in a shared memory parallel program to prevent processes from accessing certain memory locations before the synchronization operation completes. There are two algorithms for release consistency: the lazy release consistency algorithm, which enforces consistency at acquire time, and the eager release consistency algorithm, which enforces consistency at release time.
The invalidation protocol, the multiple-writer protocol, allows multiple processes to have a writable copy of a page at the same time. At synchronization points, other processes are informed about the modifications, and the new value is obtained on an access fault by applying the diff to the page. This technique reduces the effects of false sharing and significantly reduces overall bandwidth requirements.
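As a rough illustration of the access-fault path, here is a sketch of applying a received diff to the local copy of a page, assuming a simple (offset, length, bytes) run encoding for diffs; the paper does not give the actual format:

    #include <stddef.h>
    #include <string.h>

    /* A diff is a sequence of (offset, length) runs, each followed by the new bytes. */
    struct run { size_t off, len; };

    void apply_diff(unsigned char *page, const unsigned char *diff, size_t diff_len)
    {
        size_t i = 0;
        while (i < diff_len) {
            struct run r;
            memcpy(&r, diff + i, sizeof r);         i += sizeof r;
            memcpy(page + r.off, diff + i, r.len);  i += r.len;   /* patch the local copy */
        }
    }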
One flaw of this paper is the lack of comparison with other similar implementations of distributed shared memory systems. The evaluation part of the paper shows the implementation is able to provide near-linear speedup for the applications, which is a really good result. But it fails to show how the strategies adopted in this system benefit it compared to other approaches. It would be better to see how these strategies pay off in real applications.
In sum, the paper provides an interesting DSM design and implementation. The results show that the system can be used in practice for specific applications. The shared memory model also makes it simple for programmers to implement parallel algorithms. Thus, I think the system is still applicable and valuable in the real world.
Posted by: Xiaoyang Gao | March 20, 2012 04:11 AM
TreadMarks: Shared Memory Computing On Networks of Workstations:
In this paper, Amza, Cox et al. present a new distributed shared memory system called TreadMarks. Among its major contributions are that it offers a new service without overly changing the semantics of common programming languages (POSIX threads was defined in 1995, before this paper was published) and its lazy release protocol, which eliminates a significant amount of communication overhead.
Flaws:
It is unclear how TreadMarks handles its 'diffs'. They mention that diffs are somehow retrieved from a 'minimal set of remote machines'. Some more information about how the consistency model works would have been appreciated.
There is an issue with failure recovery. If one process running on one machine segfaults, how is that segfault propagated to all of the other processes (especially since TreadMarks runs in user space)? In general, fault tolerance is going to be a problem. How can this system survive if one node goes down?
The goal was not to change the semantics of a program, but TreadMarks doesn't quite manage that. By using POSIX signals, it requires the user not to block signals and to double-check that all system calls that can be interrupted by a signal are restarted. Programmers do not always do that. Another semantic that may be new, but is also missing, is the idea of separating locks into read/write locks. A possible way to impose write ordering is to assign a simple counter to a lock: acquire the lock and the counter increments. If a program acquires a lock and writes to a page, that counter could be used to impose a possibly flawed, but at least consistent, ordering of which write happened when; a sketch of this idea appears below.
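A minimal sketch of that suggestion (purely illustrative, not part of TreadMarks):

    #include <stddef.h>

    /* Illustrative only: a per-lock version counter for ordering writes. */
    struct versioned_lock {
        int owner;               /* process currently holding the lock, -1 if free */
        unsigned long version;   /* incremented on every acquire */
    };

    struct tagged_write {
        unsigned long version;   /* lock version under which the write was made */
        size_t page, offset, len;
    };

    unsigned long lock_acquire(struct versioned_lock *l, int me)
    {
        /* the network protocol for obtaining the lock is omitted */
        l->owner = me;
        return ++l->version;     /* tag every write in this critical section with it */
    }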
Debugging a program running on TreadMarks would be difficult; there is no mention of any utility, be it gdb or something similar.
It would have been interesting to hear about any results on speculatively pre-fetching pages instead of waiting for a lock.
Is the overhead of running diffs on pages of memory worth the reduction in work for the programmer? Using different programming semantics, like MPI's, would avoid the diff overhead, and it appears that their semantics are similar to MPI's anyway. It's unclear what kind of speedup TreadMarks offers, as all of the axis labels are blurred in the provided copy of the paper. A point in favor of MPI over TreadMarks is that MPI potentially has a much larger available address space, as it does not impose a global address space for communication (a scalability problem).
Posted by: Matt Newcomb | March 20, 2012 03:38 AM
Victor Bittorf, Igor Canadi, David Capel, Seth Pollen
The Treadmarks system implements a shared-memory distributed communication abstraction on top of a traditional message-passing system. This makes it easier to port existing multithreaded applications to run across several nodes, since programmers can rely on familiar memory semantics rather than asynchronous message passing. To make the shared-memory model efficient, Treadmarks relaxes the consistency requirement, guaranteeing sequential consistency only for programs that are free of data races. This allows Treadmarks to defer the updating of memory pages until it encounters a synchronization operation (such as a barrier or lock operation). While Treadmarks provides an elegant and useful model, it shows very little fault tolerance or dynamic scalability, which may make it unsuitable for modern large-scale distributed systems.
In Treadmarks, changes to shared memory are communicated only when synchronization operations occur, and the changes are sent only to the nodes that need to see them to preserve sequential consistency. The paper is a bit unclear on how the decision of when to send updates is made, but as far as we can tell, the synchronization primitives (locks, barriers, etc.) keep track of which nodes have used them recently, and this information is used to determine which nodes to contact to receive memory updates when the lock is acquired or the barrier lowered. For example, a lock structure can maintain a record of the most recent node to possess it. When another node acquires the lock, it downloads all memory updates from the lock's most recent holder. In this way, the holder of the lock always has the most recent values for all the memory locations protected by that lock.
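A minimal sketch of this interpretation, using hypothetical helper routines, since the paper does not confirm the mechanism:

    /* Hypothetical helpers; the paper does not describe these. */
    extern void request_token(int lock_id);       /* obtain the lock itself */
    extern void pull_updates_from(int node);      /* fetch pending write notices/diffs */

    struct dsm_lock {
        int id;
        int last_holder;    /* node that most recently released the lock, -1 if none */
    };

    void dsm_lock_acquire(struct dsm_lock *l, int me)
    {
        request_token(l->id);
        if (l->last_holder >= 0 && l->last_holder != me)
            pull_updates_from(l->last_holder);   /* see updates made under this lock */
        l->last_holder = me;
    }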
As noted, the above paragraph only gives our interpretation of the paper, and we are not certain it is correct. On page 24, the paper states that a node pulls memory updates from a “minimal set of remote machines.” How this minimal set is computed is not given, and we consider this a major flaw in the paper. We suspect, as noted above, that it may be computed based on extra state carried with the synchronization object. However, this approach introduces its own problems, as it is difficult to determine the basis to use when computing a diff to send to another node. If nodes are not kept constantly up-to-date by the regular broadcast of memory diffs, how does a node know how far back in history to look when computing the next diff to be sent? It seems to us that the only real solution to this problem is to have nodes broadcast their memory updates to all other nodes each time a synchronization primitive is hit.
Treadmarks’s relaxed sequential consistency model performs well in the presence of false sharing, where two nodes frequently write to different memory locations on the same memory page. There is no need for such nodes to see each other’s updates, since they are operating on disjoint regions of memory. Treadmarks properly addresses this case using a multiple-writer protocol. This means that two nodes could simultaneously have writable copies of a given memory page, and the two diffs produced by these updates will have to be merged later on to return the system to a single global view of memory. Treadmarks punts on the issue of resolving conflicting concurrent memory writes, relying on programs to obey good locking semantics to prevent such racy updates.
Another flaw we found in this system is its lack of tolerance for server or communication failures. There is no provision for replication of writes across servers, so partitions and server crashes could result in the loss of data. Perhaps this system was designed for small, high-reliability compute clusters, making failure less of a concern. The apparent (implicit) reliance on broadcast communication also points to an assumption of small cluster size. We suspect that this reliance on a small, failure-free cluster restricts the applications of this technology to modern large-scale systems, though we expect that small-scale distributed computing still has its place in the world.
We noted that since this paper’s publication, memory size has grown at a much faster rate than network bandwidth (the network link used in the evaluation ran at 100 Mbps, but only 4 MB were allocated for storing diffs). Perhaps this explains why message-passing seems to remain dominant over shared memory today: the message-passing abstraction focuses developer attention on the true bottleneck (network bandwidth), thus encouraging economy on this scarce resource.
Posted by: Seth Pollen | March 20, 2012 12:26 AM
The paper summarizes the TreadMarks API, which allows programmers to easily implement parallel programs with distributed shared memory, even though the processes run on different machines without actually sharing physical memory. The API provides basic methods for allocating a shared object, acquiring/releasing locks on it, and waiting for other processes (barriers). The authors show how one can easily implement parallel programs for hard problems, discuss the way they handle consistency, and show some experimental results.
Multiple processes can read/write remote shared memory, which obviously requires processes to communicate with the source. This creates two problems: how to handle consistency and how to achieve high performance by minimizing messages. They use lazy release consistency and a multiple-writer protocol toward that end. I do not think these were novel ideas at the time (at least I can see a whole PhD thesis on lazy release consistency), but still, they make the API more efficient.
I believe the most important contribution of this work is the idea of having a simple API that allows programmers to implement parallel programs on a network of nodes. I believe the authors did a great job on this. The API is simple and easy to use, yet powerful. For example, they provide nearly the entire C implementation of the TSP and Jacobi algorithms; one can see that the code is very understandable and easy to write, yet one can expect it to provide significant speedups.
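For flavor, a minimal sketch in the spirit of the paper's TSP example, using the Tmk_malloc and Tmk_lock_acquire/Tmk_lock_release primitives named in the paper (signatures paraphrased; the struct layout and initialization are illustrative):

    /* Primitives named in the paper; exact signatures paraphrased. */
    extern void *Tmk_malloc(unsigned size);
    extern void Tmk_lock_acquire(unsigned id);
    extern void Tmk_lock_release(unsigned id);

    struct shared_state { int best_len; };      /* illustrative shared object */
    struct shared_state *shared;                /* e.g. shared = Tmk_malloc(sizeof *shared); */

    void record_tour(int len)
    {
        Tmk_lock_acquire(0);
        if (len < shared->best_len)
            shared->best_len = len;             /* every process sees the tightest bound */
        Tmk_lock_release(0);
    }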
A significant portion of the paper is dedicated to applications. Although these look like supporting results for the system, I believe the benchmark results on mixed integer programming (MIP) in particular are a significant contribution in themselves. Parallelization of MIP had been well studied by academia and industry before this paper was published, so it is quite impressive that they were able to solve an open problem.
One shortcoming of the paper is that it does not mention failures at all. I believe this is because of the nature of the problems they are trying to solve: one can simply rerun the computation when there is a failure. But it would still be better if there were some words on failures and deadlocks.
A second shortcoming of the paper is that it does not say much about scalability. The first question is how performance will change as we increase the number of nodes in the system. One can expect performance to increase, but I doubt the increase would be linear. The second question is what happens when the shared data is too large, or is accessed heavily. It would be good to see some discussion of partitioning the distributed shared memory.
Providing an easy-to-use API that simplifies parallelization is a rising idea nowadays. The MapReduce framework and APIs on top of it, such as Mahout and Pig, have a similar goal. Search algorithms for optimization problems are not that powerful on MapReduce; on the other hand, this paper shows that TreadMarks works well for such algorithms. It might be interesting to investigate the reason for this difference. I suspect it is because TreadMarks allows sharing objects among processes, so that each process can learn the best solution found so far and search for better, whereas MapReduce does not provide a global memory that every worker can reach (I am not sure this is correct though).
Posted by: Halit Erdogan | March 19, 2012 11:40 PM
The paper talks about an implementation of distributed shared memory that functions above the OS level and provides the functionality of a shared memory system on a single machine. The setup provides access to shared memory for a set of processes running on different machines within a network. This is achieved without making any changes to the OS kernel, by handling access to shared memory locations through the TreadMarks system. The shared memory is replicated: the local copies of the shared pages act as a cache of the distributed shared memory. These local pages are updated and invalidated as required, depending on the consistency model adopted.
Since memory is shared between multiple processes, consistency guarantees must be made to the higher-level applications. Also, to coordinate concurrent access to shared memory between processes, they expose barrier and locking mechanisms. Memory is divided into pages, and changes to any shared page are propagated to the other processes sharing the same page. The authors explain the sequential consistency implementation and its inherent inefficiency in the form of excessive and unwanted communication. From the simple observation that co-operating programs access shared resources safely using synchronization primitives like locks, they lead the discussion to another consistency model, the release consistency model, in which updates are flushed around the release and acquisition of synchronization primitives. They describe two forms of this, lazy and eager release consistency. Eager release consistency flushes the updates to the current set of processes waiting on the lock once the lock is released, while lazy release consistency invalidates the cached copy of the page and makes the next acquirer pull it on the subsequent access. This is similar to the intuition of push vs. pull in epidemic replication.
I liked the authors' approach of implementing the system through page faults and signal handlers instead of providing the support by modifying the kernel. This way, the system can be widely adopted and ported.
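A minimal sketch of that user-level mechanism, write-protecting pages with mprotect() and catching the resulting SIGSEGV; the handler body is illustrative, not TreadMarks code:

    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    /* Illustrative handler: a real DSM would fetch the page or make a twin here. */
    static void on_fault(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = (void *)((unsigned long)info->si_addr & ~(PAGE_SIZE - 1));
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);   /* let the access proceed */
    }

    void install_dsm_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);           /* no kernel modification required */
    }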
However, I find that the paper does not provide good intuition as to why a programmer who is good at writing parallel programs through explicit message passing would choose their shared memory model, which can incur a lot of additional overhead, since the actual intent of what is and is not shared cannot be expressed at a finer granularity in the TreadMarks system. It appears that TreadMarks makes it easier to write parallel applications but does not guarantee any significant performance improvement compared to traditional message-passing libraries. The authors talk about readers and writers of shared memory; however, TreadMarks only provides an exclusive lock. Why do they not have separate read and write locks, which could help the programmer convey the access pattern to shared memory in a better fashion? Also, they don't discuss their implementation of either locks or barriers. The evaluation sections also do not discuss how this compares to the same applications implemented with explicit message passing.
In general, the paper does not talk about how failures can affect the system or how scaling affects performance. From a distributed systems perspective, these two are very important. Also, could it be possible to take advantage of the fact that two programs using the TreadMarks API reside on the same physical machine?
This paper introduces Treadmarks, a distributed shared memory (DSM) system. It uses lazy release consistency, which makes changes to shared objects visible only when processes enter a critical section, so that less communication between processes is involved and better performance can be achieved.
A DSM system provides a convenient programming model. Processes that are not on the same machine can share an address space, so that the same part of the virtual address space in all processes refers to the same object. This convenience also brings high costs for communication and for maintaining consistency of data on different machines.
The main contribution is lazy release consistency, which eliminates much communication and promises data consistency under the assumptions that all synchronization primitives used are provided by TreadMarks and that the program is free of data races. It is easy to see that under these assumptions, all synchronization primitives and shared memory accesses are ordered by the happens-before relationship and accesses on different processes are separated by synchronization primitives, so every access can see the latest result. Compared to eager release consistency, which sends a message for every acquisition operation that happens after the releasing operation, lazy release consistency only requires communication between adjacent release and acquire operations, further reducing communication costs. Another contribution is the multiple-writer protocol, which is effective at relieving false sharing. The protocol is again based on the assumption that there are no data races, so that concurrent writes to the same page actually happen at different locations. Then applying diffs from different processes yields a unique result, which also reduces communication cost.
The assumption that correct programs have no data races is generally valid, but benign data races exist in some applications that pursue high performance. I have heard that some benign data races exist in the Windows kernel code but that the developers are unwilling to fix them because fixing them would be difficult and bring little benefit. So not every valid program is free of data races, which prevents some applications from using TreadMarks. Another flaw of this paper is that it doesn't discuss security issues at all. It is important to provide a certain level of isolation, so that if one application corrupts memory, it does not corrupt other applications' memory.
In all, TreadMarks greatly improves on previous work and provides a convenient and efficient DSM system. But it may not be very practical for real use because it doesn't provide a safe programming model.
Posted by: Xiaozhu Meng | March 19, 2012 08:53 PM
Treadmarks is a system that provides an interface for programming a set of networked computers as if they had a pool of shared memory to use, even though they do not. The problem this software aims to remedy is that parallel programming is a difficult style of programming, despite its performance benefits. The authors believe that simplicity is needed in a parallel programming model, and that this is achieved by offering a familiar programming model, shared memory, as opposed to explicitly sending messages to each separate entity in a network. This way, the details of communication can be abstracted from the programmer, and performance can be improved in the software layer that provides the shared memory abstraction.
The main contributions of this article are improvements on this idea that deliver better performance than some previous shared memory programming interfaces. This style of programming applied to networks of workstations was not a new idea at the time, but perhaps it was not widely used because the existing frameworks could not deliver performance equal to that of message-passing frameworks. Specifically, the first contribution is a "lazy release consistency" model that informs only the next entity that acquires a lock of changes to the data protected by that lock, instead of informing all nodes in the network. The second contribution is a multiple-writer protocol that improves upon a "shared cache" style of multiple processes writing to data. If more than one process is writing to a page of memory, the Treadmarks system keeps a local twin of the original data from before either process wrote to the page. The resulting diffs are then compared and merged, with conflicting writes being indicative of a data race that needs to be remedied. In this model, no data invalidation messages need to be sent and no process is delayed from writing the data it needs to write, so synchronization is not used when it is not needed and false sharing does not occur.
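A minimal sketch of the race-versus-false-sharing distinction made above, assuming an illustrative dirty-byte bitmap per writer (not the paper's representation):

    #include <stdbool.h>

    #define PAGE_SIZE 4096

    /* Two writers' dirty-byte maps for the same page (illustrative representation). */
    bool diffs_conflict(const bool dirty_a[PAGE_SIZE], const bool dirty_b[PAGE_SIZE])
    {
        for (int i = 0; i < PAGE_SIZE; i++)
            if (dirty_a[i] && dirty_b[i])
                return true;    /* same byte written by both: a data race, not false sharing */
        return false;           /* disjoint ranges: diffs can be merged in either order */
    }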
The ideas in this paper are good, but I find them to be typical and obvious solutions apparently applied to a new frontier of "networks of workstations". Also, describing the Jacobi method and the traveling salesman problem coded in their framework did not really add to the discussion for me, and I wondered why those sections were there. I suppose they serve as a demonstration of the framework, but the details were very uninteresting to me. The speedup and latency measurements were also nearly worthless to me, unless I had worked with some of those exact programs on very similar systems. Listing every single operation and how many microseconds it took is, in my opinion, a flawed way to present performance. I came away knowing nothing about how this stacked up against similar techniques (which obviously existed, judging from their prose).
This article is useful when designing new systems for deciding what the dominant programming model should be, the one we would want to work best with the system. Shared memory on a system where there is no such thing offers good simplicity for programmers, and it is worth considering how it would run on a designed cluster.
Posted by: Evan Samanas | March 19, 2012 08:44 PM
In this paper, the authors describe the TreadMarks system, a distributed shared memory system.
The motivation of the paper is to make use of high-speed networks and rapidly improving microprocessor performance to achieve parallel computing on commodity hardware and software at low cost. The system is a distributed shared memory, which allows processes to assume a globally shared virtual memory even though they do not physically share memory. It is much easier for a programmer to modify a sequential program to run on a shared memory system than on one without shared memory, which would require rewriting the communication between processes. It is also much easier to migrate a program written for the DSM system to a shared memory multiprocessor.
The DSM system provides some special programming interfaces. It uses Tmk_malloc to allocate memory from the shared memory space, while the programmer can still use the traditional malloc to allocate memory for local use. Synchronization mechanisms are necessary for keeping the shared data consistent between different processes; TreadMarks provides barrier and lock interfaces for this purpose.
There are many challenges in implementing a distributed shared memory system. The consistency model is the first one. Sequential consistency provides a simple and intuitive interface for programmers, but the implementation causes a lot of communication between the processors, and network communication is very expensive. It can even cause the "false sharing" problem, which degrades the performance of the system. So in TreadMarks the authors use a relaxed consistency model, the lazy release consistency model. It defines a happens-before relationship between synchronization events such as lock acquires, lock releases, and barrier arrivals and departures. Updated data is only transferred when another process tries to acquire the lock and use the data, which eliminates a lot of messages.
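One standard way to track such a happens-before order across processes is a vector timestamp per synchronization interval; whether TreadMarks uses exactly this representation is not spelled out in the article, so the following is only a sketch:

    #include <stdbool.h>

    #define NPROCS 8

    /* One counter per process; interval A happens before interval B iff
     * A's vector is componentwise <= B's and the two differ. */
    struct vtime { unsigned v[NPROCS]; };

    bool happens_before(const struct vtime *a, const struct vtime *b)
    {
        bool strictly_less = false;
        for (int i = 0; i < NPROCS; i++) {
            if (a->v[i] > b->v[i]) return false;
            if (a->v[i] < b->v[i]) strictly_less = true;
        }
        return strictly_less;
    }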
Another optimization the system makes is to use a twin copy of a page to record the writes, and to send a diff to another process only when requested. This method also supports concurrent writes to the same memory page, and the writes can later be combined by the following processes. As mentioned in the paper, conflicting writes will not happen in a correct program, because lock protection is needed for concurrent writes to the same data object.
The system has been used for many applications. The authors first show how easy it is to use the TreadMarks primitives to modify two simple programs. Later they evaluate the performance of two more complicated applications on TreadMarks, which show linear speedup as the number of processors increases. I think TreadMarks is a good distributed shared memory system, but a shared memory system is not a good choice for commercial systems, because most programmers don't like "complicated" systems.
Posted by: Xiaoming Shi | March 19, 2012 05:08 PM