
Monday, 1 January 2018

C++ Mellor-Crummey & Scott (MCS) Lock - Milestones In Computer Science


I have been reading Herlihy, Maurice; Shavit, Nir. The Art of Multiprocessor Programming, Revised Reprint. Elsevier Science. (Kindle Edition) and converting the Java lock code into C++. Here is my favourite so far:

The Mellor-Crummey & Scott Spin Lock

The Mellor-Crummey & Scott (MCS) Lock, due to John Mellor-Crummey and Michael Scott, improves upon their ticket lock by expanding the spinlock into a per-thread structure. An MCS lock is thereby able to eliminate much of the cache-line bouncing experienced by simpler locks, especially in the contended case.

The MCS Lock uses an explicit linked list of synchronization variables, into which threads' synchronization variables are enqueued in order of arrival. It avoids thread starvation by guaranteeing that an enqueued thread will, eventually, get access.

The MCS Lock is, therefore, fair and scalable, and is the primary example of the linked-list queue lock family of locking strategies.

Its creators were awarded the 2006 Edsger W. Dijkstra Prize in Distributed Computing for their 1991 paper, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors."



"Mellor-Crummey and Scott's paper introduced the MCS queue-based mutual exclusion lock: probably the most influential practical mutual exclusion algorithm of all time...[it] is fast, scalable, and fair in a wide variety of multiprocessor systems, and eliminates serious drawbacks of other algorithms, such as the need to pre-allocate memory for a fixed, predetermined number of threads." - from the Dijkstra Prize citation

For implementation, the MCS Lock requires a list of very simple node structures, each containing the thread's own synchronization variable and a pointer to the next waiting thread's node.

N.B. Each of these variables should be cache aligned in order to avoid the performance penalty from contention because of the cache coherence phenomenon termed false sharing.

False sharing occurs when adjacent synchronization variables share a single cache line. A write to one synchronization variable invalidates that variable's cache line, which causes invalidation traffic to processors that are spinning on unchanged but adjacent synchronization variables that happen to fall into the same cache line.

The evil that is false sharing can be avoided by ensuring that each node element consumes a cache-line-sized block of memory - almost uniformly 64 bytes on Intel and AMD processors at all cache levels. Filling up what would otherwise be unused memory to meet the cache line size requirement is termed padding.

To acquire the lock, a thread appends its own synchronization variable node at the tail of the linked list. If the queue was not previously empty, it sets the predecessor node's 'next' field to refer to its own node. The thread then spins on the local locked synchronization field in its own node, waiting until its predecessor sets this field to false.


Spinning

Spinning, as a lock strategy, has its own issues. According to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. (PAUSE is backwards compatible and turns into a NOP instruction on older CPUs.) There are three reasons why using PAUSE improves performance:
  • It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
  • It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
  • It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn, considerably reduces power consumption.
As often seen in the literature, an otherwise empty "spin-wait loop" has some undesirable effects: branch misprediction, thread starvation on HT processors and overly high power consumption. In short, it is a pretty inefficient way to wait, and both the impact and the solution are architecture, platform and application specific.

The usual solution seems to be to add either cpu_relax() on Linux or YieldProcessor() on Windows to the loop body - or, if x86 specific, to use the PAUSE instruction directly, whose sole purpose is to improve the performance of spin-wait loops.

When exiting a spin-wait loop, a processor can suffer a severe performance penalty because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop; inserting it also greatly reduces the processor's power consumption.

N.B. The PAUSE instruction is not the same as std::this_thread::yield, which hands CPU control back to the thread scheduler. PAUSE, instead, keeps the thread on the CPU and merely hints that the loop is a spin-wait, so the CPU avoids speculatively filling its pipeline with loads that would cause memory order violations, and saves power while it waits.

In order to make the MCS lock program more readable, the compiler-specific details of implementing the PAUSE instruction are separated out into their own header file.


cpu_relax.h 

#ifndef CPU_RELAX_H
#define CPU_RELAX_H

#if defined(_MSC_VER)
#include <intrin.h>
#endif

namespace sync {

    inline void cpu_relax() {
#if defined(_MSC_VER)
        _mm_pause();
#elif defined(__GNUC__) || defined(__clang__)
        asm volatile("pause" : : : "memory");
#endif
    }

}

#endif // CPU_RELAX_H

mcs_lock.h 

#ifndef MCS_LOCK_H
#define MCS_LOCK_H

#include <atomic>
#include <cstdint>

#include "cache_info.h"
#include "cpu_relax.h"

namespace sync {

    class mcs_lock {

        struct mcs_node {
            //both fields are atomic: the successor spins on 'locked' while the
            //predecessor writes it, and unlock() reads 'next' while a newly
            //arrived thread writes it - plain variables would be a data race
            std::atomic<bool> locked{true};
            uint8_t pad1[CACHELINE_SIZE - sizeof(std::atomic<bool>)];
            std::atomic<mcs_node*> next{nullptr};
            uint8_t pad2[CACHELINE_SIZE - sizeof(std::atomic<mcs_node*>)];
        };

        static_assert(sizeof(mcs_node) == 2 * CACHELINE_SIZE,
                      "mcs_node must occupy exactly two cache lines");

    public:

        void lock();
        void unlock();

    private:

        std::atomic<mcs_node*> tail{nullptr};
        //n.b. local_node is shared by every mcs_lock instance, so a thread may
        //hold at most one mcs_lock at a time
        static thread_local mcs_node local_node;
    };

}

#endif // MCS_LOCK_H

mcs_lock.cpp

#include "mcs_lock.h"

namespace sync {

    void mcs_lock::lock() {
        //to acquire the lock a thread atomically appends its own local node at
        //the tail of the list, returning the tail's previous contents
        //(acq_rel so that this node's prior reuse is ordered before it becomes
        //visible to any successor)
        auto prior_node = tail.exchange(&local_node,
                                        std::memory_order_acq_rel);
        if (prior_node != nullptr) {
            local_node.locked.store(true, std::memory_order_relaxed);
            //the list was not previously empty: set the predecessor's next
            //field to refer to this thread's own local node...
            prior_node->next.store(&local_node, std::memory_order_release);
            //...then spin on the local locked field, waiting until the
            //predecessor sets it to false
            while (local_node.locked.load(std::memory_order_acquire))
                cpu_relax();
        }
        //now first in the queue, own the lock and enter the critical section...
    }

    void mcs_lock::unlock() {
        //...leave the critical section
        //check whether this thread's local node's next field is null
        if (local_node.next.load(std::memory_order_acquire) == nullptr) {
            //if so, then either:
            // 1. no other thread is contending for the lock, or
            // 2. there is a race with another thread that has swapped the tail
            //    but not yet linked itself in
            //to distinguish between these cases atomic compare exchange the
            //tail field: if the call succeeds, no other thread is trying to
            //acquire the lock, tail is set to nullptr, and unlock() returns
            mcs_node* p = &local_node;
            if (tail.compare_exchange_strong(p, nullptr,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
                return;
            }
            //otherwise, another thread is in the process of trying to acquire
            //the lock, so spin waiting for it to finish linking itself in
            while (local_node.next.load(std::memory_order_acquire) == nullptr)
                cpu_relax();
        }
        //in either case, once the successor has appeared, set its locked field
        //to false, indicating that the lock is now free
        mcs_node* successor = local_node.next.load(std::memory_order_acquire);
        successor->locked.store(false, std::memory_order_release);
        //at this point no other thread can access this node and it can be
        //reused
        local_node.next.store(nullptr, std::memory_order_relaxed);
    }

    thread_local mcs_lock::mcs_node mcs_lock::local_node{};

}
