
Monday, 1 January 2018

C++ Mellor-Crummey & Scott (MCS) Lock - Milestones In Computer Science


I have been reading Herlihy, Maurice; Shavit, Nir. The Art of Multiprocessor Programming, Revised Reprint. Elsevier Science. (Kindle Edition) and converting the Java lock code into C++. Here is my favourite so far:

The Mellor-Crummey & Scott Spin Lock

The Mellor-Crummey & Scott (MCS) Lock, due to John Mellor-Crummey and Michael Scott, improves upon their ticket lock by expanding the spinlock into a per-thread structure. An MCS lock is thereby able to eliminate much of the cache-line bouncing experienced by simpler locks, especially in the contended case.

The MCS Lock uses an explicit linked list of synchronization variables, into which threads' synchronization variables are enqueued in order of arrival. It avoids thread starvation by guaranteeing that an enqueued thread will, eventually, get access.

The MCS Lock is, therefore, fair and scalable, and is the primary example of the linked-list queue lock family of locking strategies.

Its creators were awarded the 2006 Edsger W. Dijkstra Prize in Distributed Computing for their 1991 paper, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors."



"Mellor-Crummey and Scott's paper introduced the MCS queue-based mutual exclusion lock: probably the most influential practical mutual exclusion algorithm of all time...[it] is fast, scalable, and fair in a wide variety of multiprocessor systems, and eliminates serious drawbacks of other algorithms, such as the need to pre-allocate memory for a fixed, predetermined number of threads." - from the Dijkstra Prize citation

For implementation, the MCS Lock requires a list of very simple node structures, each containing the thread's own synchronization variable and a pointer to the next waiting thread's node.

N.B. Each of these variables should be cache aligned in order to avoid the performance penalty from contention because of the cache coherence phenomenon termed false sharing.

False sharing occurs when adjacent synchronization variables share a single cache line. A write to one synchronization variable invalidates that variable's cache line, which causes invalidation traffic to processors that are spinning on unchanged but adjacent synchronization variables that happen to fall into the same cache line.

The evil that is false sharing can be avoided by ensuring that each node element consumes a cache-line-sized block of memory - almost uniformly 64 bytes on Intel and AMD processors at all cache levels. Filling up what would otherwise be unused memory to meet the cache line size requirement is termed padding.

To acquire the lock, a thread appends its own synchronization variable node at the tail of the linked list. If the queue was not previously empty, it sets the predecessor node's 'next' field to refer to its own node. The thread then spins on the local locked synchronization field in its own node, waiting until its predecessor sets this field to false.


Spinning

Spinning, as a lock strategy, has its own issues. According to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. (PAUSE is backwards compatible and turns into a NOP instruction on older CPUs.) There are three reasons why using PAUSE improves performance:
  • It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
  • It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
  • It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn, considerably reduces power consumption.
As often seen in the literature, an otherwise empty "spin-wait loop" has some undesirable effects: branch misprediction, thread starvation on HT processors and overly high power consumption. In short, it is a pretty inefficient way to wait, and both the impact and the solution are architecture, platform and application specific.

The usual solution seems to be to add either cpu_relax() on Linux or YieldProcessor() on Windows to the loop body - or, if x86 specific, to use the PAUSE instruction directly, whose sole purpose is to improve the performance of spin-wait loops.

When exiting a spin-wait loop, a processor can suffer a severe performance penalty because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop; inserting it also greatly reduces the processor's power consumption.

N.B. The PAUSE instruction is not the same as std::this_thread::yield, which hands CPU control back to the thread scheduler. PAUSE, instead, keeps the thread on the CPU and merely hints that the loop is a spin-wait, so the CPU avoids speculatively filling its pipeline with loads that would cause memory order violations, and saves power while it waits.

In order to make the MCS lock program more readable, the compiler-specific details of implementing the PAUSE instruction are separated out into their own header file.


cpu_relax.h 

#ifndef CPU_RELAX_H
#define CPU_RELAX_H

#if defined(_MSC_VER)
#include <intrin.h>
#endif

namespace sync {

    inline void cpu_relax() {
#if defined(_MSC_VER)
        _mm_pause();
#elif defined(__GNUC__) || defined(__clang__)
        asm volatile("pause" : : : "memory");
#endif
    }

}

#endif // CPU_RELAX_H

mcs_lock.h 

#ifndef MCS_LOCK_H
#define MCS_LOCK_H

#include <atomic>
#include <cstdint>

#include "cache_info.h"
#include "cpu_relax.h"

namespace sync {

    class mcs_lock {

        struct mcs_node {
            //both fields are atomic: the successor spins on 'locked' while the
            //predecessor writes it, and unlock() reads 'next' while a newly
            //arrived thread writes it - plain variables would be a data race
            std::atomic<bool> locked{true};
            uint8_t pad1[CACHELINE_SIZE - sizeof(std::atomic<bool>)];
            std::atomic<mcs_node*> next{nullptr};
            uint8_t pad2[CACHELINE_SIZE - sizeof(std::atomic<mcs_node*>)];
        };

        static_assert(sizeof(mcs_node) == 2 * CACHELINE_SIZE,
                      "mcs_node must occupy exactly two cache lines");

    public:

        void lock();
        void unlock();

    private:

        std::atomic<mcs_node*> tail{nullptr};
        //n.b. local_node is shared by every mcs_lock instance, so a thread may
        //hold at most one mcs_lock at a time
        static thread_local mcs_node local_node;
    };

}

#endif // MCS_LOCK_H

mcs_lock.cpp

#include "mcs_lock.h"

namespace sync {

    void mcs_lock::lock() {
        //to acquire the lock a thread atomically appends its own local node at
        //the tail of the list, returning the tail's previous contents
        //(acq_rel so that this node's prior reuse is ordered before it becomes
        //visible to any successor)
        auto prior_node = tail.exchange(&local_node,
                                        std::memory_order_acq_rel);
        if (prior_node != nullptr) {
            local_node.locked.store(true, std::memory_order_relaxed);
            //the list was not previously empty: set the predecessor's next
            //field to refer to this thread's own local node...
            prior_node->next.store(&local_node, std::memory_order_release);
            //...then spin on the local locked field, waiting until the
            //predecessor sets it to false
            while (local_node.locked.load(std::memory_order_acquire))
                cpu_relax();
        }
        //now first in the queue, own the lock and enter the critical section...
    }

    void mcs_lock::unlock() {
        //...leave the critical section
        //check whether this thread's local node's next field is null
        if (local_node.next.load(std::memory_order_acquire) == nullptr) {
            //if so, then either:
            // 1. no other thread is contending for the lock, or
            // 2. there is a race with another thread that has swapped the tail
            //    but not yet linked itself in
            //to distinguish between these cases atomic compare exchange the
            //tail field: if the call succeeds, no other thread is trying to
            //acquire the lock, tail is set to nullptr, and unlock() returns
            mcs_node* p = &local_node;
            if (tail.compare_exchange_strong(p, nullptr,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
                return;
            }
            //otherwise, another thread is in the process of trying to acquire
            //the lock, so spin waiting for it to finish linking itself in
            while (local_node.next.load(std::memory_order_acquire) == nullptr)
                cpu_relax();
        }
        //in either case, once the successor has appeared, set its locked field
        //to false, indicating that the lock is now free
        mcs_node* successor = local_node.next.load(std::memory_order_acquire);
        successor->locked.store(false, std::memory_order_release);
        //at this point no other thread can access this node and it can be
        //reused
        local_node.next.store(nullptr, std::memory_order_relaxed);
    }

    thread_local mcs_lock::mcs_node mcs_lock::local_node{};

}
