Simplicity is the ultimate sophistication.
TL;DR An experiment to show that a lock-free queue designed to service only a single producer and single consumer can complete 1B write-read transactions in less than a quarter (27%) of the time that a multiple producer and multiple consumer lock based queue will take. Further, replacing the modulus function with a bit mask leads to further performance gains (24%) and then using the different memory models appropriately results in yet more performance gains (18%), and very cool graphs can be drawn using plotly!
The libfbp general purpose course grained concurrent queue, described here, fbp::gpcg_concurrent_queue does exactly what it says on the tin. It is flexible offering both bounded and unbounded construction with a simple course grained single lock that is used to serialise both enqueues and dequeues by multiple threads; by which it offers its most useful quality in that it is provably correct. This quality makes it ideal for development and testing as it offers a point of reliable behaviour. However, because the single lock is a highly contended resource, particularly in a multiple-producer-multiple-consumer (MPMC) setting and uses a rather heavy weight ssynchronisationmethod in the form of the C++11 std::lock then it is "slow" - or it at least it is when compared with more specialised alternatives.
The libfbp general purpose course grained concurrent queue, described here, fbp::gpcg_concurrent_queue does exactly what it says on the tin. It is flexible offering both bounded and unbounded construction with a simple course grained single lock that is used to serialise both enqueues and dequeues by multiple threads; by which it offers its most useful quality in that it is provably correct. This quality makes it ideal for development and testing as it offers a point of reliable behaviour. However, because the single lock is a highly contended resource, particularly in a multiple-producer-multiple-consumer (MPMC) setting and uses a rather heavy weight ssynchronisationmethod in the form of the C++11 std::lock then it is "slow" - or it at least it is when compared with more specialised alternatives.
Why should queue speed matter?
Certainly, the performance of any libfbp application depends on many factors which will predominantly be domain or problem related. However, given that, by definition, FBP is an ensemble of connected compute entities, each with the potential to be strongly connected, then pushing the threshold at which fbp::connection communication speed becomes the rate limiting factor to be as low as possible seems a reasonable design goal.
sample,transactions,mpmc_lock,spsc_atomic,spsc_atomic_2n,spsc_atomic_2n_mem
1,1000,0.341675,0.102256,0.089984,0.085439
2,10000,2.20124,0.639295,0.448542,0.384298
3,100000,20.4598,5.54733,4.237,3.96572
4,1000000,286.771,76.4742,54.7361,36.6485
5,10000000,2043.46,557.767,475.868,409.063
6,100000000,19096.3,5454.46,4231.87,3345.69
7,1000000000,190250,51226.7,45168.4,34635.3
MPMC vs SPSC
What are the best design decisions that reduce the threshold at which communication speed is the rate limiting factor? Using a lock free approach to the message queue underlying fbp::connection stands out as the most promising, particularly in the special case of a single-producer-single-consumer (SPSC) connection where a circular buffer with atomically updated head and tail pointers uses the much more lightweight and non-contended std::atomic<> for synchronisation.Hypothesis:
Thats a lock-free SPSC queue will significantly outperform a MPMC lock based queue even in the absence of synchronisation contention.
Lock-Free Single-Producer-Single-Consumer Circular Queue
Based on this excellent article[1] the fbp::spsc_circular_buffer template class was developed and performance tested against the fbp::gpcg_concurrent_queue. The lock-free queue was tested in 3 forms:- Using native std::atomic buffer iterators and modulus operator % to limit them to a circular range.
- Limiting the the buffer capacity to a power of 2 such that the modulus operator % can be dropped in favour of masking out the redundant higher bits of the head and tail iterators in order to limit them to a circular range.
- Finally adding atomic load and store methods with the suggested memory model enhancements[1].
Materials:
- std::chrono::high_resolution_clock as described here.
- fbp::gpcg_concurrent_queue as described here.
- fbp::spsc_circular_buffer as described below.
Method:
Each queue type was timed with ms resolution against an increasing number of write-read transaction pairs ranging from 1 thousand to 1 billion transactions using the following code:Results:
This snippet was generalised and the performance results dumped to file as csv format file:sample,transactions,mpmc_lock,spsc_atomic,spsc_atomic_2n,spsc_atomic_2n_mem
1,1000,0.341675,0.102256,0.089984,0.085439
2,10000,2.20124,0.639295,0.448542,0.384298
3,100000,20.4598,5.54733,4.237,3.96572
4,1000000,286.771,76.4742,54.7361,36.6485
5,10000000,2043.46,557.767,475.868,409.063
6,100000000,19096.3,5454.46,4231.87,3345.69
7,1000000000,190250,51226.7,45168.4,34635.3
This data was then loaded into the excellent online scientific graphing software at plotly to produce a graph (below) displaying the comparison of queue performance (ms) between fbp::gpcg_concurrent_queue (mpmc_lock) and the 3 variations of fbp::spsc_circular_buffer (spsc_atomic, spsc_atomic_2n, and spsc_atomic_2n_mem) as described above.
Assumptions:
- That repeated timing samples are independent of each other.
- Thats timing sample have a very narrow or leptokutic normal distribution.
- That each queue is structure persistent throughout the test.
- That different queue capacities will not affect the performance.
Analyisis:
The graph and data table demonstrates that whilst each queue type has the same O(N) complexity the lock free fbp::spsc_circular_buffer variations can complete the same number of transactions as the single lock based fbp::gpcg_concurrent_queue queue in 27%, 24% and 18% of the time!
Coefficients count when complexity is equal.
Therefore, despite their shared O(N) complexity, the speedup coefficient for the lock free SPSC approach will push the limiting factor threshold of libfbp connections way below that of a lock based approach.
Conclusion:
By choosing to restrict connections between components to a single producer and a single producer then a lock free connection can be implemented which, by virtue of its transaction performance, significantly lowers the threshold at which the connection becomes the rate limiting step when compared with a single lock queue structure.
Improvements and future work:
There are alternatives to a single lock and it is likely that a dual lock queue will outperform a single lock queue but unlikely that it will outperform a lock-free implementation. However, this remains to be shown.
Further, the assumptions 1, 2 & 4 also remain to be shown.
No comments:
Post a Comment