Part one of this pair explored the original algorithm the WiredTiger write-ahead log used to consolidate writes to minimize I/O. It used atomic compare-and-swap operations in two phases to accomplish this without time-consuming locking. That algorithm worked extremely well as long as no more than a few threads were running per core, but its reliance on busy-waiting to avoid locking caused a logjam once the thread count grew beyond that limit -- a serious problem, given that many MongoDB workloads run a large number of threads per core. This issue was blocking MongoDB's goal of making WiredTiger the default storage engine in v3.2.
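To make that recap concrete, here is a minimal Go sketch of the general pattern -- many threads reserving space in a shared log buffer with compare-and-swap, spinning and retrying on failure. The names and structure are mine for illustration only; WiredTiger's actual implementation is in C and considerably more involved, with multiple slots and a separate release phase.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const bufSize = 1 << 20

var (
	buf    [bufSize]byte
	offset int64 // next free byte in buf
)

// reserve claims size bytes of buf without a lock. On CAS failure it simply
// retries: cheap when few threads contend, costly when many do.
func reserve(size int64) (start int64, ok bool) {
	for {
		old := atomic.LoadInt64(&offset)
		if old+size > bufSize {
			return 0, false // buffer full; a real log would switch to a new slot
		}
		if atomic.CompareAndSwapInt64(&offset, old, old+size) {
			return old, true // phase one done: this range is ours to fill
		}
		// Another thread won the race; busy-wait and try again.
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			rec := []byte(fmt.Sprintf("record %d;", id))
			if start, ok := reserve(int64(len(rec))); ok {
				copy(buf[start:start+int64(len(rec))], rec)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("bytes consolidated:", atomic.LoadInt64(&offset))
}
```

With a handful of runnable threads, a failed CAS is retried almost immediately and the loop is cheaper than a lock; with hundreds of threads per core, those same retry loops burn CPU the winning threads need -- the seed of the logjam described above.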
This story has a happy ending thanks to my colleague, Senior Technical Services Engineer Bruce Lucas. Bruce had initially uncovered the logjam and reported it to me; together, we overcame it without compromising any other workloads. Because Bruce's thinking was not colored by the legacy of the original approach, he provided the critical insight that paved the way for the solution, allowing WiredTiger to become the default storage engine in v3.2.
Code can't be optimized; it can only be optimized for a set of conditions. When conditions change, optimizations can become bottlenecks, and when that happens, a thorough audit of assumptions might well hold the key to the solution.
The WiredTiger write-ahead log exemplifies this principle. It's a critical codepath within a high-performance storage engine, and I had optimized it heavily to avoid I/O and locking. But some of the conditions I originally targeted became invalid when WiredTiger became a storage engine in MongoDB. When a colleague investigated a case of negative scaling found during testing, he uncovered a serious bottleneck in the write-ahead log… call it a "logjam". That investigation ultimately led us to rethink our assumptions and optimize for the new conditions. We validated the new approach with a quick prototype, and then worked through the intricacies and edge cases to produce a fully realized solution.
In part one of this two-part series, I’ll dive deep into the innards of the WiredTiger write-ahead log. I’ll show how it orchestrates many threads writing to a single buffer without locking, and I’ll explain how two conflicts between that design and the new conditions produced the logjam. Part two will focus on how we eliminated the bottleneck. I’ll analyze its root causes, describe the key insight that enabled our solution, and detail the new algorithm and how it reflects our current conditions.
The Go language is great for concurrency, but when you have to do work that is naturally serial, must you forgo those benefits? We faced this question while rewriting our database backup utility, mongodump, and answered it with a “divide-and-multiplex” method that marries a high-throughput concurrent workload to a serial output.
The Need for Concurrency
In MongoDB, data is organized into collections of documents. When reading from a collection, requests are often preempted when other processes obtain a write lock on that collection. To prevent these stalls from reducing overall throughput, you can enqueue reads from multiple collections at once. Thus, a previous version of mongodump read data concurrently across collections to achieve maximum throughput, as sketched below.
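Here's a hypothetical sketch of that read-side arrangement: one goroutine per collection, each feeding its own channel. The readCollection stub and doc type stand in for real driver cursors; they are not mongodump's actual API.

```go
package main

import "fmt"

// doc stands in for one BSON document read from a collection.
type doc []byte

// readCollection streams one collection's documents into out. In the real
// tool this would iterate a database cursor; here it is stubbed out.
func readCollection(name string, out chan<- doc) {
	defer close(out)
	out <- doc(name + ": ...")
}

// dumpAll starts one reader goroutine per collection, so a write lock
// stalling one collection doesn't idle the others.
func dumpAll(collections []string) []<-chan doc {
	streams := make([]<-chan doc, 0, len(collections))
	for _, name := range collections {
		ch := make(chan doc, 64) // buffering lets unstalled readers run ahead
		go readCollection(name, ch)
		streams = append(streams, ch)
	}
	return streams
}

func main() {
	for _, s := range dumpAll([]string{"users", "orders"}) {
		for d := range s {
			fmt.Println(string(d))
		}
	}
}
```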
However, since the old mongodump wrote each collection to a separate file, it could not support two very common uses of database backup utilities: 1) streaming the backup over a network, and 2) streaming it directly into another instance as part of a load operation. Our new version was designed to support these use cases.
To do that while preserving the throughput-maximizing properties of concurrent reads, we leveraged Golang constructs -- including reflection and channels -- to safely permit multiple goroutines to feed data into the archive concurrently. Let me show you how.
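As a preview, here is a minimal sketch of the “divide-and-multiplex” shape under those assumptions: per-collection streams fan in to a single goroutine that alone owns the serial archive stream. The mux and writeArchive names are mine for illustration; the real archive format also frames each document with collection metadata, which this sketch omits.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

// doc stands in for one BSON document read from a collection.
type doc []byte

// mux merges many document streams into one channel. Each source gets its
// own forwarding goroutine; the WaitGroup closes the merged channel once
// every source is drained.
func mux(sources []<-chan doc) <-chan doc {
	merged := make(chan doc)
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(src <-chan doc) {
			defer wg.Done()
			for d := range src {
				merged <- d
			}
		}(src)
	}
	go func() {
		wg.Wait()
		close(merged)
	}()
	return merged
}

// writeArchive is the single consumer: only it touches the output stream,
// so the archive stays a valid serial byte sequence.
func writeArchive(w io.Writer, docs <-chan doc) error {
	for d := range docs {
		if _, err := w.Write(d); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	a := make(chan doc, 1)
	b := make(chan doc, 1)
	a <- doc("from collection A\n")
	b <- doc("from collection B\n")
	close(a)
	close(b)
	if err := writeArchive(os.Stdout, mux([]<-chan doc{a, b})); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Because only one goroutine ever writes to the archive, the output remains a valid serial byte stream no matter how many readers run concurrently.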