Succeeding With ClangFormat, Part 2: The Big Reformat

| | c++ clang workflow style

When properly integrated into a toolchain, ClangFormat can entirely do away with time wasted on discussion and enforcement of code formatting. In part 1 of this series, I laid out the case for doing so, the factors that doomed our prior attempt, and the approach we took to get it right the next time. In this part I’ll walk through all the details that have to be considered before drawing up a functional specification and reaching the next milestone: codebase conversion.

Setting the format

Landing on a format was surprisingly easy, considering how contentious formatting choices can be. In this area, MongoDB has the benefit of being towards the larger end of team size. In a large shop, developers seem more understanding that "there is a way of doing things" that might not be their personal preference. But regardless of your team's size, everyone has to agree on the fundamental principle that a standard is more important that which standard. Start your initiative with obtaining buy-in, and demonstrate your commitment to solving disputes fairly, and you will find this step is not as fraught as you might expect.

Succeeding With ClangFormat, Part 1: Pitfalls And Planning

| | c++ clang workflow style

Last year, MongoDB began using ClangFormat to apply a globally consistent format to our C++ codebase, and has maintained that uniformity ever since. The most important factor in our success wasn’t deciding on the particular format or handling git issues. It was making sure it was effortless for developers to produce properly formatted code, and integrating automated checks at every phase of our dev process.

I was the developer in charge of designing our ClangFormat implementation and integrating it into our process, as well as “chief cat herder” to achieve consensus on code format. Planning and rolling out the use of a formatting tool is not too hard; but it requires forethought, coordination, and a commitment to enabling and enforcing its use. It can be time consuming, but the end result is that everyone has only one format to grok. After, every moment of time wasted on code formatting or discussion thereof is eliminated. Maybe you know entirely different types of developers than I do, but in my experience, that's a lot of time saved.

The difficulty of maintaining consistent formatting

MongoDB is a large open source code base with over a half-million lines of code, scores of full-time developers, and many community contributors. But even with smaller projects, most developers discover the problems of working without an agreed upon format the very first time they work on a team. This irritation can lead to religious arguments over the merits of various formatting choices; but mature engineers know that a standard is more important than which standard.

The New Grad Rotation Program: Optimizing Team Fit And Enhancing Collaboration

| | hiring onboarding

MongoDB has a unique way of placing newly minted engineers on their teams. Engineers right out of college — “new grads” as we call them — try out three different teams before choosing the best fit. I recently finished my time in MongoDB’s New Grad Rotation Program. While it was challenging, I’m confident it made me a better engineer and set me up for success at my first job out of college. I loved my experience and was curious how this program came to be and what others thought of it, so I asked around. This is what I learned.

The New Grad’s Dilemma

The search for my ideal team actually began at the beginning of my senior year of college. I had just turned 21 and I was overwhelmed with some of the biggest decisions of my life. Not only did I suddenly have to choose my beer at the Alehouse, but I also had to choose where to start my career! Every paper I had written and project I had turned in was building to this moment. If I chose the wrong company, or even the wrong team at the right company, I could be set back years. When I decided to join MongoDB, I thought that my career dilemma was over for the moment, but I still had one critical decision to make.

College taught me a whole lot about computer science and a whole little about working in the industry. My college transcript would tell you I should work in algorithmic theory, but no algorithm I could devise would help me decide if I should engineer query optimization or backup automation. MongoDB has teams that work on low-level systems, front-end web development, and everything in between. If I had no idea what team to join, our recruiters and engineers certainly didn’t either.

An Ambitious Solution

I am not the only engineer to have faced these issues, and new grads are not the only ones affected by them. Three years ago our recruiting and engineering teams decided to tackle this problem with MongoDB’s New Grad Rotation Program. During their first two weeks, new grads hear about each of the 12 teams on which they can rotate, list their top five preferences, and get placed on three. They then spend six-to-eight weeks on a rotation with each team. During each rotation, new grads weigh the work they’re doing, the technologies they use, and each team’s atmosphere. Then after they have experienced them all, they rank the teams.

The rotations last six months in total, and are a huge investment of both new and experienced engineering time. However, the payoff is tremendous, as rotations nurture extraordinarily productive engineers who love their jobs, excel at them, and have a wide view of the rest of the company.

Multiplexing Golang Channels to Maximize Throughput

| | concurrency archiving golang

The Go language is great for concurrency, but when you have to do work that is naturally serial, must you forgo those benefits? We faced this question while rewriting our database backup utility, mongodump, and utilized a “divide-and-multiplex” method to marry a high-throughput concurrent workload with a serial output.

The Need for Concurrency

In MongoDB, data is organized into collections of documents. When reading from a collection, requests are often preempted, when other processes obtain a write lock in that collection. To prevent stalls from reducing overall throughput, you can enqueue reads from multiple collections at once. Thus, a previous version of mongodump concurrently read data across collections, to achieve maximum throughput.

However, since the old mongodump wrote each collection to a separate file, it did not work for two very common use cases for database backup utilities: 1) streaming the backup over a network, and 2) streaming the backup directly into another instance as part of a load operation. Our new version was designed to support these use cases.

To do that, while preserving the throughput-maximizing properties of concurrent reads, we leveraged some Golang constructs -- including reflection and channels -- to safely permit multiple goroutines to concurrently feed data into the archive. Let me show you how.

Digging into D3 Internals to Eliminate Jank over Large Data Sets

| | javascript d3 optimization ui

Browser JavaScript isn’t like most other user-facing application runtimes, where a main thread handles the UI and you can spin work off into background threads. Because it’s single-threaded, when you have to do heavy lifting, sometimes you need to get creative to keep your UI responsive.

This was the case on one of Cloud Manager’s newest features, the Visual Profiler. It was an ambitious design, and when I was given the initial mocks I was immediately excited by the prospect. As a front-end engineer, I couldn’t wait to start implementing the new chart and table.

Code-generating Away the Boilerplate in Our Migration Back to Spidermonkey

| | c++ metaprogramming embedding javascript

If creativity emerges from constraint, then embedding a JavaScript engine into your database server is a recipe for creativity.

We’ve worked with the internals of a lot of different JavaScript engines, and one thing stands out: JavaScript is hard to embed for anything but the most trivial uses. To do it well, without writing a ton of mind-numbing and maintenance-hostile duplicode, we had to use a wide slice of the code-generating capabilities of C++ in concert. In our most recent migration from the V8 engine back to SpiderMonkey, we eliminated boilerplate by:

  • Generating callbacks with unique template instantiations
  • Implementing type integrations through policy based class design
  • Generating callbacks with constrained method invocation with compile time type lists
  • Utilizing a Lippincott function to provide C++ and Javascript exception interchangeability

I’ve put together a compiling walkthrough of these techniques in use, but before we get there, an examination of the prevailing context is in order...

Building an Async Networking Layer for mongos

| | async networking c++ mongos

Many people run their MongoDB servers in a sharded cluster. In such a setup, a mongos sits between the user’s application and their sharded data. Clients connect to the mongos and send it queries, and mongos routes those queries to one or more shards to be fulfilled.

In most cases, mongos can pinpoint a single shard for each given query. However, some queries require “scatter gather” routing; in other words, mongos has to send the query to all shards, wait for their responses, and assemble them into a single master response. We could fan these requests out to shards serially, but then one slow connection would block mongos’ entire system. To do this efficiently, we needed a way to run requests concurrently.

Given the structure of the networking code in MongoDB 3.0, the only way to run requests concurrently was to run them in different threads. Some clusters have hundreds of shards—that’s a lot of requests to fan out. You can imagine what might occur in a mongos handling many requests: thread explosion!! Having too many threads can bog down a system, causing contention over hardware resources.

In 3.2, we wrote an alternate solution: asynchronous outbound networking for mongos. This new networking layer eliminates our thread explosion problem, but this new implementation brought with it difficult memory management challenges. It took a lot of experimentation, failure, iteration, and above all, obsessive testing to implement a new callback driven, asynchronous system.

Cat-Herd's Crook: How YAML test specs improve driver conformance

| | yaml drivers testing specifications

At MongoDB we write open source database drivers in ten programming languages. We also help developers in our community replicate our drivers' behavior in even more (and more exotic) languages. Ideally, all drivers behave alike; or, where they differ, the differences are written down and justified. How can we herd all these cats along the same path?

For years we failed. Each false start at standardization left us more discouraged. But we’ve recently gained momentum on standardizing our drivers. Human-readable, machine-testable specs, coded in YAML, prove which code conforms and which does not. These YAML tests are the Cat-Herd's Crook: a tool to guide us all in the same direction.