Tell us about the time you made DNS resolution concurrent in Python on Mac and BSD.
No, no, you do not want to hear that story, my friends. It is nothing but old lore and #ifdefs.
But you made Python more scalable. The saga of Steve Jobs was sung to you by a mysterious wizard with a fanciful nickname! Tell us!
Gather round, then. I will tell you how I unearthed a lost secret, unbound Python from old shackles, and banished an ancient and horrible Mutex Troll.
Let us begin at the beginning...
A long time ago, in the 1980s, a coven of Berkeley sorcerers crafted an operating system. They named it after themselves: the Berkeley Software Distribution, or BSD. For generations they nurtured it, growing it and adding features. One night, they conjured a powerful function that could resolve hostnames to IPv4 or IPv6 addresses. It was called getaddrinfo. The function was mighty, but in years to come it would grow dangerous, for the sorcerers had not made getaddrinfo thread-safe.
As ages passed, BSD spawned many offspring. There were FreeBSD, OpenBSD, NetBSD, and in time, Mac OS X. Each made its copy of getaddrinfo thread safe, at different times and different ways. Some operating systems retained scribes who recorded these events in the annals. Some did not.
Because getaddrinfo is ringed round with mystery, the artisans who make cross-platform network libraries have mistrusted it. Is it thread safe or not? Often, they hired a Mutex Troll to stand guard and prevent more than one thread from using getaddrinfo concurrently. The most widespread such library is Python's own socket module, distributed with Python's standard library. On Mac and other BSDs, the Python interpreter hires a Mutex Troll, who demands that each Python thread hold a special lock while calling getaddrinfo.
Fuzz testing is a method for subjecting a codebase to a tide of hostile input to supplement the test cases engineers create on their own. In part one of this pair, we looked at the hybrid nature of our fuzzer -- how it combines “smart” and “dumb” fuzzing to produce input random enough to provoke bugs, but structured enough to pass input validation and test meaningful codepaths. To wrap up, I’ll discuss how we isolate signal from the noise a fuzzer intrinsically produces, and the tooling that augments the root cause analyses we do when the fuzzer finds an issue.
In part one of two, we examine how our fuzzer hybridizes the two main types of fuzzing to achieve greater coverage than either method alone could accomplish. Part two will focus on the pragmatics of running the fuzzer in a production setting and distilling a root cause from the complex output fuzz tests often produce.
What's a fuzzer?
Fuzzing, or fuzz testing, is a technique of generating randomized, unexpected, and invalid inputs to a program to trigger untested code paths. Fuzzing was originally developed in the 1980s and has since proven to be effective at ensuring the stability of a wide range of systems, from filesystems to distributed clusters to browsers. As people attempt to make fuzzing more effective, two philosophies have emerged: smart, and dumb fuzzing. And as the state of the art evolves, the techniques that are used to implement fuzzers are being partitioned into categories, chief among them being “generational” and “mutational.” In many popular fuzzing tools, smart fuzzing corresponds to generational techniques, and dumb fuzzing to mutational techniques, but as we will see, this is not an intrinsic relationship. Indeed, in our case, the situation is precisely reversed.
Until last year, Jeremy Mellema was a history teacher. Now, he's teaching computer programming. When I visited his class in the Bronx this month, he had 30 students with 30 MacBooks, completing exercises in Python. They had just finished a lesson on data types, and now they were tackling variables. In Jeremy's class, the first variable assignment is:
tupac = "Greatest of All Time!!"
Computer Science for All
A year ago, New York City mayor Bill de Blasio announced Computer Science for All, an $80 million public-private partnership. The goal is to teach computer science to every student at every public school. But first, the schools need curricula and 5000 teachers need training.
Here at MongoDB, our VP of Education Shannon Bradshaw oversees MongoDB University, which trains IT professionals to use MongoDB. When he heard about CS4All, he wanted us to contribute. He proposed that we set aside budget for two paid fellowships, and recruit public school teachers to spend the summer with us. We would develop them as teachers, and help build curricula they could take back into schools this fall. MongoDB staff would share our expertise, our office space, our equipment, and the MongoDB software itself.
Shannon pitched his proposal to the company like this: "As many of us know, it’s still unusual for students to encounter computer science, let alone databases, in their classrooms before entering college. I believe this absence directly contributes to the gender and racial disparity we see today across our industry." The CS4All project improves access to these subjects for many more students in our city, and MongoDB could be part of it from the beginning.
If you’ve been following our series on succeeding with ClangFormat, you already know all about why we did it and the steps we took to ensure the migration went well. In this concluding post, we’ll talk about how to succeed after the integration and reformat are complete. We learned some valuable lessons about what happens in the immediate aftermath of bringing ClangFormat into our system and have been refining our workflows ever since. Here’s a look at our occasionally bumpy road and how you might have a smoother one.
We’ve all been there: you’re pitching a solution when one of your team members interjects, “let’s not reinvent the wheel, here.” Whether it’s based on fear or wisdom, the charge of reinventing the wheel is a death sentence for ideas. It typically isn’t worth the time and resources to implement a new version of an old, ubiquitous idea—though you’d never know that with all the different kinds of actual, literal wheels you use every day.
For most developers, continuous integration (CI)—the automated building and testing of new code pushed into your repository—is one of those never-reinvented wheels. You set up one of a few long-standing solutions like Travis or Jenkins, rejigger your test code to fit into that solution’s organizational model, and then avoid messing with it too much. Here at MongoDB, challenging this approach rewarded us incredibly.
Instead of working around an off-the-shelf solution that didn’t fit our needs, we wound up reinventing the wheel and built our own continuous integration system called Evergreen. It gives us a powerful, efficient infrastructure that lets us test changes quickly -- and keeps our engineers happy as well. Our journey to creating Evergreen was born of necessity and stalked by uncertainty, but we don’t regret it. Reinventing the wheel allowed us to build a near-perfect CI tool for our use case, seriously evaluate powerful new technologies, and have a lot of fun doing it.
When properly integrated into a toolchain, ClangFormat can entirely do away with time wasted on discussion and enforcement of code formatting. In part 1 of this series, I laid out the case for doing so, the factors that doomed our prior attempt, and the approach we took to get it right the next time. In this part I’ll walk through all the details that have to be considered before drawing up a functional specification and reaching the next milestone: codebase conversion.
Setting the format
Landing on a format was surprisingly easy, considering how contentious formatting choices can be. In this area, MongoDB has the benefit of being towards the larger end of team size. In a large shop, developers seem more understanding that "there is a way of doing things" that might not be their personal preference. But regardless of your team's size, everyone has to agree on the fundamental principle that a standard is more important that which standard. Start your initiative with obtaining buy-in, and demonstrate your commitment to solving disputes fairly, and you will find this step is not as fraught as you might expect.
Last year, MongoDB began using ClangFormat to apply a globally consistent format to our C++ codebase, and has maintained that uniformity ever since. The most important factor in our success wasn’t deciding on the particular format or handling git issues. It was making sure it was effortless for developers to produce properly formatted code, and integrating automated checks at every phase of our dev process.
I was the developer in charge of designing our ClangFormat implementation and integrating it into our process, as well as “chief cat herder” to achieve consensus on code format. Planning and rolling out the use of a formatting tool is not too hard; but it requires forethought, coordination, and a commitment to enabling and enforcing its use. It can be time consuming, but the end result is that everyone has only one format to grok. After, every moment of time wasted on code formatting or discussion thereof is eliminated. Maybe you know entirely different types of developers than I do, but in my experience, that's a lot of time saved.
The difficulty of maintaining consistent formatting
MongoDB is a large open source code base with over a half-million lines of code, scores of full-time developers, and many community contributors. But even with smaller projects, most developers discover the problems of working without an agreed upon format the very first time they work on a team. This irritation can lead to religious arguments over the merits of various formatting choices; but mature engineers know that a standard is more important than which standard.
MongoDB has a unique way of placing newly minted engineers on their teams. Engineers right out of college — “new grads” as we call them — try out three different teams before choosing the best fit. I recently finished my time in MongoDB’s New Grad Rotation Program. While it was challenging, I’m confident it made me a better engineer and set me up for success at my first job out of college. I loved my experience and was curious how this program came to be and what others thought of it, so I asked around. This is what I learned.
The New Grad’s Dilemma
The search for my ideal team actually began at the beginning of my senior year of college. I had just turned 21 and I was overwhelmed with some of the biggest decisions of my life. Not only did I suddenly have to choose my beer at the Alehouse, but I also had to choose where to start my career! Every paper I had written and project I had turned in was building to this moment. If I chose the wrong company, or even the wrong team at the right company, I could be set back years. When I decided to join MongoDB, I thought that my career dilemma was over for the moment, but I still had one critical decision to make.
College taught me a whole lot about computer science and a whole little about working in the industry. My college transcript would tell you I should work in algorithmic theory, but no algorithm I could devise would help me decide if I should engineer query optimization or backup automation. MongoDB has teams that work on low-level systems, front-end web development, and everything in between. If I had no idea what team to join, our recruiters and engineers certainly didn’t either.
An Ambitious Solution
I am not the only engineer to have faced these issues, and new grads are not the only ones affected by them. Three years ago our recruiting and engineering teams decided to tackle this problem with MongoDB’s New Grad Rotation Program. During their first two weeks, new grads hear about each of the 12 teams on which they can rotate, list their top five preferences, and get placed on three. They then spend six-to-eight weeks on a rotation with each team. During each rotation, new grads weigh the work they’re doing, the technologies they use, and each team’s atmosphere. Then after they have experienced them all, they rank the teams.
The rotations last six months in total, and are a huge investment of both new and experienced engineering time. However, the payoff is tremendous, as rotations nurture extraordinarily productive engineers who love their jobs, excel at them, and have a wide view of the rest of the company.
The Go language is great for concurrency, but when you have to do work that is naturally serial, must you forgo those benefits? We faced this question while rewriting our database backup utility, mongodump, and utilized a “divide-and-multiplex” method to marry a high-throughput concurrent workload with a serial output.
The Need for Concurrency
In MongoDB, data is organized into collections of documents. When reading from a collection, requests are often preempted, when other processes obtain a write lock in that collection. To prevent stalls from reducing overall throughput, you can enqueue reads from multiple collections at once. Thus, a previous version of mongodump concurrently read data across collections, to achieve maximum throughput.
However, since the old mongodump wrote each collection to a separate file, it did not work for two very common use cases for database backup utilities: 1) streaming the backup over a network, and 2) streaming the backup directly into another instance as part of a load operation. Our new version was designed to support these use cases.
To do that, while preserving the throughput-maximizing properties of concurrent reads, we leveraged some Golang constructs -- including reflection and channels -- to safely permit multiple goroutines to concurrently feed data into the archive. Let me show you how.