Scaling Deployments at Oscar With Pants, Aurora, and Mesos

Oscar Health
Healthcare is broken; we're trying to fix it. The Oscar team is focused on utilizing technology, design and data to humanize healthcare. We're a group of technology and healthcare professionals who looked at the current state of the US healthcare system, got frustrated by the horrible consumer experience, and decided to do something big about it. Backed by a renowned set of investors and advisors, we’ve set out to revolutionize healthcare.
Background

Founded in 2012, Oscar makes healthcare simple, transparent, and human. By putting people first, we’ve created a new kind of health insurance company, one that uses a high-tech, data-driven approach, easy-to-understand language, and a unique set of benefits to change the way people think about and interact with healthcare.

About a year and a half ago, I joined Oscar as one of our early engineers and helped build out the first versions of a number of our core products and systems (care router, payments, and billing). Today, I lead our product engineering team, which is responsible for all of our external-facing web, mobile, and physical interactions (as much as we try to save trees, we are still required by law to send some communication by mail).

Architecture

We are, by and large, a Python shop with MySQL as our primary database. Python has given us a robust set of tools and sharable code to use across our web applications, batch jobs, and services. About a year ago, once Python had become ingrained at Oscar, we made three big architectural decisions to improve our ability to implement and deploy Python projects across our teams: we consolidated to a monolithic git repo, adopted Pants as our build tool, and began moving to Apache Aurora and Apache Mesos for job running.

Prior to these initiatives, we were running a number of different projects out of different repos, all with their own build and deploy steps. This meant that any change which affected more than one component usually also required manually editing configs in each project and then carefully coordinating a domino of deploys. With that sort of overhead, we considered getting one or two deploys out in a week a success.

Today, we average more than one deploy per engineer per day (a number that’s been steadily going up as our tooling has matured). These deploys are spread out across nearly two hundred backend systems and two web projects. The biggest contributors to our ability to ramp up our deploy frequency are Pants and Aurora/Mesos.

Pants is a multi-language build tool open sourced by Twitter. The big draw to Pants for us is its ability to create a PEX (Python EXecutable) binary. It’s effectively a statically linked Python binary, so as long as you have a matching Python interpreter and are on the same OS, your PEX will just run (barring Python C extensions, which are still the Wild West as far as what they link to). There’s no messing with code checkouts, building virtualenvs, or installing C compilers on production machines - all of that can be centralized to an actual build server (Jenkins in our case). Since we had already unified on CentOS throughout our infrastructure, getting these newly built PEXes up and running on Mesos was quite straightforward.

Aurora and Mesos are two Apache Software Foundation projects that provide a cluster compute framework (Mesos) and a scheduler (Aurora) for managing ad hoc and cron jobs as well as long-running services. In order to run programs on Mesos, they must be able to run in a homogeneous environment without job-specific server tweaks. This is what makes the PEX binaries that Pants generates such a great match for Mesos. The Aurora/Mesos combo vastly simplifies the deploy process, reducing it to a few commands from the terminal, even for complex jobs with multiple instances and interdependent processes. Additionally, we’ve added some components to our service discovery layer that, coupled with Aurora’s built-in update/health check system, allow us to safely roll out new versions of highly available services with automated rollbacks if something goes wrong during a deploy.
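To make the idea of a health-checked rollout with automated rollback a bit more concrete, here is a minimal Python sketch. It is not Aurora's implementation or API; the function `rolling_update`, the `is_healthy` callback, and the in-memory `instances` dict are all hypothetical stand-ins for illustration. It only shows the general shape: update a batch, check its health, and restore the previous versions if any check fails.

```python
def rolling_update(instances, new_version, is_healthy, batch_size=1):
    """Update instances batch by batch, rolling everything back on failure.

    instances   -- dict mapping instance id -> currently running version
                   (hypothetical in-memory stand-in for a running job)
    is_healthy  -- callback(instance_id, version) -> bool, a stand-in for
                   a real health check against the updated instance
    Returns True if every batch passed its health checks, False otherwise.
    """
    previous = dict(instances)  # snapshot used for rollback
    ids = sorted(instances)

    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]

        # Move this batch to the new version.
        for instance_id in batch:
            instances[instance_id] = new_version

        # If any updated instance is unhealthy, restore the snapshot
        # and stop the rollout.
        if not all(is_healthy(i, new_version) for i in batch):
            instances.clear()
            instances.update(previous)
            return False

    return True


if __name__ == "__main__":
    job = {0: "v1", 1: "v1", 2: "v1"}
    ok = rolling_update(job, "v2", lambda i, v: True)
    print(ok, job)
```

Updating in small batches means a bad build only ever reaches a few instances before the health check catches it, which is what keeps a highly available service up during a failed deploy.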

Testing

As deploys have become simpler and more frequent, the need for a solid testing strategy has become even more crucial. Our original strategy was to simply test everything for every diff. As you can imagine, as our code base continued to grow, the number of targets that one diff could affect rose to the point where testing the world on each diff was no longer a tenable strategy. We needed something smarter.

For that, it was Pants to the rescue once more. One of the best features of Pants, out of the box, is its dependency graph. With it, we can quickly identify all of the build targets and tests that are affected by a given diff. Taking advantage of that and the parallelizable nature of Pants targets/tests, we hooked into Phabricator’s command line client (Arcanist) to push new test runs to Kochiku at the creation or update of any diff. (Kochiku is Square’s open source parallelized test runner.) Once Kochiku is done with its test run, it comments on the Phabricator diff with the build report.
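The core idea behind picking tests from a dependency graph can be sketched in a few lines of Python. This is not how Pants implements it, and the target names, the `DEPENDENCIES` and `OWNERS` tables, and the `affected_targets` function are all made up for illustration. Given the files touched by a diff, it finds the targets that own them and then walks the graph in reverse to collect everything that depends on those targets, directly or transitively.

```python
from collections import defaultdict, deque

# Hypothetical build graph: each target lists the targets it depends on.
DEPENDENCIES = {
    "common/util": [],
    "billing/lib": ["common/util"],
    "billing/service": ["billing/lib"],
    "payments/service": ["common/util"],
    "web/app": ["billing/lib"],
}

# Hypothetical mapping from a source file to the target that owns it.
OWNERS = {
    "common/util/dates.py": "common/util",
    "billing/lib/invoice.py": "billing/lib",
    "web/app/views.py": "web/app",
}


def affected_targets(changed_files, dependencies, owners):
    """Return the sorted list of targets a set of changed files can affect."""
    # Invert the graph so we can go from a target to its dependents.
    dependents = defaultdict(set)
    for target, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(target)

    # Start from the targets that directly own the changed files.
    seen = {owners[path] for path in changed_files if path in owners}
    queue = deque(seen)

    # Breadth-first walk over the reversed edges.
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    return sorted(seen)


if __name__ == "__main__":
    print(affected_targets(["billing/lib/invoice.py"], DEPENDENCIES, OWNERS))
```

A change to a leaf like `web/app` only triggers its own tests, while a change to a widely shared library fans out to everything built on top of it. Each resulting target can then be tested independently, which is what makes the work easy to spread across parallel workers.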

We’ve still got a few kinks to iron out (e.g., hooking into Phabricator’s actual tests failed/succeeded section and addressing a few issues around MySQL Embedded crashing the Kochiku workers), but overall, this strategy has been effective in allowing us to get the necessary test reports from across the entire code base in a reasonable amount of time.

Where we are going

Pants, Aurora/Mesos, and our monolithic repo helped us get to where we are today. Looking forward, we’re kicking off three major projects to continue enhancing the core of our architecture and tooling.

Firstly, we want to do for our web deploys what we did for our background jobs and services. Namely, while we’ve moved all of our background jobs and services to Mesos, we still rely on specialized boxes and Ansible scripts to manage our web deploys. However, we recently finished migrating both our web applications to being deployable from PEX files. We still have a few issues to resolve before we can cut everything over (like which WSGI container to use, how we can better handle health checks, and how we can make Aurora updates/rollbacks atomic from an external traffic perspective), but our goal is to retire all of our special purpose web boxes soon.

Secondly, we are now seeing a need for more mature services. Today, the majority of our services talk to one another using JSON over HTTP as a holdover from our early days. JSON was great for getting things out the door, and it was easy to shovel things directly to the browser or mobile API. That said, with more teams consuming the same services, and with our language pool expanding to include the likes of Java and Golang, having better defined contracts between services is becoming increasingly important. In order to get these better defined contracts, we’re beginning to standardize on Apache Thrift. Thrift was a natural choice for us with its built-in support from Pants as well as its robust client libraries in all of the languages we’re using.

Finally, on the frontend, we’re seeking to improve our testability and component reuse. While in the past we’ve primarily used Backbone.js in conjunction with server rendered HTML (via Jinja2 templates), we’re now starting to use React.js (coupled with ImmutableJS) for new projects. Along with strong contracts from our Thrift backend services, this new architecture allows us to set well defined boundaries around each piece of a system and, thus, better validate that each piece of the stack is working as expected. We’re excited to see the continued reliability and productivity improvements that having this component based architecture will yield.

Engineering Team

When I joined Oscar we had fewer than ten engineers, a marketing splash page, and a mandate to get a functional product out the door - and only two months before everything went live. Now we’re nearly forty engineers spread across five teams, focused on different business and engineering needs. In addition to product engineering, we have teams focused on technical operations, data systems, internal tools, and core business systems. While each team has a specific domain, every team has a well-rounded group of engineers to allow them to tackle any problem that’s thrown their way. Teams also frequently collaborate on larger projects that exceed the scope of any one team (for example, product engineering and data engineering are currently working together on the next generation of our care router).

Hiring

We’re currently hiring across all our engineering teams (and the rest of the company too). While many of our engineers come from other tech companies (Google, Facebook, Tumblr, Amazon), we want to work with people who are interested in helping us tackle the challenge of improving healthcare for America.
