How SendGrid Scaled to 40 Billion Emails Per Month

Twilio SendGrid
Twilio SendGrid is a digital communication platform that enables businesses to engage with their customers via email reliably, effectively and at scale.

Written by Seth Ammons, Principal Engineer at SendGrid


Some background

Founded in 2009, after graduating from the TechStars program, SendGrid developed an industry-disrupting, cloud-based email service to solve the challenges of reliably delivering emails on behalf of growing companies. We currently send over a billion emails daily (with a peak of over 2 billion emails sent in a single day) for companies like Spotify, Yelp, Uber, and Airbnb. We focus on work that enables our customers to be successful and reach their customers. If you open up your inbox, chances are that some of that mail was sent through SendGrid.

I'm a principal engineer here, and I've worked on nearly all aspects of our backend infrastructure over the last seven years. Currently, I work with the team responsible for our outbound MTA (mail transfer agent, the software that communicates with inbox providers). The team is based out of our Irvine office and I really enjoy coming to work every day. We focus heavily on testable, high-quality software, and our ability to drive excellence in engineering is aided by our manager, who was formerly a developer and team lead for our delivery and processing teams.

Besides sending mail to, and handling responses from, the multitude of inbox providers, our work encompasses the handling of bounced email and unsubscribes. We work closely, often pairing on the more difficult tasks, striving to make sure that wanted mail is delivered, and maintaining suppression systems to prevent unwanted mail from being processed.


Early days at SendGrid

SendGrid's backend architecture has changed a lot since its inception. What started off as a glorified Postfix install grew into a large push-based system, and that system is currently transforming into a more scalable pull-based model. As part of this transition, we are moving more and more of our services into the cloud.

In our legacy, push-based system, SMTP or HTTP API requests came in via our edge nodes, and those nodes pushed the requests to our processing cluster's on-disk queues. Once there, mail was mutated per user settings (link tracking, unsubscribe footers, dynamic content substitution, etc.), and then pushed to our MTA software's on-disk queues. After being placed into the MTA's queue, the MTA worked to send the mail out as quickly and efficiently as possible while applying algorithms to enhance the deliverability rate of that mail.

This system worked really well, but it did have some downsides, such as potential event processing delays or even mail loss in the event of total node failure. In light of these drawbacks, we've worked towards a pull-based model backed by a distributed file system.


Our current architecture

In our new, pull-based model, the basic stages are still there (edge node receiving, processing, delivering); however, instead of pushing onto queues, our services now pull from our custom distributed queues (more on this shortly). This change allows our systems to be ephemeral, stateless services that can be spun up or down to match customer needs in a more real-time fashion, and this will be more evident as we increase our presence in and usage of Amazon Web Services (AWS).
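To illustrate why pulling suits stateless workers, here is a minimal Go sketch. It is an assumption-laden toy, not our queue: `Queue`, `Pull`, `Ack`, and `Release` are hypothetical names, the queue lives in memory, and it is not safe for concurrent use. The point it shows is that a pulled message stays in the queue's care until the worker acknowledges it, so a worker that disappears does not take mail with it.

```go
package main

import "fmt"

// Queue is a hypothetical in-memory stand-in for a shared, durable queue.
type Queue struct {
	pending  []string
	inflight map[int]string // leased to a worker, not yet acknowledged
	nextID   int
}

func NewQueue(msgs ...string) *Queue {
	return &Queue{pending: msgs, inflight: map[int]string{}}
}

// Pull leases the next pending message to the calling worker.
func (q *Queue) Pull() (id int, msg string, ok bool) {
	if len(q.pending) == 0 {
		return 0, "", false
	}
	msg, q.pending = q.pending[0], q.pending[1:]
	q.nextID++
	q.inflight[q.nextID] = msg
	return q.nextID, msg, true
}

// Ack marks a leased message as fully handled.
func (q *Queue) Ack(id int) { delete(q.inflight, id) }

// Release puts an unacknowledged message back, for example after a worker died.
func (q *Queue) Release(id int) {
	if msg, ok := q.inflight[id]; ok {
		delete(q.inflight, id)
		q.pending = append(q.pending, msg)
	}
}

func main() {
	q := NewQueue("mail-a", "mail-b")

	id, msg, _ := q.Pull()
	fmt.Println("worker 1 pulled", msg, "and then crashed")
	q.Release(id) // the lease is given up, so the message becomes pending again

	for id, msg, ok := q.Pull(); ok; id, msg, ok = q.Pull() {
		fmt.Println("worker 2 delivered", msg)
		q.Ack(id)
	}
}
```

In a real deployment the release would more likely be driven by a lease timeout than by an explicit call; that detail is left out to keep the sketch deterministic.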

The majority of our backend services are written (or being rewritten) in Go. The concurrent nature of our systems made Go a natural and easy choice as we moved from Perl's AnyEvent and Python's Twisted. These services leverage Redis and/or Redis Sentinel for caching and MySQL for data persistence. Metrics are emitted to Graphite and displayed in Grafana. Logs are forwarded from STDOUT to Syslog, where we have utilities that slurp these logs up for Kafka and Splunk. We have alerts set up through PagerDuty, which feeds off data from Splunk Alerts and Sensu checks.

Remember when I mentioned a custom, distributed queue? For queueing between services, one of our teams developed a specialized "heap of heaps" service called SGS (SendGrid Scheduler) that is backed by Ceph. This core piece of technology was needed to ensure that we could fairly dequeue messages without crowding out smaller senders or less popular recipient domains when larger users send out large blasts to popular inbox domains.
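To make the "heap of heaps" idea concrete, here is a minimal Go sketch. It is not SGS itself: the names (`Scheduler`, `senderQueue`, `served`) and the fairness rule (serve whichever sender has been served least so far) are assumptions chosen for illustration, everything lives in memory, and nothing here touches Ceph.

```go
package main

import (
	"container/heap"
	"fmt"
)

// message is a hypothetical queued email; seq records arrival order.
type message struct {
	sender string
	seq    int
}

// inner is a min-heap of one sender's messages, oldest first.
type inner []message

func (h inner) Len() int           { return len(h) }
func (h inner) Less(i, j int) bool { return h[i].seq < h[j].seq }
func (h inner) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *inner) Push(x any)        { *h = append(*h, x.(message)) }
func (h *inner) Pop() any {
	old := *h
	m := old[len(old)-1]
	*h = old[:len(old)-1]
	return m
}

// senderQueue is one sender's heap plus a count of messages already served.
type senderQueue struct {
	served int
	msgs   inner
}

// outer is a min-heap of non-empty sender queues, least-served first.
type outer []*senderQueue

func (h outer) Len() int { return len(h) }
func (h outer) Less(i, j int) bool {
	if h[i].served != h[j].served {
		return h[i].served < h[j].served
	}
	return h[i].msgs[0].seq < h[j].msgs[0].seq // tie: oldest head message wins
}
func (h outer) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *outer) Push(x any)   { *h = append(*h, x.(*senderQueue)) }
func (h *outer) Pop() any {
	old := *h
	q := old[len(old)-1]
	*h = old[:len(old)-1]
	return q
}

// Scheduler is the heap of heaps. It is not safe for concurrent use.
type Scheduler struct {
	bySender map[string]*senderQueue
	ready    outer
	nextSeq  int
}

func NewScheduler() *Scheduler {
	return &Scheduler{bySender: map[string]*senderQueue{}}
}

// Enqueue adds one message for sender.
func (s *Scheduler) Enqueue(sender string) {
	q, ok := s.bySender[sender]
	if !ok {
		q = &senderQueue{}
		s.bySender[sender] = q
	}
	wasEmpty := q.msgs.Len() == 0
	heap.Push(&q.msgs, message{sender: sender, seq: s.nextSeq})
	s.nextSeq++
	if wasEmpty {
		heap.Push(&s.ready, q) // the queue becomes eligible for dequeue
	}
}

// Dequeue returns the oldest message of the least-served sender.
func (s *Scheduler) Dequeue() (message, bool) {
	if s.ready.Len() == 0 {
		return message{}, false
	}
	q := heap.Pop(&s.ready).(*senderQueue)
	m := heap.Pop(&q.msgs).(message)
	q.served++
	if q.msgs.Len() > 0 {
		heap.Push(&s.ready, q)
	}
	return m, true
}

func main() {
	s := NewScheduler()
	for i := 0; i < 4; i++ {
		s.Enqueue("big-sender")
	}
	s.Enqueue("small-sender")
	for m, ok := s.Dequeue(); ok; m, ok = s.Dequeue() {
		fmt.Println(m.sender, m.seq)
	}
}
```

With a plain FIFO queue, the small sender's single message would wait behind the whole blast. Here it comes out second, because after one delivery the big sender has been served more than the small one. A production scheduler would also need a rule for newly arriving senders and a second dimension for recipient domains, both omitted here.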

Roping this back in, what does our MTA's tech stack look like? The previous, legacy version was Perl with AnyEvent that had a parent process that forked off child processes. The parent scheduled work, and the children delivered mail. The switch to Go removed the callback hell and the forking, as Go's concurrency model is much easier to work with compared to AnyEvent. Being statically typed and compiled lets us actually know what variables are in scope in any given function, the absence of which was a major drawback to the Perl service. So, yeah, we like Go.

Requests come into the MTA, and we secure a connection to the inbox provider using either the customer's IP address, or a shared one from our pool. Once we establish a connection, we blast out as much mail for that user to that email domain as the inbox provider will allow. We rinse and repeat, ensuring that each customer and each domain get a fair opportunity for delivering mail.


Testing Transactional Email

One of our recent and interesting challenges is how to ensure system fidelity as we have moved, and continue to move, from a Perl-focused stack to Go. There are basically two schools of thought on how to achieve this: test all the things, or deploy quickly and revert quickly in production. We've done a lot of both. We've put a bit of effort into improving our tooling and monitoring around quickly deploying and quickly reverting. In fact, just yesterday, our newest team member and I deployed a new caching strategy, and it only made it onto one production node (after working great in our pre-production environments) for around a minute before it was reverted.

There are certain things that are nearly impossible to test before production, and in this case, it was iptables. Production-specific settings aside, we have a system where customers expect things to just work™. At our scale, and on our team, using production as testing could potentially result in lost mail. At the same time, we need to refactor and/or rewrite code to keep up with our growth. Doing this with confidence requires tests, tests, and more tests.

We have our suites of unit tests, unit-integration tests (tests that verify our immediate integration with databases, caches, and first-order connected services), and system-integration tests that actually use our email sending API to send an email through our entire system and ensure that expected events are sent to users' webhooks and messages are received by inboxes with the expected data.


Dockerizing SendGrid

For that unit-integration layer, we leverage Docker. Our incoming edge is the point where the upstream service finishes processing a message and hands it to us for delivery, and our outgoing edge is the actual communication with someone's inbox. We don't actually want to set up a bunch of receiving MTAs and such, but we still need to test behavior at that layer. Our solution is still a work in progress, but it gets the lion's share of use cases covered so we can confidently refactor and push new features and know we did not break anything.

This Docker setup leverages DNSMasq for setting up MX and A records and ensures they point to running mock inbox sinks. These inboxes are configured from a base image with multiple options. We can specify that the sink's TLS certificate is expired or improperly set up, and we can have the sinks respond slowly or with given errors at different parts of the SMTP conversation. We can ensure that we are backing off and deferring email if the inbox provider says to do so. This detailed faking of the outside world allows us to automate all kinds of outside behavior and ensure that our services behave as expected.
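As a rough illustration of what such a sink does, here is a small Go sketch of a mock inbox that returns a configurable reply at the RCPT TO step, plus a check that the sending side defers on a transient error. This is not our actual sink image: `serve`, `newSinkConn`, `probe`, and `classify` are hypothetical names, the connection is an in-memory `net.Pipe` rather than a container behind DNSMasq, and TLS and slow responses are left out. The classification follows the standard SMTP reply classes (2yz success, 4yz transient failure, 5yz permanent failure).

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"net/smtp"
	"net/textproto"
	"strings"
)

// serve plays the role of a mock inbox. It accepts every command except
// RCPT TO, which it answers with rcptReply (for example "451 try later").
func serve(conn net.Conn, rcptReply string) {
	defer conn.Close()
	fmt.Fprint(conn, "220 sink.test ESMTP\r\n")
	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		cmd := strings.ToUpper(sc.Text())
		switch {
		case strings.HasPrefix(cmd, "RCPT"):
			fmt.Fprint(conn, rcptReply+"\r\n")
		case strings.HasPrefix(cmd, "QUIT"):
			fmt.Fprint(conn, "221 bye\r\n")
			return
		default:
			fmt.Fprint(conn, "250 ok\r\n")
		}
	}
}

// newSinkConn returns the client side of an in-memory connection to a sink.
func newSinkConn(rcptReply string) net.Conn {
	client, server := net.Pipe()
	go serve(server, rcptReply)
	return client
}

// probe runs the SMTP conversation up to RCPT TO and reports the reply code.
// A successful RCPT is reported as 250.
func probe(conn net.Conn) (int, error) {
	c, err := smtp.NewClient(conn, "sink.test")
	if err != nil {
		return 0, err
	}
	defer c.Close()
	if err := c.Mail("sender@example.test"); err != nil {
		return 0, err
	}
	err = c.Rcpt("rcpt@example.test")
	if err == nil {
		return 250, nil
	}
	var reply *textproto.Error
	if errors.As(err, &reply) {
		return reply.Code, nil
	}
	return 0, err
}

// Action is what the sending side should do with a message after a reply.
type Action string

const (
	Delivered Action = "delivered"
	Deferred  Action = "deferred" // transient failure: back off and retry later
	Bounced   Action = "bounced"  // permanent failure: do not retry
	Unknown   Action = "unknown"
)

// classify maps an SMTP reply code to an action by its reply class.
func classify(code int) Action {
	switch code / 100 {
	case 2:
		return Delivered
	case 4:
		return Deferred
	case 5:
		return Bounced
	default:
		return Unknown
	}
}

func main() {
	for _, reply := range []string{"250 ok", "451 4.7.1 try again later", "550 5.1.1 no such user"} {
		code, err := probe(newSinkConn(reply))
		if err != nil {
			fmt.Println("probe failed:", err)
			continue
		}
		fmt.Println(code, classify(code))
	}
}
```

Because the sink's behavior is just a parameter, each awkward provider response becomes one more table entry in a test rather than a new piece of infrastructure.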

In addition to the fancy Docker setup, we have captured and sanitized production logs for the behavior of our legacy Perl MTA, and we can test that the log output from the new Go version behaves the same way as the old version. These tests are set up to allow us to switch between the legacy and new version of the MTA and ensure that both systems behave in a legacy-compatible way. Not only can we ensure that we operate against a variety of issues we've seen over time from inboxes, but we know that the newest version of our MTA continues to cover all the same expected behaviors of the legacy version.
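The log comparison can be sketched roughly as follows. The log format shown (`key=value` pairs with `ts` and `msgid` fields) is invented for this example and is not our real log format; `normalize` and `diffLogs` are likewise hypothetical names. The idea is only that fields expected to differ between runs are masked before the legacy and new output are compared line by line.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// volatile matches fields that legitimately differ between runs.
// The field names are assumptions made for this sketch.
var volatile = regexp.MustCompile(`\b(ts|msgid)=\S+`)

// normalize masks volatile fields and collapses runs of whitespace.
func normalize(line string) string {
	masked := volatile.ReplaceAllString(line, "${1}=<masked>")
	return strings.Join(strings.Fields(masked), " ")
}

// diffLogs compares two logs line by line after normalization and
// returns one description per mismatch.
func diffLogs(legacy, candidate []string) []string {
	var diffs []string
	if len(legacy) != len(candidate) {
		diffs = append(diffs, fmt.Sprintf("line count: legacy=%d candidate=%d", len(legacy), len(candidate)))
	}
	n := len(legacy)
	if len(candidate) < n {
		n = len(candidate)
	}
	for i := 0; i < n; i++ {
		want, got := normalize(legacy[i]), normalize(candidate[i])
		if want != got {
			diffs = append(diffs, fmt.Sprintf("line %d: legacy=%q candidate=%q", i+1, want, got))
		}
	}
	return diffs
}

func main() {
	legacy := []string{"ts=1500000000 msgid=abc event=deferred code=451"}
	candidate := []string{"ts=1500000042  msgid=xyz event=deferred code=451"}
	fmt.Println("mismatches:", len(diffLogs(legacy, candidate)))
}
```

Running the same captured scenarios against either MTA and requiring an empty diff is what lets a test suite like this switch between the two implementations.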

Oh, and these tests are still fast. All of our unit-integration tests are run and an artifact produced and ready for deployment in under five minutes in our CI system. If it is not pulling Docker images, our local development environment can run these unit-integration tests in under 10 seconds.

We develop locally in Docker, as described above. Our docker-compose file spins up containers with fancy DNS settings and all our dependencies, allowing us to test the MTA against a variety of MX and TLS settings, alongside a variety of potential inbox responses and behaviors. Everyone uses their editor of choice and we often pair up on more complex tasks to prevent siloed system understanding.

When we've gone through code reviews (every code and config change goes through a code review) and feel good about the level of automated testing (no one can sign off on their own code's functionality; a quality assurance engineer or other developer has to verify functionality), we merge our changes via a bot that interacts with GitHub (the bot maintains our versions and change logs). After Buildkite has a green build and our binary is shipped to our repo servers, we are good to roll out deploys to our data centers and to keep pushing the needle on the performance of our system.


In Closing

Our systems are changing, our capacity is increasing, and our problems continue to be interesting. Every Monday, I'm still excited to show up, roll up my sleeves, and help push our product forward to tackle greater and greater scale. It is great to be part of a company that strives to keep innovating and improving and to better serve our customers every day.

SendGrid is always looking for top talent to join our team. For more information on positions at SendGrid, please visit our careers page at: https://sendgrid.com/careers/.
