How Raygun Solves Performance Issues at 100M API Calls Per Hour

Raygun
Raygun lets you detect, diagnose and resolve issues in your web and mobile apps with greater speed and accuracy

This Promoted Post Was Written By John-Daniel Trask, Co-founder & CEO, Raygun. Inspired By Their Recent Webinar - Writing high performance .NET code


As a developer, you often feel there is a pressing need to fix issues you regard as important, but your manager doesn't always agree.

I'm sure your team could talk all day about the fast accumulation of technical debt. The priorities seem to be all out of whack with how you see things. You feel you can't deliver on time because of all the pressing problems that you can see but nobody else seems to appreciate.

Often, if we as developers can't present a business case, the plea to fix the things we deem important gets ignored.

Colleagues outside of the development team inner circle have different ways to measure success. They have to consider customer requests, impact on revenue and wider business initiatives.

The thing is, regardless of what anyone deems important or not, the worse a customer's digital experience is when using your software, the further it takes everyone away from their goals.

It's a tricky balance. Nobody sets out to create BAD software experiences for customers but it happens, a lot!

Slow and buggy software is a hard problem to communicate in business terms anyway. General sluggishness of the web or mobile application might not generate customer complaints via support tickets, yet it still negatively impacts end users.

The thing to remember is that if poor performance or generally bad software experiences end up losing a customer, it's always costing someone money.

For context, we offer error monitoring, frontend performance monitoring and application performance monitoring in one integrated platform. These products help developers find and fix errors, crashes and performance issues in their code.

The challenge is that we have to process a huge amount of data - around 2,000 requests per second and growing. In order to handle that efficiently, we put different systems and tools in place.

You can read more about Raygun's infrastructure and why we made decisions like moving from Node.js to .NET Core on StackShare here.

In this post I'll be covering how prioritising performance work can result in huge wins for you and your team, and how relating this back to your peers in business terms can help you be a more effective developer.

Monitoring what matters

We spend a lot of time ensuring that our infrastructure is performing reliably and efficiently.

That allows us to make sure our products aren't wildly expensive for the sheer amount of data that comes in, because that does cost us money.

Also, having that low cost to serve gives us the ability to do things like funnel our money back into our own business so we can grow without taking external funding.

Soon after we first launched Raygun in late 2012, we got a trial sign-up from a company running a top ten Facebook game that created 3MB of crash data per second - that's 6,660 crash reports per minute - just from one customer.

Though it almost broke us, thankfully we'd done enough work upfront to handle this influx - just. But it made us think about the future, and what we would need to do to handle much more data as we grew our customer base.

As a result of the strategies we put in place, Raygun now processes around 100 million API calls per hour, and each one of those can have over 100KB of data attached without causing any trouble for our systems.

With this in mind, when we aim to understand the cost of performance issues ourselves, we correlate our data processing time to compute time.

For example: If it takes 100ms to process a message and we receive 100 million messages per hour, that's 115 days' worth of compute time for every hour of data.

If we can work on performance and reduce the processing time from 100ms to 10ms, that volume of data would take a fraction of the time—just 11 days.

That's a fairly simple equation, but it can help put things in perspective for less technical team members.
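
To make that arithmetic easy to rerun with your own numbers, here is a minimal sketch in Python (chosen for brevity, not because Raygun uses it). The function name and constant are illustrative assumptions. The inputs are the 100 million messages per hour and the 100ms and 10ms processing times quoted in this post.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def compute_days(messages_per_hour: int, ms_per_message: float) -> float:
    """Days of compute time needed to process one hour's worth of messages."""
    total_seconds = messages_per_hour * ms_per_message / 1000
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for ms in (100, 10):
        days = compute_days(100_000_000, ms)
        print(f"{ms} ms per message: {days:.1f} days of compute per hour of data")
```

At 100ms this works out to a little under 116 days, and at 10ms to a little under 12, which matches the rounded-down figures quoted here.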

When you are looking at performance optimizations, start by measuring where the actual problem lies. I see a lot of folks just assuming there's an issue.

Here's a concrete example. We have a real user monitoring (RUM) product that tracks the load time users experience. When that data flows through our API, it's queued for processing, and our RUM worker then works through that queue.

We have some absolutely colossal customers using that product, and we were suddenly going, "Man, this needs to go faster. There's so much more data that it needs to process."

Of course, we looked at this and we thought, well, it's touching our Postgres and Druid databases nonstop throughout the critical path. Clearly, talking to another server is going to be a fairly big overhead in performance.

But when we sat down to actually profile it, it turned out that part was negligible. We had a user agent parser, custom-built in-house, that looks at the user agent string of each request and determines which browser and browser version it came from. That parser was gobbling up about 80% of the time in this case, because its string manipulation was allocating too much memory.

As human beings, we're typically not the best at guessing where these issues might lie, so it helps to always measure.
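
A real profiler is the right tool for this, but the idea of measuring each stage instead of guessing can be shown with a small timing harness. The sketch below is in Python and is purely illustrative. `StageTimer`, `parse_user_agent` and `process` are hypothetical names, the parser is a toy stand-in for the in-house one described above, and a plain list stands in for the database.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a pipeline."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of the total measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}


def parse_user_agent(ua):
    # Hypothetical stand-in parser: creates a lowercase copy of every token,
    # which is the kind of throwaway string allocation a profiler surfaces.
    for token in ua.split(" "):
        lowered = token.lower()
        if lowered.startswith(("chrome/", "firefox/", "safari/")):
            return token
    return "unknown"


def process(events, timer, sink):
    """Run each event through both stages, timing them separately."""
    for ua in events:
        with timer.stage("parse_user_agent"):
            browser = parse_user_agent(ua)
        with timer.stage("store"):
            sink.append(browser)  # stand-in for the database write


if __name__ == "__main__":
    timer, sink = StageTimer(), []
    process(["Mozilla/5.0 Chrome/120.0 Safari/537.36"] * 100_000, timer, sink)
    for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.0%} of measured time")
```

The point is only that the split between stages comes from measurement. The actual percentages will vary from machine to machine.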

A simple approach to monitoring and alerting

Asking a series of questions, we can very quickly assess where we should spend engineering time.

Is this costing us customers?

  • Are we so slow users are unhappy and leaving?
  • Are tasks taking so long that users disengage?

Is this costing us money?

  • If the problem isn't impacting customers, we consider the development effort needed
  • Engineers are actually pretty expensive, servers often are not
  • Speaking of servers, do we have too many?
  • Consider the wider costs outside just your time (updating docs, communicating changes, and testing)
  • Is it causing customers to churn (lost customers)?

We use a series of tools to detect potential problems, or 'smells' that could lead to performance issues. Raygun's core application is a Microsoft ASP.NET MVC application.

Local profiling

The JetBrains suite is an incredibly helpful addition to your developer toolbox and the dotTrace profiler is no exception. dotTrace can be attached to almost any kind of .NET process, whether it's a console application or an already running website within IIS. It's very easy to get set up and will give you a lot of detail.

One of the best things about dotTrace is that it will save all your snapshots from your previous runs. For example, this is a trace from a usage counting service within our system.

At a high level, this request is taking around 3.8 seconds to load. Digging deeper, we can see that most of that time is being taken by a database call, which we can trace to MySQL.

In this case, you could consider not going to MySQL as often, and perhaps use a transient data store like Redis, then flush Redis every so often.
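
As a rough illustration of that idea, the Python sketch below buffers counter increments in memory and writes them out in batches. It is a sketch under assumptions, not Raygun's implementation. `BufferedUsageCounter` is a hypothetical name, an in-process `Counter` stands in for the transient store (Redis in the text above), and `flush_to_database` is whatever function performs the real MySQL write.

```python
from collections import Counter
from typing import Callable, Dict


class BufferedUsageCounter:
    """Collects increments in a transient buffer and flushes them in batches."""

    def __init__(self,
                 flush_to_database: Callable[[Dict[str, int]], None],
                 flush_every: int = 1000):
        self._pending = Counter()  # stand-in for a transient store such as Redis
        self._flush_to_database = flush_to_database
        self._flush_every = flush_every
        self._since_flush = 0

    def increment(self, key: str, amount: int = 1) -> None:
        """Record usage without touching the database."""
        self._pending[key] += amount
        self._since_flush += 1
        if self._since_flush >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        """Write all pending counts in one call, then reset the buffer."""
        if self._pending:
            self._flush_to_database(dict(self._pending))
            self._pending.clear()
        self._since_flush = 0


if __name__ == "__main__":
    writes = []  # each entry represents one database round trip
    counter = BufferedUsageCounter(writes.append, flush_every=3)
    for app in ("app-a", "app-a", "app-b", "app-a"):
        counter.increment(app)
    counter.flush()
    print(writes)
```

The trade-off is durability. Anything still sitting in the buffer is lost if the process dies before the next flush, so the flush interval should reflect how much loss is acceptable.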

There are a host of other profilers you can use. For example, Visual Studio has a basic profiler built in.

Red Gate is also popular among our development team.

Using one of these tools when looking for code 'smells' eliminates guesswork, and you'll often be surprised where performance is going once you get set up with a profiler.

Diagnosing production issues

What happens if we know that we have an issue in production, but the problem isn't appearing in our development environment?

We want to be monitoring our performance in production so we can access this data.

It's always important to understand how users actually experience your software. On developer machines, which are highly powerful and typically have only one user running the application, it's really easy for performance to seem fine.

In complex scenarios where your code is executing with potentially thousands or even millions of concurrent users, being able to see how it behaves in production is really important.

I think there's been a shift in the market from server monitoring alone to wanting to include that real user part.

For example, an application performance monitoring product that's tracking what the server's doing might tell you, "Server returned the web page in 800 milliseconds."

That sounds like a pretty fast response time. We're all happy with that.

As the industry has moved on over the last ten to fifteen years, however, we've put so much more code on the front end in our JavaScript that it's no longer uncommon for the server to come back in 800 milliseconds while the user waits four and a half seconds for the page to actually be ready to use.

Server-side numbers only tell you what the server is doing, and they give you no easy visibility into what the user experience looks like. So it pays to use a frontend performance monitoring tool as well as a server side monitoring tool to get a view over the actual user experience.

To tackle the fact that load environments in production are complex, we of course use our own Application Performance Monitoring (APM) product.

If we have a hunch something's wrong, we head straight to APM, and the first screen we hit has all the information we need to do a high-level health check of our code in production.

Here we can see:

  • Apdex score, which is a computed overall health score
  • Execution time for each of the web requests or transactions
  • Load time distribution
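
For readers who haven't met Apdex before, the standard definition can be sketched in a few lines of Python. Requests at or under a target threshold T count as satisfied, requests between T and 4T count as tolerating and score half, and anything slower counts as frustrated and scores zero. The 0.5 second default below is an arbitrary assumption for illustration, and this post doesn't say which threshold Raygun's APM uses.

```python
def apdex(response_times_s, threshold_s=0.5):
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


if __name__ == "__main__":
    samples = [0.1, 0.3, 1.0, 3.0]  # seconds
    print(apdex(samples))  # two satisfied, one tolerating, one frustrated
```

A score of 1.0 means every request met the target, and the score falls towards 0 as more requests land in the frustrated band.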

In the screenshot below, we can see that, of the people affected by load time, most experienced a wait of 2.6 seconds or longer, while 50% of people or more were getting three milliseconds or less.

When we drill down further, we can see below that not much time went into the API calls, but the method execution time was extremely high. It looks like a total outage of the website.

We then take a look at the slowest request:

If we click on this, it takes us to a page listing all the requests within the timeframe we've selected.

There's only one request in the time period. We can then drill down and look at that particular performance trace as a flame chart.

The flame chart is helpful because we can see the exact request that a user had made to this URL, and see that there are database queries firing.

We see the actual queries themselves and what they came through as. This also integrates directly with your source control which saves time during resolution.

APM has become a vital part of our monitoring strategy because we can understand what is actually going on in our production environment that might be making things slow for customers.

We want to make sure that we're assuming nothing, measuring everything, and actually understanding what's going on. We want to also be able to demonstrate these issues to stakeholders as we are now able to show exactly what is wrong and estimate the resolution time.

BenchmarkDotNet

Raygun also relies on BenchmarkDotNet, arguably one of the most exciting tools for .NET that has landed in the last few years. It's become the real de facto standard in the .NET community for measuring performance.

BenchmarkDotNet does some incredible statistical analysis, allowing it to exclude outliers that muddy your data. It's a very powerful tool, and it's completely free and open source. You can find the whole thing on GitHub.

We find the most useful attributes to be:

  • Params, which allows you to run each benchmark once for every value you list for a parameter
  • GlobalSetup, which marks a method to be run once per benchmark, before all of that benchmark's iterations (IterationSetup is the one that runs before each iteration)
  • MemoryDiagnoser, which gives you a look into what is being allocated and garbage collected in that benchmark. It shows you allocated memory and each of the garbage collection cycles
  • DisassemblyDiagnoser, which allows the assembly and intermediate language of a benchmark to be output

One thing to mention is that BenchmarkDotNet is designed for micro-benchmarks. If a single operation takes 10 minutes to execute, a run with the default settings may take around a day to gather the data you need.

Therefore, ideally, you'd use it to look into small pieces of code that are performance sensitive.

Performance certainly is a feature

This post covers the high level items around improving your application performance. When building and maintaining software I encourage you to always be thinking about performance and how you can keep improving rather than leaving it to stagnate on the technical debt list.

Performance is most certainly a feature, so take the reins, measure, then improve it. Your fellow developers, managers and customers will love you for it.

You can watch the full webinar below to dive deeper into the tips and tricks for building high performance .NET applications which also covers in more detail the points discussed in this post.
