Testing in Production and other Tips on Observability

Charity Majors tests in production - and she is proud of it. In this article, Charity shares why she believes in testing in production and how to do it right with today's tooling and observability.

Charity advocates for observability-driven development, which calls for writing code with one window open for your IDE and another open on production, instrumenting as you go. Just as when you comment your code, you have a little thread in the back of your head going, "future me is going to want to know this" - but comments fail because they are just a description of your system, disconnected from reality. When that running commentary is rooted in reality through instrumentation, it cannot lie to you in the same way or go out of date so quickly. Of course, there should be tests that code must pass before it reaches production, but continually testing in pre-production environments has diminishing returns. Some bugs simply do not exist outside of a production environment, especially as your team gets more and more skilled.
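As a rough sketch of what "instrumenting as you go" can look like, here is a minimal example; the Event type and send helper are hypothetical stand-ins, not a specific vendor SDK:

package main

import "fmt"

// Event is a hypothetical wide structured event: one per unit of work, with as
// many fields as "future me" might want to query on.
type Event map[string]interface{}

// send stands in for shipping the event to whatever observability backend you use.
func send(ev Event) { fmt.Printf("event: %v\n", ev) }

// applyDiscount shows instrumenting as you go: at the exact point where
// future-you will wonder what happened, record it as a field, not a comment.
func applyDiscount(ev Event, total float64, code string) float64 {
	ev["discount_code"] = code
	if code == "SPRING21" {
		ev["discount_applied"] = true
		return total * 0.9
	}
	ev["discount_applied"] = false
	return total
}

func main() {
	ev := Event{"endpoint": "/checkout"}
	fmt.Println(applyDiscount(ev, 100, "SPRING21"))
	send(ev)
}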

Benefits of Testing in Production

Tight Feedback Loops

If you are instrumenting as you go and deploying changes within minutes, everything you need to solve a bug is still in your muscle memory. You look at the live production behaviour of the code you just wrote, see it through the lens of the intent you wrote it with, and ask: "is it doing what I wanted it to do?" Charity claims you can find 80% of all bugs in that tight little feedback loop, before customers ever experience them. Facebook researched this a few months ago and found that the cost of finding and fixing a bug goes up exponentially from the moment the bug is written until it is found. Charity backs it up: "When you are writing a function, you make a typo, you backspace it. That's fast, right. As soon as it gets out there... it just gets harder and harder and more and more expensive to locate it and find it and fix it." Catching bugs this early also avoids the confusion involved in context switching.

Tight feedback loops are not new. Many people experienced them back in the old days of uploading files to a live site via FTP, when fixing problems as they came up was cheap and easy. That changed with the introduction of DevOps and the more sustainable additions of CI/CD pipelines and testing. But these tight loops are still possible in any setup with the right tools. For example, Telepresence, a tool for Kubernetes, lets you swap a container running in the cluster for a process on your local machine.

Speeding up Deployment Time

The quicker you push to live, the faster you produce and the better your DORA metrics (the four key indicators of how high-performing your team is). Engineers waste a lot of time waiting - for people and for tools - and big CI/CD pipelines make every step take longer. "Maximum deploy time should be 15 minutes," says Charity. The faster the better: you don't want to wait an hour for a deployment, because by then your instincts will have weakened and less of the useful context will still be in your muscle memory.
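As a small illustration of the metric in question, here is a sketch that computes commit-to-deploy lead time and flags anything over the 15-minute target; the function and example timestamps are assumptions, not from the talk:

package main

import (
	"fmt"
	"time"
)

// maxDeployTime is the 15-minute ceiling Charity suggests for a single deploy.
const maxDeployTime = 15 * time.Minute

// leadTime returns how long a change waited between being committed and
// reaching production, one of the DORA-style indicators mentioned above.
func leadTime(committedAt, deployedAt time.Time) time.Duration {
	return deployedAt.Sub(committedAt)
}

func main() {
	committed := time.Date(2021, 5, 3, 14, 2, 0, 0, time.UTC)
	deployed := time.Date(2021, 5, 3, 14, 20, 0, 0, time.UTC)

	lt := leadTime(committed, deployed)
	fmt.Printf("lead time: %s\n", lt)
	if lt > maxDeployTime {
		fmt.Println("warning: over the 15-minute deploy target")
	}
}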

The Right Tools for Testing in Production

There are amazing tools out there that allow people to test in production in a low-impact, safe way: feature flags, routing meshes, observability. Your 'Testing in Production' toolset should feature the following:

  • Progressive delivery - you should be able to ship to one node and progressively raise the gate to send more traffic (see the sketch below this list)
  • Tight Feedback Loops - be able to see it live as you make it
  • Dashboard which allows for filtering by very fine parameters
  • Multiple versions running in production (optional) - this allows for quick insight.
  • Automatic deployment - no human gates once you release

The previous technique of adding log lines everywhere has given testing in production a bad name. It happens too often that you put the log line inside a loop instead of outside it, and suddenly you have taken the logging cluster down or filled up the local disk.
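As a sketch of the progressive-delivery item above, here is a minimal percentage-based feature gate; the flag value, percentages, and user IDs are assumptions for illustration, not a particular product:

package main

import (
	"fmt"
	"hash/fnv"
)

// rolloutPercent is the share of traffic the new code path receives; raise it
// progressively as the change proves itself in production.
var rolloutPercent = uint32(5)

// inRollout buckets a user deterministically, so the same user keeps seeing
// the same variant while the gate is being raised.
func inRollout(userID string) bool {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()%100 < rolloutPercent
}

func main() {
	for _, user := range []string{"alice", "bob", "carol"} {
		if inRollout(user) {
			fmt.Println(user, "-> new code path")
		} else {
			fmt.Println(user, "-> old code path")
		}
	}
}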

The Need for Observability

Testing in production works best when you have a good observability setup in place, but observability isn't just for teams that test in production. When the system is slow, what started it? If you don't know what changed, finding the cause is like looking for a needle in a haystack.

Facebook built an internal tool for this called Scuba. It had a terrible UI, but it let users slice and dice in real time on high-cardinality dimensions: filter by app ID, then by endpoint, then by build ID, and so on. Nothing like it had been available before, and it drastically reduced the time it took to solve problems - "from open-ended days maybe to, like, seconds. It wasn't even an engineering problem anymore. It was like a support problem," says Charity.

When she left Facebook, Charity couldn't see herself working anywhere without a tool like that. She didn't want to go back to that feeling of fumbling in the dark for a long-forgotten error, so she built her own - called Honeycomb. The idea is simple. Every single component in a microservice mesh generates time-series data, yet with the old method of predefined dashboards you would need to decide up front which metrics to track in order to catch a system failure - and most errors aren't that uniform.
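Before looking at how Honeycomb itself works, here is a tiny sketch of what slicing and dicing on high-cardinality dimensions means in practice; the events and field names are invented for illustration:

package main

import "fmt"

// event mirrors the wide structured events described above; the field names
// and values are made up for this example.
type event map[string]string

func main() {
	events := []event{
		{"status": "error", "build_id": "b42", "endpoint": "/export"},
		{"status": "error", "build_id": "b42", "endpoint": "/export"},
		{"status": "ok", "build_id": "b41", "endpoint": "/export"},
		{"status": "error", "build_id": "b42", "endpoint": "/import"},
	}

	// Break the errors down by build_id - the kind of ad-hoc slicing that Scuba
	// and Honeycomb turn from a days-long hunt into a seconds-long query.
	counts := map[string]int{}
	for _, ev := range events {
		if ev["status"] == "error" {
			counts[ev["build_id"]]++
		}
	}
	fmt.Println("errors by build_id:", counts)
}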

How Honeycomb Works

1. When a request enters a service, an empty Honeycomb event is initialised and pre-populated with a variety of parameters that describe the environment. One can add more parameters along the way, such as a shopping cart ID.
2. At the request's exit, or on error, the data in its event is sent as an arbitrarily wide structured data blob; in a maturely instrumented system it might have 300-500 dimensions, including things like a request ID and a span ID.
3. One can then find correlations between the data captured across requests. Say there is a spike in errors: what do all of those error events have in common? Oh, it's all the events with this build ID, for this language pack, on this device ID.
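A minimal instrumentation sketch of steps 1 and 2, assuming Honeycomb's libhoney-go client; the write key, dataset name, and fields shown here are placeholders, not taken from the talk:

package main

import (
	"net/http"
	"time"

	libhoney "github.com/honeycombio/libhoney-go"
)

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// Step 1: initialise an event as the request enters the service and
	// pre-populate it with fields describing the environment and the request.
	ev := libhoney.NewEvent()
	ev.AddField("endpoint", r.URL.Path)
	ev.AddField("cart_id", r.URL.Query().Get("cart_id"))

	// Step 2: at the request's exit (or on error), send the whole wide blob in one go.
	defer func() {
		ev.AddField("duration_ms", time.Since(start).Milliseconds())
		ev.Send()
	}()

	w.WriteHeader(http.StatusOK)
}

func main() {
	// WriteKey and Dataset are placeholders.
	libhoney.Init(libhoney.Config{WriteKey: "YOUR_WRITE_KEY", Dataset: "my-service"})
	defer libhoney.Close()

	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}

Step 3 then happens at query time, where those events can be broken down by any of the fields they carry.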