Last Updated on September 23, 2023 by KnownSense
The resiliency design principle of microservices is all about making our software reliable and available, and ensuring that if we do have any failures within our back‑end microservices, each service fails fast and provides alternative functionality so that the user doesn’t face any disruption.
Let’s discuss the various steps to develop a resilient system.
Design for Failure
The first key idea is that each of our microservices should be designed to expect the failure of downstream dependencies and services. So in a situation where our Product microservice fails, our Backend for Frontend API microservice should handle it as elegantly as possible without causing a cascading failure, where the Backend for Frontend API throws its own separate error because the Product microservice failed, and that error is then chained back down to the user on the client application. As well as elegantly handling the failure of our downstream microservices and providing alternative functionality, we should also assume that our network, i.e. the connectivity between our components, is prone to failure and issues such as latency.
So in the above situation, our Order and Promotion microservices are perfectly fine, but the connectivity between our Backend for Frontend API and these two microservices has been lost, and it’s the responsibility of the Backend for Frontend API microservice to quickly detect this issue and respond back to the client application so that the user doesn’t face any disruption. It might be that we temporarily display a message saying that there’s a temporary issue and the user should try again later.
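To make that concrete, here is a minimal sketch in Python, assuming a Backend for Frontend with hypothetical fetch_orders and fetch_promotions helpers, of how the BFF can fail fast on a lost connection and still return a usable response:

```python
# Minimal sketch of graceful degradation in a Backend for Frontend (BFF).
# The service names and fetch_* helpers are hypothetical placeholders.

def fetch_orders(user_id):
    """Calls the Order microservice (may raise on a network failure)."""
    raise ConnectionError("order service unreachable")  # simulate an outage

def fetch_promotions(user_id):
    """Calls the Promotion microservice (may raise on a network failure)."""
    raise ConnectionError("promotion service unreachable")  # simulate an outage

def build_home_screen(user_id):
    """Fail fast and degrade gracefully instead of cascading the error."""
    response = {"orders": [], "promotions": [], "warnings": []}
    try:
        response["orders"] = fetch_orders(user_id)
    except ConnectionError:
        response["warnings"].append("Orders are temporarily unavailable, please try again later.")
    try:
        response["promotions"] = fetch_promotions(user_id)
    except ConnectionError:
        response["warnings"].append("Promotions are temporarily unavailable.")
    return response  # the client still gets a usable payload

print(build_home_screen(user_id=42))
```

The point is that the failed calls are caught at the BFF, so the client application receives a friendly message instead of a chained error.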
You should also design for the fact that third‑party APIs can be inconsistent and unreliable and could have connectivity issues, and therefore you should design for this type of failure as well.
In this scenario, we’ve created a dedicated Postage microservice which handles any issues from this third‑party API. Again, this is a useful pattern: using a microservice to shield an unreliable third‑party API and handle the error situations within that microservice so that they don’t affect the rest of our architecture. Instead, we can use the Postage microservice to provide alternative functionality when we have issues with this third‑party API. So in this situation, when the third‑party API fails, the alternative functionality the Postage microservice provides is to return a previously fetched result from a local cache instead of connecting back to this unreliable API.
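As a rough sketch, assuming a hypothetical call_postal_info_api function and a simple in-memory cache, the Postage microservice’s fallback behaviour might look something like this:

```python
# Sketch of the Postage microservice shielding the unreliable third-party API.
# call_postal_info_api and the cache layout are illustrative assumptions.

_cache = {"SW1A 1AA": {"price": 3.50, "currency": "GBP"}}  # previously fetched results

def call_postal_info_api(postcode):
    """Calls the external Postal Info System API (may raise when it is down)."""
    raise TimeoutError("third-party API not responding")  # simulate an outage

def get_postage_quote(postcode):
    try:
        quote = call_postal_info_api(postcode)
        _cache[postcode] = quote           # remember every good result for later
        return quote
    except (TimeoutError, ConnectionError):
        if postcode in _cache:
            return _cache[postcode]        # alternative functionality: serve the cached result
        raise                              # nothing cached for this postcode, fail fast

print(get_postage_quote("SW1A 1AA"))       # -> the cached quote, despite the outage
```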
Resiliency Patterns
The good news is that, as well as using a cache, there are other resiliency communication patterns we can use to cope with these types of downstream failures.
One of these patterns is known as the circuit breaker pattern, which is a special bit of connectivity code used within your calling client or service that keeps a count of how many failures it receives from a downstream component. After a certain number of failures, it stops trying to connect to that downstream component and instead lets the service’s code fail quickly, so that we can respond back to our client application sooner. At the same time, we can provide that alternative functionality, like fetching data from a local cache instead of from the failed downstream component. The circuit breaker will eventually try the downstream component again to see if the issue has been resolved.
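Here is a minimal sketch of the idea in Python (in practice you would typically reach for a battle-tested resilience library rather than rolling your own; the thresholds below are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: counts failures, opens after a threshold,
    fails fast while open, and probes the downstream call again after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # don't even try
            self.opened_at = None  # cooldown elapsed, probe the downstream again
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failure_count = 0  # a success resets the count
        return result
```

The caller catches the fail-fast error and serves the alternative functionality, for example a result from the local cache.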
The other resiliency communication pattern we’re using here is the retry pattern, which allows us to retry a connection when it fails, and it works well with the circuit breaker pattern. The retry pattern should always be used in conjunction with something like a circuit breaker, because the last thing you want to do is retry indefinitely, especially if your microservices architecture originally broke because there was too much load; generating even more load with endless retry connections will only make things worse.
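A simple sketch of a bounded retry with exponential backoff, assuming the downstream call raises ConnectionError or TimeoutError on failure, might look like this:

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=0.2):
    """Retry a failing call a bounded number of times with exponential backoff.
    Bounding the attempts and backing off avoids piling extra load onto a
    downstream service that is already struggling."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and let the circuit breaker / fallback take over
            time.sleep(base_delay * (2 ** (attempt - 1)))  # 0.2s, then 0.4s, doubling each time
```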
Another pattern that should be used in conjunction with these two is the timeout pattern, which allows you to set how long you will wait for a connection to a downstream component to respond. So when you’re connecting to an unreliable downstream API, you don’t want to be waiting for minutes; you probably want to wait no more than a second or so, so that your user at the other end, on the mobile application or on the website, also only has to wait a short amount of time.
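As an illustration, assuming the downstream call is made over HTTP with the requests library and an example URL, the timeout pattern is often just a matter of passing a short timeout and treating a slow response as a failure:

```python
import requests

def request_postage_quote(postcode):
    """Give the downstream API at most one second to respond; treat anything
    slower as a failure so the end user isn't left waiting."""
    try:
        response = requests.get(
            "https://postal-info.example.com/quotes",  # illustrative URL
            params={"postcode": postcode},
            timeout=1.0,  # seconds; applies to both connecting and reading
        )
        response.raise_for_status()
        return response.json()
    except (requests.Timeout, requests.ConnectionError):
        return None  # fail fast; the caller can fall back to a cached quote
```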
The other resiliency pattern we’ve already briefly mentioned is having a caching strategy. In our scenario where the Postage microservice is dealing with an unreliable third‑party API, on previous calls we always cached the results we received from the Postal Info System API in a local cache at the Postage microservice end. Then, when the third‑party API did stop responding, we were able to provide alternative functionality, i.e. return a previously fetched result from our local cache.
So by using all four of these patterns on all our downstream component connections, we can have our architecture and our software fail fast and provide alternative functionality. In this section, I’ve only briefly introduced each of these resiliency communication patterns.
Later on, in upcoming articles, we will go into a lot more detail. Another thing we will cover in more detail is the use of asynchronous communication and a component like a message broker, but it’s worth mentioning in this article because a component that uses asynchronous communication also increases the resiliency of our microservices architecture. This is because when our microservices communicate with other downstream components via a message broker using asynchronous communication, the communication is in a fire‑and‑forget style, i.e. we’re not waiting for a response, and this makes our architecture a lot more reliable. If the recipient of that message is currently down or facing issues, it doesn’t have to pick the message up from the message broker immediately. When the issue is resolved, the downstream service can pick the message up from the message broker and then process that data.
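As a small illustration of fire-and-forget publishing, assuming a RabbitMQ broker reachable on localhost, the pika client library, and an illustrative orders queue, the sender simply hands the message to the broker and moves on:

```python
import json
import pika  # RabbitMQ client; assumes a broker is running on localhost

# Publisher side: fire-and-forget. We hand the message to the broker and
# return immediately, without waiting for the downstream service.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)  # survive broker restarts

order = {"order_id": 1001, "postcode": "SW1A 1AA"}
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=json.dumps(order),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
# If the consuming service is down right now, the message simply waits on the
# queue and is processed once that service recovers.
```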
Active Backups
Another key strategy for achieving the resiliency design principle is to have active backups of everything. The key word here is active, which means your backup software or hardware components are actively receiving a share of the traffic. Even when there isn’t an issue, they are actively functioning; they’re not just passive backups sitting there waiting for an issue to happen before they come into play. For example, our microservice instances are active backups because they receive their share of the traffic via a load balancer day in, day out (see the sketch below). The same idea of active backups applies to all components within our microservices architecture: we should have two or more of everything, and each one should be active. We have seen many situations in the past where a company had a backup data center with a copy of the entire architecture within it, but when the main data center went down and they tried to bring the backup data center into play, none of that software architecture worked because the configuration and the data were completely out of date. It was a passive backup. This is why it’s key that all your backups are active and in play day to day, so that when you do need to lean on them, they are actually ready to go.
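As a toy illustration of the idea, assuming three hypothetical product service instances sitting behind a round-robin load balancer, every instance stays in active use rather than waiting idle as a passive copy:

```python
import itertools

# All instances are active and receive their share of traffic day to day,
# so none of them is a stale, passive copy. Instance URLs are illustrative.
PRODUCT_SERVICE_INSTANCES = [
    "http://product-service-1:8080",
    "http://product-service-2:8080",
    "http://product-service-3:8080",
]
_round_robin = itertools.cycle(PRODUCT_SERVICE_INSTANCES)

def next_instance():
    """Simple round-robin selection, as a load balancer would do."""
    return next(_round_robin)

for _ in range(6):
    print(next_instance())  # traffic is spread evenly across every active copy
```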
Network Health
Another key part of our resiliency strategy should be to maintain network health. Remember, within our microservices architecture we have a lot more moving parts, and all our components talk to each other over the network. Therefore, we should have a central monitoring system that monitors the network for connection outages, latency issues, and timeouts, so that we can use all this information, metrics, and stats for capacity planning, i.e. to increase the bandwidth in certain parts of our network to cope with the traffic. Because the internal connectivity between these applications and services is so key, we need a proactive stance where we’re ready to scale out the network when we need to.
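As a small sketch of the kind of per-connection metrics worth collecting, assuming a simple in-process collector that would feed a central monitoring system:

```python
import time
from collections import defaultdict

# Record latency and timeout counts per downstream connection so the numbers
# can be exported to a central monitoring system for capacity planning.
metrics = defaultdict(lambda: {"calls": 0, "timeouts": 0, "total_latency": 0.0})

def timed_call(target_name, func):
    """Wrap a downstream call and record how it behaved."""
    start = time.monotonic()
    stats = metrics[target_name]
    stats["calls"] += 1
    try:
        return func()
    except TimeoutError:
        stats["timeouts"] += 1
        raise
    finally:
        stats["total_latency"] += time.monotonic() - start

# After a batch of calls, average latency and timeout rate per connection tell
# us which parts of the network need more bandwidth or capacity.
```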
Validate Connections and Data
Another key part of our resiliency strategy is that we validate connections and data, so any incoming data to our microservices should be validated. If that data is not good enough, i.e. it doesn’t meet the contract and the interface requirements, then instead of correcting that data, we return errors and tell our client applications or services that this is bad data and they need to try again.
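As a minimal sketch, assuming a hypothetical order payload with illustrative field names and rules, validation rejects bad data with an error rather than trying to fix it:

```python
# Minimal sketch of validating an incoming order payload against the contract.
# Field names and rules are illustrative assumptions.

REQUIRED_FIELDS = {"order_id", "postcode", "items"}

def validate_order(payload):
    """Return a list of validation errors; an empty list means the data is good."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - payload.keys()]
    if not isinstance(payload.get("items"), list) or not payload.get("items"):
        errors.append("items must be a non-empty list")
    return errors

def handle_request(payload):
    errors = validate_order(payload)
    if errors:
        # Don't try to correct bad data: reject it and tell the caller to retry.
        return {"status": 400, "errors": errors}
    return {"status": 202, "message": "order accepted"}

print(handle_request({"order_id": 7}))  # -> 400 with the list of problems
```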
The same applies to microservices that receive their data via asynchronous communication, through components like the message broker. If a message is in a bad format, a format we don’t understand, we reject that message, and most message brokers support the idea of a dead‑letter queue where we can place messages that are unprocessable. Another important thing is centralized security. Each of our microservices should validate all incoming connections in terms of authentication and authorization, i.e. is the user that’s trying to connect valid, and do they have the right privileges to access specific data and functionality within this microservice?
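As a sketch of rejecting unprocessable messages, assuming a RabbitMQ broker via the pika library and an orders queue configured with a dead-letter exchange so that rejected messages land on a dead-letter queue:

```python
import json
import pika  # assumes RabbitMQ, with the "orders" queue set up to route
             # rejected messages to a dead-letter queue via a dead-letter exchange

def process_order(order):
    """Hypothetical business logic for a well-formed order message."""
    print("processing", order["order_id"])

def on_message(channel, method, properties, body):
    try:
        order = json.loads(body)           # anything unparseable is rejected
        process_order(order)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except (json.JSONDecodeError, KeyError):
        # Bad format: don't requeue it, let the broker move it to the dead-letter queue.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="orders", on_message_callback=on_message)
channel.start_consuming()
```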
Proactive Maintenance
The next part of our resiliency strategy is a bit of a given. Because we now have such a complex architecture, we should have proactive maintenance: data maintenance, log and data archiving, capacity planning, scaling out, increasing availability, and most important of all, central logging and monitoring so that we can proactively foresee issues that are currently brewing.
Conclusion
In summary, embracing the resiliency design principle in microservices architecture is essential for delivering reliable and available software. It involves designing for failure, using resiliency patterns, asynchronous communication, active backups, network health monitoring, data validation, and proactive maintenance. By implementing these strategies, organizations can ensure their microservices architecture remains robust and responsive, even in challenging circumstances.