Make the API gateway of a microservice architecture more robust

My microservice architecture consists of several different web services that run in Docker containers and are accessible via REST. My frontend web application accesses these microservices through an API gateway, which is implemented with Spring Boot (it is also a web service; it just acts as a single interface to the others). Sometimes one or several of the backend web services go offline or become unavailable while processing large amounts of data. In that case the frontend has to wait a long time for a timeout or an error message from the API gateway. Because the backend web services also communicate with each other through the gateway, a failure in one can lead to long wait times in another, which exacerbates the problem. I would like to know which solutions could make my architecture more robust and increase fault tolerance at the API gateway.
1 answer

Add Hystrix

Hystrix is a fault tolerance library developed by Netflix that uses the circuit breaker pattern to increase the fault tolerance of your Java code. It works by wrapping your code in so-called commands and providing fallback methods.
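To illustrate the idea of a command with a fallback, here is a minimal plain-Java sketch that does not use Hystrix at all. The class name FallbackCommand and the method execute are made up for this example; Hystrix's real implementation is considerably more elaborate. The sketch only shows the principle: run the call on another thread, wait a bounded time, and return the fallback if the call fails or is too slow.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not part of Hystrix: runs a call with a timeout and a fallback.
public class FallbackCommand {

    // Daemon threads so that a hanging call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static <T> T execute(Callable<T> primary, Supplier<T> fallback, long timeoutMillis) {
        Future<T> future = POOL.submit(primary);
        try {
            // Wait at most timeoutMillis for the primary call to finish.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback.get();
        } catch (ExecutionException | TimeoutException e) {
            // The call threw an exception or took too long: stop waiting and fall back.
            future.cancel(true);
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // A call that answers in time returns its own result.
        System.out.println(execute(() -> "live response", () -> "cached response", 200));

        // A call that is too slow is abandoned and the fallback is used instead.
        System.out.println(execute(() -> {
            Thread.sleep(2000); // simulates an overloaded backend service
            return "live response";
        }, () -> "cached response", 200));
    }
}
```

The caller never waits longer than the timeout, which is exactly the property the frontend in the question is missing.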

Hystrix is integrated in Spring Cloud through the spring-cloud-starter-hystrix Maven artifact.

The Spring integration allows you to declare your service methods as Hystrix commands with annotations, for example:

    // TIMEOUT must be a compile-time String constant, e.g. "1000"
    @HystrixCommand(fallbackMethod = "fallback", commandProperties = {
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = TIMEOUT)
    })
    public Object callService() {
        // insert code to send a request to one of your microservices, which could fail, here
        return null; // placeholder
    }

    private Object fallback() {
        // insert code to execute if the service fails or takes too long to answer here
        return null; // placeholder
    }

With the above code snippet you declare a timeout for your method, after which the fallback method is executed. The fallback method is also executed if an exception occurs, or if too many recent calls have failed and the circuit is therefore open (circuit breaker pattern). This gives you fail-fast semantics, and the failing or overwhelmed microservice you are trying to call is given time to recover.
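The circuit breaker behaviour can be sketched in a few lines of plain Java. This is a simplified illustration, not Hystrix's actual implementation: the class SimpleCircuitBreaker and its parameters are hypothetical, and it counts consecutive failures for simplicity, whereas Hystrix decides based on the error rate within a rolling time window. For brevity the whole call is synchronized, which a real implementation would avoid.

```java
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker for illustration only.
public class SimpleCircuitBreaker {
    private final int failureThreshold; // failures in a row before the circuit opens
    private final long openMillis;      // how long the circuit stays open
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (open && System.currentTimeMillis() - openedAt < openMillis) {
            // Circuit is open: fail fast without touching the struggling service.
            return fallback.get();
        }
        // Circuit is closed, or the open period has elapsed and one trial call is allowed.
        try {
            T result = primary.get();
            consecutiveFailures = 0;
            open = false; // a successful call closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                // Open the circuit, or re-open it after a failed trial call.
                open = true;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }

    public synchronized boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 5000);
        Supplier<String> failing = () -> {
            throw new IllegalStateException("service down");
        };
        for (int i = 0; i < 3; i++) {
            // The third call no longer reaches the failing service at all.
            System.out.println(breaker.call(failing, () -> "fallback") + ", open=" + breaker.isOpen());
        }
    }
}
```

While the circuit is open, callers get the fallback immediately, so a slow service no longer ties up the gateway, and the service itself receives no traffic until the open period has passed.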