Best practices for the retry pattern

harish bhattbhatt
6 min read · Apr 11, 2021


Retries are a core resiliency pattern that helps improve service availability by re-attempting failed operations. Enabling an application to handle transient failures, by transparently retrying an operation that failed while connecting to a service or network resource, can improve its stability. Any modern cloud-native service or application needs to handle transient faults. Transient faults are temporary faults that usually resolve within a few milliseconds to a few seconds; examples include rate limiting by a cloud service, a momentary loss of network connectivity, or a timeout because the server is busy. Retrying the operation after a sensible delay therefore has a fair chance of success.

If we do not retry on transient errors, we may return errors to our clients far more often than we would like. This frustrates end users and can cost us customers. Beyond that, an SLA defined in terms of the percentage of successful requests might be breached because of these transient errors.

Now that we understand why retries are critical for cloud applications, let's jump into some best practices for implementing them.

Understand whether the failed operation is suitable for retry

  • Check the response and extract the status code and/or error code, consult the documentation to learn the type of error, and ask: what is the chance of the operation succeeding if retried? Is it a transient error? Only if the answer is "yes" does it make sense to retry. For example, if we are trying to insert a row into a table and that table does not exist, there is no point in retrying the operation. In short, don't retry permanent errors (a minimal sketch of such a check follows this list).
  • Do not retry if the downstream service is overloaded. Check the response/error code for the information you need.
  • In general, retries should be done only when we know their full impact or are aware of the entire life cycle of the operation in question. For example, suppose we have created a notification service (SMS/e-mail) that is a wrapper over SendGrid, Twilio, AWS SNS, etc., and there are N clients that consume it from their respective backend services and apps. In this case, if SendGrid throws a transient error, it might be more appropriate for the client applications to retry rather than the notification service.
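
The retryability check above can be expressed as a small helper. This is a minimal sketch assuming an HTTP-style client; the specific status-code sets and the helper name `is_retryable` are illustrative, not taken from any particular library.

```python
# Minimal sketch: classify an HTTP-style failure as retryable or not.
# The status-code choices below are illustrative assumptions.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}   # timeouts, rate limiting, overload
PERMANENT_STATUS = {400, 401, 403, 404, 409}        # bad request, auth errors, missing resource

def is_retryable(status_code: int) -> bool:
    """Return True only when a retry has a fair chance of succeeding."""
    if status_code in PERMANENT_STATUS:
        return False    # e.g. inserting a row into a table that does not exist
    return status_code in RETRYABLE_STATUS
```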

Use exponential back-off for retries

  • Always, always use exponential back-off for retries. Exponential back-off increases the time between retry attempts: for example, if we decide to use 3 retry attempts, the first attempt might come 1 second after the failure, the next after 2 seconds, and the last after 5 seconds. This strategy gives the downstream service a breather to recover from failures or high load.
  • Use randomization (jitter) with exponential back-off, rather than having every retry attempt follow the same schedule (a sketch follows this list). For example, one instance may retry the operation after 1 second, 2 seconds, 5 seconds, and so on, while another instance may retry after 2 seconds, 4 seconds, 7 seconds, and so on.
  • Depending on the operation, one immediate retry can be attempted before moving to exponential back-off, but bear in mind: no more than one immediate retry attempt.
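
Below is a minimal sketch of exponential back-off with "full jitter". The `call_service()` operation and `TransientError` exception in the usage comment are placeholder names, not part of any real library.

```python
import random

def backoff_delays(attempts: int = 3, base: float = 1.0, cap: float = 30.0):
    """Yield exponentially growing delays with full jitter.

    With base=1 the deterministic schedule would be 1s, 2s, 4s, ...;
    drawing a random value below each bound spreads retries from
    different instances so they do not hit the service at the same moment.
    """
    for attempt in range(attempts):
        bound = min(cap, base * (2 ** attempt))
        yield random.uniform(0, bound)   # "full jitter" variant

# Usage sketch (call_service and TransientError are assumed names):
# for delay in backoff_delays():
#     try:
#         result = call_service()
#         break
#     except TransientError:
#         time.sleep(delay)              # requires "import time"
```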

Determine the number of retry attempts and the interval between them

  • Do not retry indefinitely. Deciding on the number of retries and the interval between them is a difficult problem, and it depends on the type of operation being attempted. Use a finite number of retries, and use a circuit breaker after that so the downstream service can recover.
  • If the operation is part of a user interaction, the number of retries should be limited (read: 3 to 5) and the interval between retries should be in the range of a few milliseconds to seconds. Overall, the entire operation, including retries, should not take more than a few seconds.
  • If the operation is part of a background process, such as a data pipeline or a scheduled execution, the number of retries can be higher and the interval between retries can be longer.
  • Check the failure response coming from the operation you are attempting; it may provide a "Retry-After" header or similar. If it exists, use it for your retry interval (see the sketch after this list). Never forget to check the operation's error code, response, and headers after every retry attempt, as the downstream service might have changed them and may now be returning an error that is not retriable.
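
A small sketch of honoring the server's hint, assuming an HTTP-style "Retry-After" header that carries a number of seconds (the HTTP-date form of the header is not handled here):

```python
def next_delay(attempt: int, response_headers: dict, base: float = 1.0) -> float:
    """Prefer the server-provided Retry-After hint over our own back-off."""
    retry_after = response_headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)      # "Retry-After: 5" -> wait 5 seconds
        except ValueError:
            pass                           # HTTP-date form not handled in this sketch
    return base * (2 ** attempt)           # fall back to exponential back-off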

Ensure that retries are not done on operations where the consumer is no longer waiting for a response

  • Try to get the value of the "timeout" header from the service consumer and, if it is not supplied, use a sensible default for the operation. While retrying, check that the total time of the operation does not exceed the specified timeout; it makes no sense to retry an operation after the consumer has already decided the request failed due to a timeout (a deadline-aware sketch follows this list).
  • Depending on the operation, there may be a way to receive a cancellation signal or to learn that the consumer is no longer waiting for a response (e.g., it has closed the connection). If those signals are available, use them and do not retry.
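
A deadline-aware sketch; `operation` and `TransientError` are placeholders, and the key point is that each retry checks the caller's remaining time budget before sleeping again.

```python
import time

class TransientError(Exception):
    """Placeholder for whatever transient failure the client raises."""

def call_with_deadline(operation, timeout_s: float, delays=(1, 2, 5)):
    """Retry `operation` only while the caller's overall timeout allows it."""
    deadline = time.monotonic() + timeout_s
    for delay in delays:
        try:
            return operation()
        except TransientError:
            if time.monotonic() + delay >= deadline:
                raise          # the consumer has (or will have) given up; stop retrying
            time.sleep(delay)
    return operation()         # final attempt within the deadline
```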

Consider having a server-wide retry budget

  • One of the major challenges with retries is that they can cause cascading failures. It is important to slow down and give the downstream service a breather to recover from failures. Depending on the service, a retry budget can be defined, for example no more than 50 retries per minute; if the number of retries exceeds the budget, fail the request immediately. The retry budget can be defined at multiple levels, i.e. per minute, per ten minutes (a simple sketch follows this list).
  • Use a circuit breaker if the retry budget is exceeded.
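
A minimal sketch of a process-wide retry budget using a sliding window; the 50-per-minute numbers mirror the example above and are illustrative only.

```python
import threading
import time

class RetryBudget:
    """Allow at most `limit` retries per `window_s` seconds across the process."""

    def __init__(self, limit: int = 50, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._timestamps = []
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        with self._lock:
            # Drop retries that have fallen out of the window.
            self._timestamps = [t for t in self._timestamps if now - t < self.window_s]
            if len(self._timestamps) >= self.limit:
                return False   # budget exhausted: fail fast / open the circuit breaker
            self._timestamps.append(now)
            return True
```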

Avoid amplifying retries by issuing retries at multiple levels

  • Think about the service holistically and decide whether we really need to retry at a given level. When in doubt, do not retry and let the consumer of the operation make that call.
  • In general, retries done at multiple levels can lead to cascading failures and degradation of service. For example, if the database is not responding to a request, then the backend, the frontend, and the JavaScript client may each retry 3 times, which results in 27 retry attempts for one operation. Avoid this anti-pattern; the behavior is undesirable and can overload the database and backend services.

Monitor and log the retries and operations

  • Log exception details, fault codes, retry attempts, and the time taken by the API (including the number of retries and the time taken by each retry), as well as what caused the retry (typically the error message received from the dependent service). Analyze this data to understand the root cause (a logging sketch follows this list).
  • Implement a telemetry and monitoring system that can raise alerts when the number and rate of failures goes beyond an identified limit.
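
A small sketch of what could be recorded per retry using the standard `logging` module; the field layout is an assumption, to be adapted to whatever telemetry system is in place.

```python
import logging

logger = logging.getLogger("retries")

def log_retry(operation: str, attempt: int, delay_s: float, error: Exception) -> None:
    """Record enough context per retry to analyze root causes later."""
    logger.warning(
        "retrying %s (attempt %d, next delay %.1fs): %s: %s",
        operation, attempt, delay_s, type(error).__name__, error,
    )
```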

Check on operations that are failing consistently

  • Set up alerts based on the operation, API, or server that is showing a high number of failures.
  • Try to make the health check API intelligent enough to understand the internal state of the service and mark itself unavailable if it is not in a state where it can serve requests. For example, if the service has lost connectivity to the database and writing to the database is required for success, it makes sense to fail the health check even though the service process itself is up (a sketch of such a check follows this list).
  • Subscribe to health checks for dependent resources and check the metrics available (e.g. throttling, CPU/memory overload).
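
A sketch of a "deep" health check that fails when a hard dependency is unreachable even though the process itself is up; `db.ping()` is an assumed method on a hypothetical database client.

```python
def health_check(db) -> tuple[int, str]:
    """Report unhealthy when a required dependency cannot be reached."""
    try:
        db.ping()                           # assumed connectivity check
    except Exception:
        return 503, "database unreachable"  # take the instance out of rotation
    return 200, "ok"
```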

Other considerations

  • When retrying, consider the scope of the retry; if there are multiple operations as part of a request, check at which level retries are required.
  • Never wait synchronously between retry attempts. For example, use "Task.Delay" instead of "Thread.Sleep" in .NET (a Python equivalent is sketched after this list).
  • Ensure that the downstream operation has no side effects from being executed multiple times. For example, if you are incrementing a number, a retry might introduce a duplicate count. To prevent this, design each step as an idempotent operation.
  • Consider how your retry strategy may affect other tenants in a shared application, or when using shared resources and services. Use rate limiting and a retry budget per tenant.
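
The .NET advice above (Task.Delay instead of Thread.Sleep) maps in Python to awaiting `asyncio.sleep` rather than calling `time.sleep`, so the event loop is not blocked while waiting. A sketch, with `operation` and `TransientError` as placeholder names:

```python
import asyncio

class TransientError(Exception):
    """Placeholder for whatever transient failure the client raises."""

async def retry_async(operation, delays=(1, 2, 5)):
    """Wait between attempts without blocking the event loop."""
    for delay in delays:
        try:
            return await operation()
        except TransientError:
            await asyncio.sleep(delay)   # non-blocking wait between retries
    return await operation()             # final attempt
```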

Closing remarks

Use the retry pattern when an application could experience transient faults as it interacts with a remote service or accesses a remote resource. But do pay attention to the response/error code returned by downstream services before deciding to retry. Without proper care, you may fall victim to retry anti-patterns such as retry amplification, cascading failures, blocked threads, and noisy neighbors.

Do not use the retry pattern when a fault is likely to be long-lasting, because retrying can hurt the responsiveness of the application.
