Timeout in a distributed system (Microservices)

harish bhattbhatt
May 6, 2024

Imagine you schedule a meeting with someone and agree on a specific time. Just like you wouldn’t wait indefinitely if they don’t show up after a reasonable amount of time, a timeout in communication acts as that “reasonable amount of time.” It sets a limit on how long you’ll wait for a response or completion of a task before moving on.

Similar to how you might have other commitments or tasks after the meeting, a timeout ensures that other processes or requests aren’t held up indefinitely if one gets stuck or delayed.

Always implement timeouts for external dependencies like databases and microservices. If supported, propagate the timeout value to downstream requests.

Empowering Clients with Timeouts in Microservice Architectures

When building HTTP or gRPC microservices, ensure they accept and adhere to specified timeouts.

gRPC has deadlines built into the protocol, but HTTP defines no standard way for a client to tell an API how long it is willing to wait. You can still support client-controlled timeouts in an HTTP API with the following established approaches:

1. Custom Headers:

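  • Define a custom header in your API that clients can use to specify their desired timeout. For example, you could use a header named X-Timeout or Timeout. State the unit (seconds or milliseconds) explicitly so clients and servers interpret the value the same way.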
  • If no timeout is provided, set a reasonable default value (e.g., 30 seconds) on the server side.

2. Server-Side Timeouts:

  • Set server-side timeouts for individual API endpoints or globally. This ensures that long-running requests don’t block other clients.
  • Use this approach to prevent resource exhaustion on your server.
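
The two approaches combine naturally: read the client's header, fall back to the server default, and cap the value so a client cannot request an unbounded wait. The following is a minimal sketch under assumptions, not a definitive implementation. The header name X-Timeout and the 30-second default come from the text above; the unit (seconds), the 60-second cap, and the function name `resolve_timeout` are illustrative choices.

```python
DEFAULT_TIMEOUT_S = 30.0  # server-side default, as in the example above
MAX_TIMEOUT_S = 60.0      # assumed upper bound; pick one that fits your service


def resolve_timeout(headers: dict) -> float:
    """Return the timeout in seconds to use for this request.

    Reads a hypothetical X-Timeout header (value in seconds). Missing,
    malformed, or non-positive values fall back to the server default,
    and large values are capped at MAX_TIMEOUT_S.
    """
    raw = None
    for name, value in headers.items():
        if name.lower() == "x-timeout":  # header names are case-insensitive
            raw = value
            break
    if raw is None:
        return DEFAULT_TIMEOUT_S
    try:
        requested = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_S
    if not requested > 0:  # also rejects NaN
        return DEFAULT_TIMEOUT_S
    return min(requested, MAX_TIMEOUT_S)
```

Falling back to the default on a malformed value is one option; rejecting the request with a 400 response is an equally reasonable design if you prefer strict validation.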

Best Practices and Considerations

  • Clarity and Consistency: Document your API’s timeout mechanism in your API documentation, including the supported header (if applicable), default timeout value, and any potential error codes returned when timeouts occur.
  • Reasonableness: Set appropriate default timeout values based on your API’s typical response times. Avoid excessively long timeouts that could lead to resource issues.
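  • Error Handling: Provide informative error responses when requests exceed the specified or default timeout. 504 Gateway Timeout fits the case where a downstream dependency did not respond in time. Note that 408 Request Timeout, although often used here, strictly means the server timed out waiting for the client to finish sending the request.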
  • Client-Side Implementation: Clients should be able to handle timeout errors gracefully, potentially implementing retry logic or fallback mechanisms.

Propagating Timeouts Downstream

  • Once your API receives the timeout value (either from a custom header or default), it should propagate that value to all subsequent operations within the request execution.
  • This includes database queries, downstream service calls, and any other time-consuming tasks.
  • Database Timeouts: Many databases support query timeouts. When executing database queries using the provided client libraries or database-specific syntax, you can specify a timeout value.

PostgreSQL

https://navicat.com/en/company/aboutus/blog/2237-setting-query-timeouts-in-postgresql

DynamoDB

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html#withRequestTimeout-int-

If the database doesn’t support timeouts, consider using client libraries like Polly (.NET) or Resilience4j (Java) to implement timeouts and fallback mechanisms.

Polly (.NET)

Resilience4j

https://resilience4j.readme.io/docs/timeout
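
The timeout-plus-fallback idea behind these libraries can also be sketched in Python with the standard library's `asyncio.wait_for`. This is an illustration of the pattern, not a substitute for Polly or Resilience4j. The helper name `call_with_timeout` and the stand-in coroutines `slow_query` and `fast_query` are hypothetical.

```python
import asyncio


async def call_with_timeout(operation, timeout_s, fallback):
    """Await operation(), but give up after timeout_s seconds.

    On timeout, asyncio.wait_for cancels the pending call and this
    helper returns the fallback value instead of raising.
    """
    try:
        return await asyncio.wait_for(operation(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def slow_query():
    # Stand-in for a dependency with no timeout support of its own.
    await asyncio.sleep(1.0)
    return "fresh result"


async def fast_query():
    # Stand-in for a dependency that answers immediately.
    return "fresh result"
```

Calling `asyncio.run(call_with_timeout(slow_query, 0.05, "fallback"))` gives up after 50 ms and returns the fallback value, while the same call with `fast_query` returns the real result.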

Managing Remaining Time

  • Variable for Remaining Time: Track a “remaining time” value (or a more descriptive name like “available execution time”) for each request, initialized from the client-specified or default timeout.
  • Reduce Timeout for Subsequent Operations: As each step in your API execution consumes time, subtract the elapsed time from the remaining time and use the updated value as the timeout for the next operation.
  • Return an error response to the client if the remaining time reaches zero before an operation completes.
  • This ensures that the server doesn’t perform expensive operations that exceed the client-specified or default timeout.
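
The remaining-time bookkeeping described above can be sketched as a small deadline object that every step consults before it starts. This is a minimal sketch under assumptions: the names `Deadline`, `DeadlineExceeded`, and `handle_request` are hypothetical, and `query_db` and `call_service` stand in for whatever operations your handler performs.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's time budget is used up."""


class Deadline:
    def __init__(self, timeout_s, clock=time.monotonic):
        # A monotonic clock is not affected by wall-clock adjustments.
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        """Return the remaining seconds, or raise if none are left."""
        left = self.remaining()
        if left <= 0:
            raise DeadlineExceeded("request time budget exhausted")
        return left


def handle_request(timeout_s, query_db, call_service):
    """Run two steps, giving each only the time that is still left."""
    deadline = Deadline(timeout_s)
    rows = query_db(timeout=deadline.check())
    # Time spent in query_db has already been deducted here.
    return call_service(rows, timeout=deadline.check())
```

The value returned by `deadline.check()` is also what you would forward to a downstream service, for example in the timeout header of the outgoing request, so that the callee never works past the caller's budget.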

Benefits:

  • Client Control: Clients have flexibility in setting timeouts based on their needs.
  • Error Handling: Timeouts are handled gracefully, preventing resource exhaustion and ensuring SLOs are met.
  • Resource Management: Resources are freed up faster for other requests, improving overall system performance.

By implementing these practices, you can create a robust and flexible API that effectively handles client-controlled timeouts, ensuring a smooth and efficient user experience.

Client-Side Timeout Handling Approaches

Here are the common approaches clients can employ to handle timeouts effectively:

Retry with Backoff:

  • This is the most common approach. Follow the service's retry recommendations, use exponential backoff, and limit the number of retries (e.g., 3–5).
  • Implement a circuit breaker to prevent overloading downstream systems during failures.
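
A bounded retry loop with exponential backoff can be sketched as follows. This is an illustrative sketch, not a production retry policy. The function name `retry_with_backoff` and the default values are assumptions, and the random jitter is an addition not mentioned above, included because it keeps many clients from retrying at the same instant. A circuit breaker is not shown.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Only TimeoutError is retried. The wait before each retry is a random
    value between 0 and an upper bound that doubles after every failed
    attempt (0.1 s, 0.2 s, 0.4 s, ...), capped at max_delay_s.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to the caller
            upper = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))
```

Retrying is only safe for idempotent operations; for anything else, check whether the first attempt actually took effect before sending it again.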

Context-Aware Retries:

  • Use the Retry-After response header for server-defined retry intervals.
  • Respect any signal that a request should not be retried, such as a service-specific “no-retry” header from an overloaded server.
  • Check operation status with an endpoint before retrying (avoid unnecessary retries).
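
The Retry-After header carries either a number of seconds or an HTTP date, so a client needs to handle both forms. The following sketch uses only the Python standard library. The function name `retry_after_seconds` is hypothetical, and returning `None` for unparseable values is an assumed convention that lets the caller fall back to its own backoff schedule.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into seconds to wait.

    Accepts delay-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). Returns None if the value
    cannot be parsed.
    """
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```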

Defaults, Alternatives, or Cached Content:

  • Provide default responses (e.g., generic ranking instead of user-specific).
  • Use stale cache data if available.
  • Employ fallback services (e.g., secondary weather service).
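
The first two fallbacks can be sketched as a single lookup that prefers a fresh response, then stale cache data, then a generic default. This is a minimal sketch under assumptions: `get_ranking`, the `fetch` callable, and the plain dictionary used as a cache are all hypothetical stand-ins for your own client and cache.

```python
def get_ranking(user_id, fetch, cache, default):
    """Return (value, source) for a user, degrading gracefully on timeout.

    fetch(user_id) is expected to raise TimeoutError when the ranking
    service does not answer in time.
    """
    try:
        value = fetch(user_id)
    except TimeoutError:
        if user_id in cache:
            return cache[user_id], "stale-cache"  # last known good value
        return default, "default"                 # generic, not user-specific
    cache[user_id] = value  # remember the fresh value for later fallbacks
    return value, "fresh"
```

Returning the source alongside the value lets the caller log or display that the content may be outdated.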

Error Propagation (Least Preferred):

  • Only propagate the error to the caller if the consequences are known.
  • This is generally a last resort, used when no better option exists (or when the developer did a poor job of implementing the solution).

Remember, the optimal approach depends on the specific context and desired user experience.

Additional Considerations for Timeout Handling

Choosing the right timeout value requires balancing user expectations with technical considerations.

  • Too short a timeout leads to false positives and unnecessary retries, while too long a timeout ties up resources and degrades performance and user experience. Consider domain knowledge, historical data, and load testing results to find the optimal balance.
  • Continuously monitor timeout occurrences to identify potential bottlenecks or service degradation. Set up alerts to notify developers when timeouts exceed predefined thresholds.
  • Thoroughly test timeout handling mechanisms during development and integration testing to ensure they function as expected under various conditions.
  • A microservice might receive a partial response within the timeout window. Returning that partial result lets you give users some information even if the full response is delayed.
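
One way to turn historical data into a starting timeout value is to take a high percentile of observed latencies and add some headroom. The sketch below is a heuristic under assumptions, not a rule: the function name `suggest_timeout`, the 99th percentile, and the 1.5x headroom factor are illustrative choices that you should tune against your own load tests.

```python
import statistics


def suggest_timeout(latencies_s, percentile=99, headroom=1.5):
    """Suggest a timeout from observed latencies (in seconds).

    Takes the given percentile (1 to 99) of the samples and multiplies
    it by a headroom factor so normal slow requests are not cut off.
    Requires at least two samples.
    """
    # n=100 yields 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return cuts[percentile - 1] * headroom
```

Recompute the value periodically, since latency distributions shift as traffic and dependencies change.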
