The Microservices Dialogue: Navigating the communication landscape

Distributed systems such as microservices rely heavily on collaboration between services. It’s always important to use the appropriate tool, and communication is no exception.
How do microservices communicate?
There are two basic ways to communicate across microservices: synchronous and asynchronous.
Request/reply (synchronous): services communicate by making a request and receiving a response in return. Examples:
- Hypertext Transfer Protocol (HTTP): a protocol for exchanging data (text-based in HTTP/1.x).
- Protocol Buffers (ProtoBuf) and gRPC: ProtoBuf is an interface description language and binary serialization format designed and open-sourced by Google; gRPC is an RPC framework that uses it by default.
- GraphQL: provides an expressive language for clients to describe the data they want. It’s useful for federating data from multiple sources, and it’s commonly used as a backend for frontend (BFF) layer between backend services and client applications.
- Simple Object Access Protocol (SOAP): it’s quite heavy and not very popular, but a lot of systems (mostly legacy ones) still use it.
- Apache Avro: a row-oriented RPC and data serialization framework.
Publish/subscribe (asynchronous): a microservice publishes a message that multiple other services can receive. Examples:
- RabbitMQ: open source message broker software.
- Azure Service Bus: enterprise integration serverless message broker.
- Apache Kafka: a distributed event store and stream-processing platform.
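To make the difference between the two styles concrete, here is a minimal in-process sketch in Python. It does not use any of the protocols or brokers listed above: `get_price`, `InMemoryBroker` and the topic name `order.created` are hypothetical names made up for illustration, and the toy broker delivers messages immediately, whereas a real broker would queue them and deliver them later.

```python
from collections import defaultdict
from typing import Callable, Dict, List


# Request/reply: the caller asks one service and waits for its answer.
def get_price(product_id: str) -> float:
    """Hypothetical pricing service; a real one would sit behind HTTP or gRPC."""
    prices = {"book": 12.5, "pen": 1.2}
    return prices[product_id]


# Publish/subscribe: the publisher hands a message to a broker and does not
# wait for an answer; every subscriber of the topic gets the message.
class InMemoryBroker:
    """Toy stand-in for a real broker such as RabbitMQ or Kafka (assumption)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


emails: List[str] = []
shipments: List[str] = []

broker = InMemoryBroker()
broker.subscribe("order.created", lambda m: emails.append(m["order_id"]))
broker.subscribe("order.created", lambda m: shipments.append(m["order_id"]))

price = get_price("book")  # one caller, one reply
delivered = broker.publish("order.created", {"order_id": "o-1", "total": price})
```

Note that the publisher does not know who the subscribers are or how many exist; adding a third consumer only requires another `subscribe` call.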
How to choose the communication type?
There’s no clear answer as to which is better. The synchronous type is good for real-time communication, for example retrieving data from another service via request/response. On the other hand, I prefer the asynchronous type for long-running operations. Both are good in their own ways. You might end up combining multiple communication styles and protocols in a single system.
A good approach is to narrow down your choices by answering questions like:
- What is the data format?
- Is the order of the messages important?
- What is the size of the message?
- How many messages do you intend to send per second/minute etc.?
- Can you tolerate message loss? How can you recover from it?
- Do you need a response? (direction of communication: one-way/bidirectional)
- Is the communication one-to-one or one-to-many?
- Does the service need to be highly available and fault tolerant?
- What about the cost and operational effort of each solution?
P.S.: It’s okay not to have answers to all of them. Also, you don’t need to do this on a regular basis.
How to increase fault tolerance?
There are different techniques and patterns that help increase the fault tolerance of your system, but make sure they actually solve a real problem. This is just a high-level overview to give you an idea; the implementation itself might differ based on your needs.
- Listen For Yourself: the publisher service subscribes to its own events. It helps prevent/identify transient issues. For example: the message never even gets to the queue because the service itself is unavailable.
- Outbox Pattern (transactional messaging): we insert events into the database as part of the update transaction, then fetch the events from the database in a separate process and publish them to subscribers. It helps reduce potential data inconsistencies, such as when an error occurs and the change is not committed but the event has already been published.
- Service Mesh Pattern: an infrastructure layer that controls/mediates communication between services (distributed tracing, health checks, metrics etc.). It helps with observability, testing, debugging (you can simulate different scenarios) and more. On the other hand, it introduces extra complexity and a single point of failure (from a configuration perspective), and it increases resource consumption.
- Circuit Breaker: prevents an application from repeatedly trying to execute an operation that’s likely to fail.
- Retry: enables an application to handle transient failures when it tries to connect to a service or network resource, by transparently retrying a failed operation. I prefer to use it in combination with a circuit breaker and an increased timeout for each retry.
- Throttling: controls the consumption of resources used by an instance of an application. Introducing limits to your API can help you increase control over resource usage.
- Claim check (zero-payload): publish only the name/identifier of the message. The payload itself is saved in a database/cache or can be fetched via HTTP calls. It’s less “popular” than the others, but it’s good for operations with complex data structures or large amounts of data.
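As an illustration of two of the patterns above, here is a rough Python sketch that combines Retry with a Circuit Breaker. It is only one possible shape, not a reference implementation: `CircuitBreaker`, `call_with_retry`, `flaky` and all parameter names and thresholds are hypothetical, and the growing delay between attempts (exponential backoff) is just one way to read "increased timeout" per retry.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after` seconds have passed and a trial call is allowed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: fall through and allow a trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retry(breaker: CircuitBreaker, operation: Callable[[], object],
                    attempts: int = 3, base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> object:
    """Retry failures through the breaker, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return breaker.call(operation)
        except CircuitOpenError:
            raise  # retrying is pointless while the circuit is open
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky() -> str:
    """Hypothetical remote call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record the delays instead of really sleeping
breaker = CircuitBreaker(max_failures=5, reset_after=30.0)
result = call_with_retry(breaker, flaky, attempts=3, base_delay=0.1,
                         sleep=delays.append)
```

The retry loop handles short, transient failures, while the breaker stops the retries from hammering a dependency that stays down. The `clock` and `sleep` parameters exist only so the sketch can be exercised without real waiting.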
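The Outbox Pattern from the list above can also be sketched in a few lines. The example below uses Python's built-in `sqlite3` module with an in-memory database purely for illustration; the table layout, the function names (`create_schema`, `place_order`, `relay_outbox`) and the `order.created` topic are assumptions, and the `publish` callback stands in for whatever broker client a real system would use.

```python
import json
import sqlite3
from typing import Callable, List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn: sqlite3.Connection, order_id: str, total: float) -> None:
    """Write the order and its event in one transaction: both commit or neither."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"order_id": order_id, "total": total})),
        )


def relay_outbox(conn: sqlite3.Connection,
                 publish: Callable[[str, dict], None]) -> int:
    """Separate step: publish pending events, then mark each one as sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, "o-1", 12.5)

sent: List[Tuple[str, dict]] = []
published_count = relay_outbox(conn, lambda topic, event: sent.append((topic, event)))
```

Because an event is marked as published only after `publish` returns, a crash between the two steps leads to the event being sent again on the next run. In other words, this sketch gives at-least-once delivery, so subscribers should be able to handle duplicates.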