When subscribing to a Kafka topic, in Lagom and in general, we can face a problem where our message processing is slower than message publishing to the topic, so the consumer lag increases over time.
So the question is: how do we scale a subscriber to maximize message processing throughput and avoid lag?
A Lagom subscriber is not a cluster-based feature, meaning that subscribers initialized per service instance are not aware of each other. In deployments with multiple service instances, as mentioned in the official Lagom documentation (Subscribe to a topic), duplicate consumption is avoided by grouping subscribers using a groupId. This grouping is NOT a Lagom feature; it maps directly to the Kafka consumer group feature. By default, Lagom derives the groupId value from the service descriptor name, ensuring that all subscriber instances of one service type use the same groupId and therefore the same Kafka consumer group.
So one Lagom subscriber instance is one Kafka consumer, and the groupId identifies a Kafka consumer group.
Subscribing consists of two parts: consuming messages from topic partitions and processing them.
To scale consuming, we need to know how consume scalability is solved at the Kafka level. Kafka uses consumer groups to scale topic partition message consuming. For more details, see the blog post "Scalability of Kafka Messaging using Consumer Groups".
The consumers in a group then divides the topic partitions as fairly among themselves as possible by establishing that each partition is only consumed by a single consumer from the group.
From the quote we can conclude that we should have one Lagom subscriber instance (an instance of a Kafka consumer) per topic partition. Any subscriber beyond the number of partitions gets no partition assigned and stays idle.
So by increasing the number of topic partitions, and the number of Lagom subscriber instances to match, we can scale message consuming.
When modifying the number of topic partitions, please be aware of this: Apache Kafka – Modifying topics
We need to ensure that we instantiate one Lagom subscriber per partition. There are a few options we can use to achieve this:
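To make the "one partition, one consumer per group" rule concrete, here is a minimal sketch that models the assignment in plain Java. It does not use the Kafka client; the class `PartitionAssignmentSketch` and its methods are hypothetical names, and round-robin is assumed purely for illustration (Kafka's real assignors differ in detail but keep the same invariant).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of how a consumer group spreads partitions over its consumers. */
final class PartitionAssignmentSketch {

    /**
     * Returns, for each consumer index, the partition numbers it owns.
     * Each partition is handed to exactly one consumer of the group.
     */
    static List<List<Integer>> assign(int partitions, int consumers) {
        if (consumers <= 0) {
            throw new IllegalArgumentException("at least one consumer is required");
        }
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        // Illustrative round-robin: partition p goes to consumer p mod consumers.
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    /** Counts consumers that received no partition and therefore sit idle. */
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        System.out.println(assign(4, 2));        // [[0, 2], [1, 3]]
        System.out.println(assign(4, 4));        // [[0], [1], [2], [3]]
        System.out.println(idleConsumers(4, 6)); // 2
    }
}
```

With 4 partitions, going from 2 to 4 consumers doubles the consuming parallelism, while a fifth and sixth consumer add nothing until more partitions are created.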
1. one subscriber instance per service instance
   - if we instantiate one subscriber per service instance, we need to scale the service to the number of partitions
   - PROS: no specific configuration needed
   - CONS: in some cases message processing is fast but consuming requires higher parallelism, and scaling the complete service just to achieve that can be an overhead (see #2)
2. pre-configured (static) number of subscriber instances per service instance
   - instantiating multiple subscriber instances per service instance is done simply by calling subscribe() multiple times. We can pre-configure the number of subscriber instances and call subscribe() in a loop.
   - PROS: no overhead as indicated in #1
   - CONS: when scaling service instances, the total number of subscriber instances is multiplied (service instances × subscribers per instance) and is hard to control. But if you do not change the number of service instances, this option is sufficient because it is easy to implement
3. cluster-based subscriber instances
   - the idea behind this is to configure, at the cluster level, the number of subscribers (equal to the number of partitions)
   - we should have a cluster-based coordinator that tracks the number of active subscribers in the service cluster and scales them accordingly
   - the implementation would be similar to the Lagom ReadSide processor implementation, where Cluster Sharding is used and the number of ReadSide processors is determined by the number of event shards
   - this implementation is outside the scope of this blog 🙂
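The second option (a static number of subscribers per service instance) can be sketched roughly as follows. This is only an outline under assumptions: `StaticSubscribersSketch`, `startSubscribers` and the `subscribeOnce` supplier are hypothetical names, and the supplier stands in for whatever subscribe() call chain your service actually uses; it is not Lagom API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Sketch: start a pre-configured number of subscribers in one service instance. */
final class StaticSubscribersSketch {

    /**
     * Invokes the subscribe action `count` times and collects the returned handles.
     * `subscribeOnce` is a hypothetical stand-in for the real subscribe() call.
     */
    static <H> List<H> startSubscribers(int count, Supplier<H> subscribeOnce) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        List<H> handles = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            handles.add(subscribeOnce.get()); // one call = one more consumer in the group
        }
        return handles;
    }

    /** Consumers joining the group across the whole deployment. */
    static int totalConsumers(int serviceInstances, int subscribersPerInstance) {
        return serviceInstances * subscribersPerInstance;
    }
}
```

In a real service, `count` would typically come from configuration, and `totalConsumers` shows why the count is hard to control: 3 subscribers per instance become 12 consumers once the service runs on 4 instances.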
Scaling message processing depends solely on Lagom: it is achieved by scaling Lagom service instances.
Message consuming and processing are closely connected, so we need to find the right balance between scaling Lagom service instances and scaling the number of subscriber instances per service instance.
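One simple way to reason about that balance is to fix the number of partitions and derive the subscribers per service instance from it. The helper below is a sketch under that assumption; `SubscriberSizingSketch` and its methods are hypothetical names, not part of Lagom or Kafka.

```java
/** Sketch: size subscribers per service instance so every partition has a consumer. */
final class SubscriberSizingSketch {

    /** Smallest per-instance subscriber count that covers all partitions (ceiling division). */
    static int subscribersPerInstance(int partitions, int serviceInstances) {
        if (partitions < 0 || serviceInstances <= 0) {
            throw new IllegalArgumentException("need partitions >= 0 and serviceInstances > 0");
        }
        return (partitions + serviceInstances - 1) / serviceInstances;
    }

    /** Subscribers that end up without a partition when the division is not exact. */
    static int idleSubscribers(int partitions, int serviceInstances) {
        return subscribersPerInstance(partitions, serviceInstances) * serviceInstances - partitions;
    }
}
```

For example, 12 partitions on 3 service instances need 4 subscribers each with none idle, while 12 partitions on 5 service instances need 3 each and leave 3 subscribers idle.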