What this usually means
The consumer is not successfully subscribing to the right topics, partitions, or offsets. Common underlying causes include a new or changed group.id (a group with no committed offsets starts from the latest offset by default), an incorrect topic subscription (a typo, or a name that no longer matches the topic the producer writes to), a consumer group stuck in a rebalance loop because session.timeout.ms is too low or processing between polls exceeds max.poll.interval.ms, or a network or firewall rule blocking the broker connection. It can also be a deserialization issue where records that fail to deserialize never reach your processing code, or a partition assignment issue where the consumer is assigned partitions that have no data.
The first ten minutes — establish facts before touching code.
- 1. Run kafka-consumer-groups --bootstrap-server <broker> --group <group> --describe to see committed offsets and lag per partition; add --members --verbose to see which member owns which partitions
- 2. Check consumer logs for WARN or ERROR around partition assignment: grep -i 'partition' consumer.log | tail -50
- 3. Verify topic exists: kafka-topics --bootstrap-server <broker> --list | grep <topic>
- 4. Try manual subscription with assign() instead of subscribe() to rule out group coordination issues
- 5. Test with a simple console consumer: kafka-console-consumer --bootstrap-server <broker> --topic <topic> --from-beginning (omit --group or use a throwaway group, so the test neither joins the production group nor commits offsets for it)
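The first check above can be scripted. The following is a minimal sketch, assuming the whitespace-separated column layout that recent kafka-consumer-groups versions print for --describe; the sample output, the function names, and the three "suspicious" rules are illustrative assumptions, not part of any Kafka tooling.

```python
from typing import Dict, List, Tuple

# Illustrative sample only; real output varies with Kafka version.
SAMPLE_OUTPUT = """
GROUP               TOPIC     PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID    HOST      CLIENT-ID
order-service-group orders-v1 0         -              0              -   consumer-1-abc /10.0.0.5 consumer-1
order-service-group orders-v1 1         120            120            0   consumer-1-abc /10.0.0.5 consumer-1
order-service-group orders-v1 2         100            350            250 -              -         -
order-service-group orders-v1 3         10             90             80  consumer-1-abc /10.0.0.5 consumer-1
"""


def parse_describe(output: str) -> List[Dict[str, str]]:
    """Turn the --describe table into one dict per partition row."""
    rows: List[Dict[str, str]] = []
    header: List[str] = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "PARTITION" in fields and "LAG" in fields:
            header = fields  # remember the column names
            continue
        # Skip informational lines that do not match the table shape.
        if header and len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def suspicious_partitions(rows: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (topic, partition, reason) for rows worth a closer look."""
    findings = []
    for row in rows:
        key = (row["TOPIC"], row["PARTITION"])
        if row["CURRENT-OFFSET"] == "-":
            findings.append((*key, "no committed offset"))
        elif row["CONSUMER-ID"] == "-":
            findings.append((*key, "no active member assigned"))
        elif row["LAG"].isdigit() and int(row["LAG"]) > 0:
            findings.append((*key, f"lag {row['LAG']}"))
    return findings


if __name__ == "__main__":
    for topic, partition, reason in suspicious_partitions(parse_describe(SAMPLE_OUTPUT)):
        print(f"{topic}[{partition}]: {reason}")
```

In practice the text would come from the command's stdout rather than a constant; a row with no committed offset on a topic that should have data is the pattern to chase first.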
The specific files, logs, configs, and dashboards that usually own this bug.
- consumer.properties file: check group.id, enable.auto.commit, auto.offset.reset, session.timeout.ms
- Consumer application logs: look for 'Assigning partitions', 'Revoking partitions', 'Rebalance' messages
- Kafka broker logs: /var/log/kafka/server.log for consumer group coordinator events
- Offset topic: __consumer_offsets (use kafka-dump-log to inspect; a group's commits live in a single partition chosen by hashing group.id, which is not necessarily partition 0)
- Producer side: check if messages are actually produced and have correct key/value serialization
- Network: test connectivity with telnet <broker> 9092
- JVM thread dump: jstack <pid> to see if consumer poll() is blocked
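To know which __consumer_offsets partition to dump for a given group, the mapping can be approximated offline. This sketch assumes the coordinator uses the absolute value of Java's String.hashCode() of the group.id, modulo offsets.topic.num.partitions (50 by default); that matches how the broker has historically chosen the partition, but treat it as an assumption and confirm against your Kafka version. Both function names are hypothetical.

```python
def java_string_hashcode(s: str) -> int:
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # emulate 32-bit overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition(group_id: str, num_partitions: int = 50) -> int:
    """Guess the __consumer_offsets partition holding this group's commits.

    Assumption: partition = abs(hashCode(group.id)) % num_partitions,
    with Integer.MIN_VALUE mapped to 0.
    """
    h = java_string_hashcode(group_id)
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % num_partitions


if __name__ == "__main__":
    print(offsets_partition("order-service-group"))
```

The result tells you which log directory (for example __consumer_offsets-<n>) to point kafka-dump-log at.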
Practical causes, not theory. These are the things you will actually find.
- auto.offset.reset set to 'latest' on a new consumer group: the consumer will only read new messages
- Topic subscription uses a regex or pattern that doesn't match the actual topic name
- Consumer group rebalancing stuck due to session.timeout.ms too low (< 10s) or max.poll.interval.ms too low
- Deserializer mismatch: key/value serializers in producer don't match deserializers in consumer
- Consumer is in a different Kafka cluster or namespace (e.g., different bootstrap.servers)
- Message headers or record filtering silently dropping messages
- Consumer thread hangs on a blocking operation inside the poll loop (e.g., database call)
Concrete fix directions. Pick the one that matches your root cause.
- Set auto.offset.reset to 'earliest' for new consumer groups that need to read existing messages
- Ensure consumer group uses the exact same topic name as produced (case-sensitive, no extra whitespace)
- Increase session.timeout.ms to at least 30s and max.poll.interval.ms above its 5-minute default (for example, 10 minutes) to prevent false rebalances
- Use assign() with specific partitions and seek() to skip to a known offset for testing
- Catch deserialization failures (org.apache.kafka.common.errors.SerializationException thrown from poll()) and log the bad record's topic, partition, and offset
- Enable consumer metrics: metrics.recording.level=DEBUG to see poll rates and lag
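Several of these fixes are plain configuration values, so they can be linted before deployment. Below is a minimal sketch, assuming a Java-style consumer.properties file; the function names are hypothetical, the thresholds are the ones this guide uses, and the fallback defaults (45s session timeout, 5min poll interval) are those of recent Java clients and may differ for your version.

```python
from typing import Dict, List


def load_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping comments and blanks."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_consumer_config(props: Dict[str, str]) -> List[str]:
    """Return human-readable findings for settings this guide flags."""
    findings: List[str] = []
    if not props.get("group.id"):
        findings.append("group.id is missing")
    if props.get("auto.offset.reset", "latest") == "latest":
        findings.append(
            "auto.offset.reset=latest: a group without committed offsets "
            "will skip existing messages"
        )
    # Assumed defaults when the key is absent (recent Java client values).
    session_ms = int(props.get("session.timeout.ms", 45000))
    if session_ms < 10000:
        findings.append(f"session.timeout.ms={session_ms} is below 10s")
    max_poll_ms = int(props.get("max.poll.interval.ms", 300000))
    if max_poll_ms < 300000:
        findings.append(f"max.poll.interval.ms={max_poll_ms} is below 5min")
    return findings


if __name__ == "__main__":
    sample = """
    # hypothetical consumer.properties
    group.id=order-service-group
    auto.offset.reset=latest
    session.timeout.ms=6000
    """
    for finding in check_consumer_config(load_properties(sample)):
        print("WARN:", finding)
```

A finding is a prompt to look closer, not proof of the root cause; 'latest' is a valid choice for groups that should only see new data.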
A fix you cannot prove is a guess. Close the loop.
- After fix, run kafka-consumer-groups --bootstrap-server <broker> --group <group> --describe and confirm LAG decreases to 0 or near 0
- Add a metric counter in the consumer's poll loop to confirm records are being processed
- Produce a test message with a known key and value, then verify consumer receives it within seconds
- Check consumer logs for 'Assigning partitions' and 'Revoking partitions'; they should stabilize after one rebalance
- Use kafkacat -C -b <broker> -t <topic> -p 0 -o beginning -c 1 to read a raw message and compare format
Things that make this bug worse or harder to find.
- Don't assume the problem is the broker; always check consumer config first
- Don't blindly increase timeouts without understanding why rebalancing is happening
- Don't ignore deserialization exceptions; they often get swallowed in logs
- Don't use subscribe() with a pattern that matches more topics than intended
- Don't forget that consumer groups are persistent: deleting and recreating the group resets offsets to auto.offset.reset
The Silent Consumer: A Kafka Group Stuck at Latest Offset
Timeline
- 09:15 Alert: 'OrderConsumer' lag exceeds 100k messages across 12 partitions
- 09:20 Checked kafka-consumer-groups --describe: LAG=0, but offset=0 for all partitions
- 09:25 Reviewed consumer logs: no ERROR, only INFO 'poll returned 0 records' every 3 seconds
- 09:30 Produced test message manually: console consumer sees it, but our Java consumer doesn't
- 09:35 Checked producer logs: messages sent to topic 'orders-v2' (with a typo in our config: 'orders-v1')
- 09:40 Corrected topic name in consumer config: orders-v2, restarted consumer
- 09:42 Lag starts dropping: kafka-consumer-groups shows offset increasing
- 09:50 All messages consumed, alert cleared
At 09:15, PagerDuty screamed: our order consumer had 100k backlog. I SSH'd into the Kafka pod and ran kafka-consumer-groups. The group 'order-service-group' showed 0 lag, but the offset was 0 — meaning it had never consumed anything. The logs were clean: just 'poll returned 0 records' in an infinite loop. I produced a test message from the command line; the console consumer saw it instantly, but our Java app stayed silent.
I checked the producer side. The team had migrated to a new topic 'orders-v2' the night before, but the consumer config still pointed to 'orders-v1'. The consumer was happily polling an empty topic. A simple grep in our Helm chart confirmed the typo. I updated the config map, triggered a rolling restart, and within minutes the consumer started draining the backlog.
The root cause was a classic config drift: producer and consumer configurations were maintained in different repos. The producer team had updated the topic name but had not notified the consumer team. The fix was to centralize topic names in a shared config, and to add a health check that verifies the consumer is actually subscribed to the topic that has data.
Root cause
Consumer subscribed to a topic ('orders-v1') that no longer received messages; producer sent to 'orders-v2'.
The fix
Updated the consumer config to subscribe to 'orders-v2' and restarted the consumers in the group.
The lesson
Always verify topic subscription matches the actual data source. Use a config validation step in CI/CD to catch mismatches.
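A CI validation step of this kind can be very small. The sketch below assumes the shared config boils down to a list of known topic names; the function name and the idea of suggesting near matches with difflib are illustrative choices, not something the incident team is known to have used.

```python
import difflib
from typing import Dict, Iterable, List


def find_topic_mismatches(
    consumer_topics: Iterable[str], shared_topics: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each unknown consumer topic to close matches in the shared list."""
    known = sorted(set(shared_topics))
    mismatches: Dict[str, List[str]] = {}
    for topic in consumer_topics:
        if topic not in known:
            # Near matches usually reveal a typo or a stale version suffix.
            mismatches[topic] = difflib.get_close_matches(topic, known)
    return mismatches


if __name__ == "__main__":
    problems = find_topic_mismatches(["orders-v1"], ["orders-v2", "payments"])
    for topic, suggestions in problems.items():
        print(f"unknown topic {topic!r}; did you mean {suggestions}?")
    raise SystemExit(1 if problems else 0)
```

Exiting non-zero lets the pipeline fail the build whenever a consumer references a topic the shared list does not contain.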
Rebalancing is Kafka's mechanism to distribute partitions among consumers in a group. It happens when a consumer joins or leaves, or when partitions are added or removed. With the eager rebalance protocol, all consumers in the group stop processing until the new assignment is complete. If your consumer logs show repeated 'Assigning partitions' and 'Revoking partitions', the group is stuck in a rebalance loop.
Common causes: session.timeout.ms too low (default 10s before Kafka 3.0, 45s since), so that a long GC pause or network hiccup makes the coordinator miss heartbeats and evict a healthy consumer; or max.poll.interval.ms too low for the workload (default 5min), so that a consumer still busy processing is removed from the group for taking too long between poll() calls. Fix: increase session.timeout.ms to at least 30s and max.poll.interval.ms to 10min, or lower max.poll.records so each batch finishes sooner. Also ensure processing time per batch is well under these limits.
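Whether a batch fits inside max.poll.interval.ms is simple arithmetic: worst-case records per poll times worst-case time per record. The sketch below works it through, using the Java client defaults of max.poll.records=500 and max.poll.interval.ms=300000 as assumed starting values; the function name and the per-record timings are hypothetical.

```python
from typing import Dict, Union


def poll_budget(
    max_poll_records: int = 500,
    per_record_ms: float = 100.0,
    max_poll_interval_ms: int = 300_000,
) -> Dict[str, Union[float, bool]]:
    """Compare worst-case batch processing time with the poll interval."""
    worst_case_ms = max_poll_records * per_record_ms
    return {
        "worst_case_ms": worst_case_ms,
        "headroom_ms": max_poll_interval_ms - worst_case_ms,
        "safe": worst_case_ms < max_poll_interval_ms,
    }


if __name__ == "__main__":
    # 500 records at 100 ms each: 50 s of work, well inside 5 min.
    print(poll_budget(per_record_ms=100))
    # 500 records at 700 ms each: 350 s of work, exceeds 5 min.
    print(poll_budget(per_record_ms=700))
```

When the budget is negative, lowering max.poll.records is often a gentler fix than raising the interval, because it also shortens the time a stuck consumer holds its partitions.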
When a consumer group commits offsets, it stores them in the __consumer_offsets topic. New groups start with no committed offset, so auto.offset.reset determines where to start. The default value is 'latest', meaning the consumer will only read messages produced after it starts — a common trap. For bulk reprocessing, set auto.offset.reset='earliest' on first run, then switch to 'none' (which errors if no offset) to avoid future surprises.
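Use kafka-consumer-groups --reset-offsets to manually set offsets for an existing group. The tool needs a scope (--topic <topic> or --all-topics) and only works while the group has no active members, so stop the consumers first. Example: kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --topic <topic> --reset-offsets --to-earliest --execute. This is useful after a fix to replay missed messages.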
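The start-position rules described above can be written down as a small decision function. This is a simplified model for reasoning, not client code: the function and exception names are hypothetical, and it assumes the reset policy also applies when a committed offset has fallen outside the log's current range.

```python
from typing import Optional


class NoOffsetForPartition(Exception):
    """Stand-in for the error raised when the policy is 'none'."""


def starting_offset(
    committed: Optional[int], policy: str, log_start: int, log_end: int
) -> int:
    """Model where a consumer begins reading one partition."""
    # A valid committed offset always wins; the policy is ignored.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # No usable committed offset: fall back to auto.offset.reset.
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    if policy == "none":
        raise NoOffsetForPartition("no committed offset and policy is 'none'")
    raise ValueError(f"unknown auto.offset.reset value: {policy!r}")


if __name__ == "__main__":
    # New group on a partition holding offsets 0..99 (log end is 100).
    print(starting_offset(None, "latest", 0, 100))
    print(starting_offset(None, "earliest", 0, 100))
    # Existing group: the policy no longer matters.
    print(starting_offset(42, "latest", 0, 100))
```

The model makes the trap visible: changing auto.offset.reset on a group that already has committed offsets changes nothing, which is why --reset-offsets exists.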
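If the consumer cannot deserialize a record, the deserializer throws inside poll(). With the plain Java client, poll() surfaces this as a SerializationException (RecordDeserializationException in newer clients) and the consumer's position does not advance past the bad record, so a loop that catches and ignores the exception retries the same offset forever and the partition looks silent. Frameworks behave differently: Kafka Streams, for example, can be configured with a DeserializationExceptionHandler that logs and skips, in which case records disappear with only a log line. To catch these, log deserialization failures at WARN or ERROR with topic, partition, and offset, then either seek past the bad record or store it in a dead-letter queue.
Common mismatches: producer uses StringSerializer but consumer expects Avro; or producer uses a custom serializer but the consumer doesn't have the class. Check the record's raw bytes by reading the last message with kafkacat (-o -1 -c 1) and piping the output through a hex dumper such as xxd.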
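One way to keep bad records visible is to consume raw bytes and decode them in application code, so a failure is logged and parked instead of stalling or vanishing. The sketch below shows only that decode-or-park step, with JSON standing in for the real format and a plain list standing in for a dead-letter topic; the names and the (partition, offset, value) record shape are assumptions for illustration.

```python
import json
import logging
from typing import Any, Dict, List, Tuple

log = logging.getLogger("consumer.deserialization")

RawRecord = Tuple[int, int, bytes]  # (partition, offset, raw value)


def decode_batch(
    records: List[RawRecord], dead_letters: List[Dict[str, Any]]
) -> List[Any]:
    """Decode each record; park undecodable ones instead of dropping them."""
    decoded = []
    for partition, offset, raw in records:
        try:
            decoded.append(json.loads(raw))
        except ValueError as exc:  # covers JSON and Unicode decode errors
            log.warning(
                "undecodable record at partition=%d offset=%d: %s",
                partition, offset, exc,
            )
            dead_letters.append(
                {"partition": partition, "offset": offset,
                 "raw": raw, "error": str(exc)}
            )
    return decoded


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    parked: List[Dict[str, Any]] = []
    batch = [(0, 10, b'{"order": 1}'), (0, 11, b"not-json"), (0, 12, b'{"order": 2}')]
    print(decode_batch(batch, parked))
    print([d["offset"] for d in parked])
```

Keeping the raw bytes alongside the offset makes it possible to replay the parked records once the serializer mismatch is fixed.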
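A firewall or network policy can block the consumer's connection to the broker, but the Kafka client might not report it prominently. The consumer keeps retrying metadata requests, and depending on client version and logging configuration the failures may show up only as repeated NetworkClient log lines that are easy to miss. Check connectivity with: telnet <broker-ip> 9092 (or whatever port). Also verify that the consumer's bootstrap.servers points at reachable brokers in the intended cluster; it does not need to list every broker, but at least one entry must be reachable.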
On Kubernetes, use netcat on the consumer pod: kubectl exec <pod> -- nc -zv <broker-service> 9092. If connection succeeds but consumer still gets no messages, check the broker's advertised.listeners — if it's set to localhost, the consumer from outside can't reach it.
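Where telnet or nc is not installed in the consumer image, the same reachability test can be run with a few lines of standard-library Python. This is a minimal sketch; the function name is hypothetical and the host and port in the example are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unresolvable host, ...
        return False


if __name__ == "__main__":
    # Placeholder address; substitute each entry of bootstrap.servers.
    print(can_connect("localhost", 9092))
```

A successful connect only proves the bootstrap address is reachable. The client then reconnects to whatever the broker advertises, so repeat the check against the host and port from advertised.listeners as well.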
Frequently asked questions
Why does my consumer poll() always return an empty list?
Check if the consumer is assigned any partitions (kafka-consumer-groups --describe). If it has partitions but still empty, verify auto.offset.reset: if 'latest' and the consumer just started, it will only see new messages. If it has no partitions, the group is not subscribing correctly — check topic name typo or regex pattern.
How do I verify that my consumer is actually connected to the broker?
Enable DEBUG logging for org.apache.kafka.clients.NetworkClient: log4j.logger.org.apache.kafka.clients.NetworkClient=DEBUG. You should see 'Sending metadata request' and 'Received metadata response'. Also check the broker's server.log for the consumer's IP in 'Accepted connection'. Use kafka-consumer-groups --describe --members to see active members.
My consumer rebalances every few seconds. What's wrong?
This is a rebalance storm. Common causes: session.timeout.ms too low (default 10s before Kafka 3.0, 45s since), so that missed heartbeats during a GC pause or network hiccup get the consumer evicted; or max.poll.interval.ms too low (default 5min) when processing per batch exceeds it. Increase both: session.timeout.ms=30000, max.poll.interval.ms=600000. Also ensure your poll loop doesn't block for long (e.g., no slow synchronous database calls), or lower max.poll.records so each batch is smaller.
My consumer group shows lag but no messages are processed. Why?
The consumer is assigned partitions and has offsets, but is likely stuck on a record it cannot deserialize: with the plain Java client, poll() keeps throwing at the same offset until you seek past it. Check logs for SerializationException, and log such failures at WARN with topic, partition, and offset. Also verify the key.deserializer and value.deserializer match the producer's serializers.
How do I reset my consumer offset to reprocess old messages?
Use the kafka-consumer-groups command: kafka-consumer-groups --bootstrap-server localhost:9092 --group <group> --topic <topic> --reset-offsets --to-earliest --execute (use --all-topics instead of --topic to reset every topic the group consumes). This works for existing groups, and only while the group has no active members. For a fresh group, set auto.offset.reset=earliest in the consumer config. Be careful: this will cause the consumer to reprocess all messages from the beginning of the topic.