# Message Queue Issues, Incidents, and Mitigation Strategies
## Performance Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Queue saturation | **S:** E-commerce site during a Black Friday sale. **T:** Handle 10x normal order volume. **A:** The queue saturated as producers wrote faster than consumers could process. **R:** The site became unresponsive, with 30+ minute checkout delays. | Fixed-size queues; insufficient consumer scaling; monolithic consumers; lack of backpressure mechanisms | **Consumer-driven Throttling Pattern**: implement backpressure where consumers signal capacity to producers (see the sketch after this table) | 2020 Robinhood trading platform outage during market volatility; 2018 Netflix Christmas Eve outage when AWS SQS became saturated |
| High latency in message delivery | **S:** Healthcare monitoring system tracking patient vitals. **T:** Process and alert on critical readings within seconds. **A:** Network congestion created a 45+ second delay in message processing. **R:** Critical patient alerts were delayed, resulting in delayed intervention. | Network congestion; excessive message size; poor queue placement; chatty applications; cross-datacenter messaging | **Queue Locality Pattern**: place queues close to both producers and consumers, and implement edge queuing | 2019 Slack outage where message delivery was delayed by minutes; multiple AWS SQS regional delays reported on StatusPage |
| Slow consumer processing | **S:** Financial data processing pipeline. **T:** Process transaction data for end-of-day reporting. **A:** Consumer logic contained expensive database operations, slowing processing. **R:** Processing fell behind and reporting deadlines were missed. | Synchronous processing within consumers; DB calls in the processing path; unoptimized deserialization; heavy computational workloads | **Consumer Decomposition Pattern**: break complex consumers into staged processing chains with intermediate queues | 2021 Coinbase transaction processing delays; GitHub webhook delivery backlogs during high-activity periods |
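
The Consumer-driven Throttling Pattern above comes down to giving producers a capacity signal. Below is a minimal in-process sketch using only Python's standard library; the queue size and worker count are illustrative, and real brokers express the same idea through prefetch limits, credit-based flow control, or bounded-queue policies rather than a shared `queue.Queue`.

```python
import queue
import threading
import time

# A bounded queue is the backpressure signal: when consumers fall behind, the
# producer is forced to slow down instead of letting the backlog grow unbounded.
work = queue.Queue(maxsize=100)

def producer(orders):
    for order in orders:
        while True:
            try:
                work.put(order, timeout=0.5)   # blocks briefly while the queue is full
                break
            except queue.Full:
                time.sleep(0.1)                # throttle: consumers report "no capacity"

def consumer():
    while True:
        order = work.get()
        if order is None:                      # sentinel to stop the worker
            work.task_done()
            return
        process(order)
        work.task_done()

def process(order):
    time.sleep(0.01)                           # stand-in for real processing work

if __name__ == "__main__":
    workers = [threading.Thread(target=consumer, daemon=True) for _ in range(4)]
    for w in workers:
        w.start()
    producer(range(1000))
    for _ in workers:
        work.put(None)
    for w in workers:
        w.join()
```

The design choice worth noting is that the bounded queue turns "consumers are behind" into something the producer observes immediately, rather than something discovered later as a 30-minute checkout backlog.
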
## Reliability Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Message loss | **S:** Banking system processing transfers. **T:** Ensure all money transfers complete reliably. **A:** The queue server crashed before persisting messages to disk. **R:** Several customer transfers vanished, requiring manual reconciliation. | In-memory queue storage; insufficient replication; improper queue persistence configuration; aggressive timeouts; non-transactional operations | **At-Least-Once Delivery Pattern**: idempotent consumers with message acknowledgment after processing | 2017 Square payment processing issue with lost transactions; RabbitMQ message loss reported in multiple incidents before v3.8 |
| Duplicate message delivery | **S:** Notification service for a mobile app. **T:** Send push notifications to users. **A:** A network timeout caused redelivery, but the original message had already been processed. **R:** Users received duplicate notifications, increasing complaint volume. | Aggressive redelivery settings; poor consumer tracking; network instability; improper ack handling; client disconnections | **Idempotent Consumer Pattern**: track processed message IDs and deduplicate at the consumer (see the sketch after this table) | Stripe duplicate payment processing in 2018; Twilio SMS duplicate deliveries reported during network instability |
| Out-of-order message processing | **S:** Content management system updating articles. **T:** Process edits in the correct sequence. **A:** Message processing across a partitioned queue occurred out of sequence. **R:** Article content showed inconsistent state to readers. | Multi-partition queues; lack of sequence IDs; non-sticky partition routing; consumer autoscaling; race conditions | **Sequencing Pattern**: use sequence numbers and a reorder buffer at the consumer | Kafka ordering guarantees broken when topic partition counts changed; Apache Pulsar out-of-order delivery issues in multi-datacenter setups |
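
A hedged sketch of the Idempotent Consumer Pattern from the duplicate-delivery row. The `IdempotentConsumer` class and the message shape (`{"id": ..., "body": ...}`) are assumptions for illustration, not any specific client library's API.

```python
import threading

class IdempotentConsumer:
    """Drops redelivered messages by tracking processed message IDs.

    Sketch only: the seen-ID set lives in memory, so deduplication does not
    survive restarts; a production consumer would record IDs in a durable
    store (e.g. a database table with a unique constraint) and expire old
    entries to bound growth.
    """

    def __init__(self, handler):
        self._handler = handler          # side-effecting work, e.g. send a push
        self._seen = set()               # IDs of messages already processed
        self._lock = threading.Lock()

    def on_message(self, message):
        msg_id = message["id"]           # assumes producers attach a stable ID
        with self._lock:
            if msg_id in self._seen:
                return "duplicate-skipped"   # ack the redelivery, skip the side effect
        self._handler(message["body"])       # process first ...
        with self._lock:
            self._seen.add(msg_id)           # ... then record, mirroring ack-after-processing
        return "processed"


# Usage: the redelivered message is acknowledged but the side effect runs once.
sent = []
consumer = IdempotentConsumer(handler=sent.append)
consumer.on_message({"id": "msg-42", "body": "push: payment received"})
consumer.on_message({"id": "msg-42", "body": "push: payment received"})  # redelivery
assert sent == ["push: payment received"]
```
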
## Consistency & Data Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Poisonous messages | **S:** Log processing service. **T:** Parse and store application logs. **A:** A malformed log entry caused the consumer to crash, requeue the message, and crash again. **R:** The entire log processing pipeline stalled. | Lack of message validation; poor error handling; automatic requeueing; brittle deserialization; strict schema enforcement | **Dead Letter Channel Pattern**: with a circuit breaker after the retry limit (see the sketch after this table) | RabbitMQ consumer crashes causing poison message loops; Elasticsearch ingest node failures with malformed documents |
| Schema evolution issues | **S:** Microservice ecosystem with an event-driven architecture. **T:** Deploy a new version with schema changes. **A:** The new producer version sent messages consumers couldn't parse. **R:** Downstream processing failed and multiple service alerts were triggered. | Tight schema coupling; no schema versioning; hard validation failures; backward-incompatible changes; uncoordinated deployments | **Schema Registry Pattern**: with backward/forward compatibility enforcement | Confluent Schema Registry misconfigurations; LinkedIn Kafka schema evolution incidents outlined in their tech blog |
| Inconsistent message state | **S:** Distributed order system. **T:** Process orders across multiple services. **A:** A network partition caused replicated queues to diverge. **R:** Order inventory and billing were processed differently, creating data inconsistencies. | Eventual consistency models; lack of consensus algorithms; optimistic replication; split-brain scenarios; poor partition handling | **Event Sourcing Pattern**: with conflict resolution strategies | MongoDB replica set split-brain scenarios; Kafka cross-region replication inconsistencies |
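
The Dead Letter Channel Pattern in the poisonous-messages row can be sketched with two in-process queues. `MAX_ATTEMPTS`, the envelope fields, and the `handle` function are illustrative stand-ins; real brokers (RabbitMQ dead-letter exchanges, SQS redrive policies, and similar) provide the dead-letter routing and redelivery counting natively.

```python
import queue

MAX_ATTEMPTS = 3

main_queue = queue.Queue()
dead_letter_queue = queue.Queue()      # parked messages for offline inspection

def handle(envelope):
    # Stand-in for real parsing; raises on malformed ("poison") input.
    if envelope["body"] is None:
        raise ValueError("malformed payload")

def consume_one():
    envelope = main_queue.get()
    try:
        handle(envelope)
    except Exception as exc:
        envelope["attempts"] = envelope.get("attempts", 0) + 1
        if envelope["attempts"] >= MAX_ATTEMPTS:
            # Break the crash/requeue loop: park the message instead of retrying forever.
            envelope["error"] = str(exc)
            dead_letter_queue.put(envelope)
        else:
            main_queue.put(envelope)   # bounded retry
    finally:
        main_queue.task_done()

# Usage: a poison message lands in the DLQ after MAX_ATTEMPTS tries,
# while the pipeline keeps draining healthy messages.
main_queue.put({"body": None})
for _ in range(MAX_ATTEMPTS):
    consume_one()
assert dead_letter_queue.qsize() == 1
```
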
## Operational Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Queue monitoring gaps | **S:** SaaS platform with backend processing. **T:** Ensure timely processing of user requests. **A:** Queue depth grew silently without alerting. **R:** A substantial processing backlog was discovered only after customer complaints. | Lack of instrumentation; monitoring rates but not absolute depth; no predictive monitoring; missing alert thresholds; incomplete observability | **Queue Telemetry Pattern**: with predictive monitoring and SLO-based alerting (see the sketch after this table) | 2020 Datadog agent queue monitoring gaps during high volume; Heroku Redis queue monitoring blind spots reported by users |
| Complex failure recovery | **S:** Payment processing system. **T:** Recover from a datacenter outage. **A:** The replicated queue failed over, but replication lag caused inconsistent state. **R:** Manual intervention was required to reconcile transactions, delaying recovery. | Ad-hoc recovery procedures; lack of recovery automation; inconsistent backups; undefined recovery point objectives; complex manual steps | **Queue Checkpointing Pattern**: with automated recovery orchestration | 2019 CockroachDB recovery complexity detailed in a post-mortem; Elasticsearch queue recovery challenges during master failures |
| Queue configuration drift | **S:** Multi-environment microservice deployment. **T:** Maintain consistent behavior across environments. **A:** Queue timeout settings differed between staging and production. **R:** Tests passed in staging but failed in production with timeout errors. | Manual configuration; environment-specific settings; configuration stored outside version control; lack of validation; ad-hoc changes | **Infrastructure as Code Pattern**: with configuration validation and drift detection | Puppet Labs reports on RabbitMQ misconfiguration causing production issues; ActiveMQ configuration drift incidents discussed on Apache mailing lists |
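
The Queue Telemetry Pattern is mostly about alerting on absolute backlog and message age, not just throughput. The thresholds and function below are hypothetical; in practice the same checks would run as alert rules in whatever metrics system scrapes the broker, since most brokers expose queue depth and oldest-message age.

```python
import time

# Hypothetical thresholds: alert on absolute backlog and on how stale the
# oldest message is, not only on processing rate.
MAX_DEPTH = 10_000
MAX_OLDEST_AGE_SECONDS = 300

def check_queue_health(depth, oldest_enqueued_at, now=None):
    """Return a list of alert strings for a single queue snapshot."""
    now = now if now is not None else time.time()
    alerts = []
    if depth > MAX_DEPTH:
        alerts.append(f"queue depth {depth} exceeds {MAX_DEPTH}")
    oldest_age = now - oldest_enqueued_at
    if oldest_age > MAX_OLDEST_AGE_SECONDS:
        alerts.append(f"oldest message is {oldest_age:.0f}s old "
                      f"(SLO is {MAX_OLDEST_AGE_SECONDS}s)")
    return alerts

# Usage: a poller (cron job, sidecar, metrics exporter) calls this on every
# scrape and pages when the alert list is non-empty.
snapshot = {"depth": 25_000, "oldest_enqueued_at": time.time() - 900}
for alert in check_queue_health(snapshot["depth"], snapshot["oldest_enqueued_at"]):
    print("ALERT:", alert)
```
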
## Design & Implementation Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Inappropriate queue topology | **S:** Media processing platform handling video conversions. **T:** Process large video files through multiple conversion steps. **A:** A single queue used for all processing stages created bottlenecks. **R:** Processing throughput collapsed under load. | One-size-fits-all queue design; lack of domain-driven queue boundaries; monolithic queue usage; missing specialized queues; insufficient workload analysis | **Staged Event-Driven Architecture (SEDA) Pattern**: specialized queues per processing stage | Netflix video processing pipeline redesign discussed in their tech blog; Spotify's event delivery system evolution in their engineering blog |
| Poor message prioritization | **S:** Customer support ticket system. **T:** Process urgent support requests quickly. **A:** All tickets sat in the same queue with FIFO processing. **R:** Critical issues waited behind minor requests and SLAs were breached. | FIFO-only queuing; missing message attributes; lack of priority lanes; single consumer group; uniform processing strategy | **Priority Queue Pattern**: multiple consumer groups for different priorities | Zendesk scaling challenges with ticket prioritization; Jira issue processing prioritization failures discussed in an Atlassian blog |
| Inadequate retry policies | **S:** IoT device command processing. **T:** Ensure commands reach offline devices when they reconnect. **A:** A fixed retry window expired before devices came online. **R:** Commands were lost, requiring manual resends. | Fixed retry counts; linear retry timing; missing persistent retry queues; memory-only retry tracking; uniform retry policies | **Exponential Backoff Pattern**: with a persistent retry store and TTL policies (see the sketch after this table) | AWS IoT Core message delivery failures detailed in forums; MQTT broker retry handling challenges in Eclipse Mosquitto |
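
A sketch of the Exponential Backoff Pattern from the retry-policy row, with full jitter and a TTL cutoff. The function names and the `send` callback are assumptions for illustration; the persistent retry store mentioned in the table is deliberately out of scope, so in this form retry state would not survive a process restart.

```python
import random
import time

BASE_DELAY = 1.0          # seconds before the first retry
MAX_DELAY = 300.0         # cap so the delay cannot grow without bound
TTL_SECONDS = 24 * 3600   # stop retrying once a command is too old to be useful

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(MAX_DELAY, BASE_DELAY * (2 ** attempt)))

def deliver_with_retries(send, command, enqueued_at,
                         clock=time.monotonic, sleep=time.sleep):
    """Retry send(command) with growing, jittered delays until it succeeds
    or the command's TTL expires. `send` should return True once the device acks."""
    attempt = 0
    while clock() - enqueued_at < TTL_SECONDS:
        if send(command):
            return True
        sleep(backoff_delay(attempt))    # spread retries out instead of hammering
        attempt += 1
    return False                         # expired: route to a dead-letter queue or operator


# Usage: a device that only comes online after a few attempts still gets the command.
if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_send(command):
        calls["n"] += 1
        return calls["n"] >= 3           # the "device reconnects" on the third attempt

    ok = deliver_with_retries(flaky_send, {"device": "pump-7", "cmd": "reboot"},
                              enqueued_at=time.monotonic(), sleep=lambda s: None)
    print("delivered:", ok, "after", calls["n"], "attempts")
```
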
## Security Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Unauthorized queue access | **S:** Internal analytics processing pipeline. **T:** Process sensitive business data securely. **A:** Default queue credentials were used, allowing unintended access. **R:** A competitor obtained access to business intelligence data. | Default credentials; shared access keys; overly permissive ACLs; lack of authentication; missing network isolation | **Valet Key Pattern**: short-lived, scoped access tokens | 2017 insecure RabbitMQ installations exposed on the internet; Redis queue security vulnerabilities allowing unauthorized access |
| Message tampering | **S:** Supply chain management system. **T:** Process orders between partners securely. **A:** A man-in-the-middle attack modified order quantities. **R:** Excess inventory was ordered at the partner's expense. | Unencrypted message transport; missing message signatures; plain-text sensitive data; lack of message integrity checking; insecure endpoints | **Message Envelope Encryption Pattern**: digital signatures and content integrity validation (see the sketch after this table) | 2019 AMQP plaintext credential exposure; Apache ActiveMQ CVE-2015-5254 allowing remote code execution |
| Denial-of-service attacks | **S:** Public API with a message queue backend. **T:** Process legitimate API requests. **A:** An attacker flooded the queue with large messages. **R:** Queue resources were exhausted and the service became unavailable. | No rate limiting; unbounded message sizes; missing authentication; resource over-allocation; public queue endpoints | **Throttling Pattern**: client identification and resource quotas | RabbitMQ memory exhaustion from large messages; Kafka broker DoS vulnerabilities discussed in security advisories |
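
Content integrity for messages in transit can be illustrated with a shared-secret HMAC. This is a simplification of the Message Envelope Encryption Pattern in the tampering row, which would typically add payload encryption and asymmetric digital signatures on top; the secret, field names, and order payload below are hypothetical.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"example-secret"   # hypothetical; load from a secret manager in practice

def sign_message(payload: dict) -> dict:
    """Attach an HMAC-SHA256 tag computed over the canonical JSON body."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return {"body": body.decode(), "signature": tag}

def verify_message(envelope: dict) -> dict:
    """Recompute the tag and reject the message if it does not match."""
    expected = hmac.new(SHARED_SECRET, envelope["body"].encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    if not hmac.compare_digest(expected, envelope["signature"]):
        raise ValueError("message integrity check failed: possible tampering")
    return json.loads(envelope["body"])

# Usage: a tampered quantity fails verification at the consumer.
envelope = sign_message({"order_id": "PO-1001", "quantity": 10})
tampered = dict(envelope, body=envelope["body"].replace('"quantity": 10', '"quantity": 9999'))
verify_message(envelope)                 # passes
try:
    verify_message(tampered)
except ValueError as exc:
    print(exc)
```
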
## Scalability Issues
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Poor horizontal scaling | **S:** Real-time analytics platform. **T:** Scale to handle a 5x increase in event volume. **A:** A single queue broker became the bottleneck despite consumer scaling. **R:** Analytics were delayed and dashboards showed stale data. | Single-broker design; shared queue infrastructure; non-partitioned workloads; fixed cluster sizing; monolithic message store | **Competing Consumers Pattern**: sharded queues and dynamic consumer scaling | Twitter's migration from a single to a distributed queue architecture; Reddit comment processing delays during traffic spikes |
| Queue rebalancing problems | **S:** Streaming data processing platform. **T:** Add new processing nodes to handle increased load. **A:** Rebalancing caused all processing to pause for minutes. **R:** Data processing SLA violations and delayed insights. | Stop-the-world rebalancing; lack of incremental scaling; large partition counts; non-elastic architecture; stateful consumers | **Partition Assignment Strategy Pattern**: incremental cooperative rebalancing | Kafka rebalancing storms in large clusters; LinkedIn's struggles with consumer rebalancing detailed in their engineering blog |
| Uneven load distribution | **S:** Log aggregation service. **T:** Process logs from thousands of sources evenly. **A:** Hash-based partitioning created hotspots on specific partitions. **R:** Some partitions were overwhelmed while others sat idle, reducing effective throughput. | Naive hash partitioning; static partition assignment; skewed workloads; uniform resource allocation; key-based routing without analysis | **Consistent Hashing Pattern**: load-aware dynamic partitioning (see the sketch after this table) | Elasticsearch hot shard issues in production clusters; Amazon Kinesis partition key design challenges in the AWS blog |
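
A sketch of the consistent-hashing half of the pattern in the uneven-load row: virtual nodes spread skewed keys more evenly, and adding a broker moves only a fraction of keys. The class, node names, and vnode count are illustrative; the load-aware part (moving virtual nodes based on observed partition load) is left out.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps message keys to brokers/partitions with minimal reshuffling when
    nodes are added or removed; virtual nodes smooth out hotspots from skewed keys."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key: str):
        if not self._ring:
            raise RuntimeError("ring is empty")
        point = self._hash(key)
        idx = bisect.bisect(self._ring, (point, "")) % len(self._ring)
        return self._ring[idx][1]            # first ring point clockwise from the key

# Usage: adding a broker moves only a fraction of keys, unlike modulo hashing,
# where nearly every key can change owner.
ring = ConsistentHashRing(["broker-1", "broker-2", "broker-3"])
print(ring.node_for("source-frontend-logs"))
ring.add_node("broker-4")
print(ring.node_for("source-frontend-logs"))
```
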