Message Queue Issues, Incidents, and Mitigation Strategies #

Performance Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Queue saturation | **S:** E-commerce site during a Black Friday sale<br>**T:** Handle 10x normal order volume<br>**A:** Queue saturated as producers wrote faster than consumers could process<br>**R:** Site became unresponsive with 30+ minute checkout delays | Fixed-size queues; insufficient consumer scaling; monolithic consumers; lack of backpressure mechanisms | **Consumer-driven Throttling Pattern**: implement backpressure where consumers signal capacity to producers | 2020 Robinhood trading platform outage during market volatility; 2012 Netflix Christmas Eve outage caused by an AWS ELB disruption |
| High latency in message delivery | **S:** Healthcare monitoring system tracking patient vitals<br>**T:** Process and alert on critical readings within seconds<br>**A:** Network congestion created 45+ second delays in message processing<br>**R:** Critical patient alerts were delayed, postponing intervention | Network congestion; excessive message size; poor queue placement; chatty applications; cross-datacenter messaging | **Queue Locality Pattern**: place queues close to both producers and consumers; implement edge queuing | 2019 Slack outage in which message delivery was delayed by minutes; multiple AWS SQS regional delays reported on its status page |
| Slow consumer processing | **S:** Financial data processing pipeline<br>**T:** Process transaction data for end-of-day reporting<br>**A:** Consumer logic contained expensive database operations, slowing processing<br>**R:** Processing fell behind and reporting deadlines were missed | Synchronous processing within consumers; DB calls in the processing path; unoptimized deserialization; heavy computational workloads | **Consumer Decomposition Pattern**: break complex consumers into staged processing chains with intermediate queues | 2021 Coinbase transaction processing delays; GitHub webhook delivery backlogs during high-activity periods |
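
The backpressure idea behind the Consumer-driven Throttling pattern can be sketched with nothing more than a bounded queue: when consumers fall behind, producers block or shed load instead of saturating the broker. The sketch below is illustrative Python, not a real broker API; the queue size, timeout, and order names are assumptions.

```python
import queue
import threading
import time

# Illustrative sketch: a bounded queue gives producers implicit backpressure.
# When consumers fall behind, put() blocks (or times out), forcing the
# producer to slow down instead of saturating the queue.
orders = queue.Queue(maxsize=100)  # fixed capacity = backpressure point

def producer(n_orders: int) -> None:
    for i in range(n_orders):
        try:
            # Block for at most 2 s; on timeout, shed load or defer the order
            orders.put(f"order-{i}", timeout=2.0)
        except queue.Full:
            print(f"backpressure: deferring order-{i}")

def consumer() -> None:
    while True:
        order = orders.get()
        time.sleep(0.01)  # simulate processing work
        orders.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer(500)
orders.join()  # wait until all accepted orders are processed
```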

Reliability Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Message loss | **S:** Banking system processing transfers<br>**T:** Ensure all money transfers complete reliably<br>**A:** Queue server crashed before persisting messages to disk<br>**R:** Several customer transfers vanished, requiring manual reconciliation | In-memory queue storage; insufficient replication; improper queue persistence configuration; aggressive timeouts; non-transactional operations | **At-Least-Once Delivery Pattern** with idempotent consumers and message acknowledgment only after processing | 2017 Square payment processing issue with lost transactions; RabbitMQ message loss reported in multiple incidents before v3.8 |
| Duplicate message delivery | **S:** Notification service for a mobile app<br>**T:** Send push notifications to users<br>**A:** Network timeout caused redelivery, but the original message was already processed<br>**R:** Users received duplicate notifications, increasing complaint volume | Aggressive redelivery settings; poor consumer tracking; network instability; improper ack handling; client disconnections | **Idempotent Consumer Pattern**: track processed message IDs and deduplicate at the consumer | Stripe duplicate payment processing in 2018; Twilio SMS duplicate deliveries reported during network instability |
| Out-of-order message processing | **S:** Content management system updating articles<br>**T:** Process edits in the correct sequence<br>**A:** Message processing across a partitioned queue occurred out of sequence<br>**R:** Article content showed inconsistent state to readers | Multi-partition queues; lack of sequence IDs; non-sticky partition routing; consumer autoscaling; race conditions | **Sequencing Pattern**: use sequence numbers and a reorder buffer at the consumer | Kafka ordering guarantees broken when topic partition counts change; Apache Pulsar out-of-order delivery issues in multi-datacenter setups |
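
A minimal sketch of the Idempotent Consumer pattern from the table above, in Python: the consumer remembers processed message IDs, so an at-least-once redelivery is acknowledged without repeating its side effect. The in-memory set and the `send_push_notification` helper are illustrative; a production deduplication store would be durable (e.g., Redis or a database table).

```python
# Minimal sketch of the Idempotent Consumer pattern: remember processed
# message IDs so a redelivered message is acknowledged but not re-applied.
# In production the seen-ID store would be durable, not an in-memory set.
processed_ids: set[str] = set()

def handle(message_id: str, payload: dict) -> None:
    if message_id in processed_ids:
        return  # duplicate delivery: acknowledge without side effects
    send_push_notification(payload)   # the real side effect (illustrative)
    processed_ids.add(message_id)     # record only after success

def send_push_notification(payload: dict) -> None:
    print("notify:", payload)

# At-least-once delivery may replay the same message:
handle("msg-42", {"user": "alice", "text": "hello"})
handle("msg-42", {"user": "alice", "text": "hello"})  # deduplicated
```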

Consistency & Data Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Poisonous messages | **S:** Log processing service<br>**T:** Parse and store application logs<br>**A:** A malformed log entry caused the consumer to crash, requeue the message, and crash again<br>**R:** Entire log processing pipeline stalled | Lack of message validation; poor error handling; automatic requeueing; brittle deserialization; strict schema enforcement | **Dead Letter Channel Pattern** with a circuit breaker after the retry limit | RabbitMQ consumer crashes causing poison-message loops; Elasticsearch ingest node failures with malformed documents |
| Schema evolution issues | **S:** Microservice ecosystem with an event-driven architecture<br>**T:** Deploy a new version with schema changes<br>**A:** The new producer version sent messages consumers couldn't parse<br>**R:** Downstream processing failed, triggering multiple service alerts | Tight schema coupling; no schema versioning; hard validation failures; backward-incompatible changes; uncoordinated deployments | **Schema Registry Pattern** with backward/forward compatibility enforcement | Confluent Schema Registry misconfigurations; LinkedIn Kafka schema evolution incidents outlined on its engineering blog |
| Inconsistent message state | **S:** Distributed order system<br>**T:** Process orders across multiple services<br>**A:** A network partition caused replicated queues to diverge<br>**R:** Inventory and billing processed orders differently, creating data inconsistencies | Eventual consistency models; lack of consensus algorithms; optimistic replication; split-brain scenarios; poor partition handling | **Event Sourcing Pattern** with conflict resolution strategies | MongoDB replica set split-brain scenarios; Kafka cross-region replication inconsistencies |
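
To make the Dead Letter Channel pattern concrete, here is an illustrative Python sketch: a malformed ("poisonous") message is retried a bounded number of times and then parked on a dead-letter queue instead of being requeued forever. The queue names, retry limit, and parser are assumptions made for the example.

```python
import queue

# Sketch of the Dead Letter Channel pattern: a poisonous message is retried
# a bounded number of times, then moved to a dead-letter queue instead of
# being requeued forever and stalling the pipeline.
MAX_RETRIES = 3
main_q: "queue.Queue[tuple[int, str]]" = queue.Queue()  # (attempts, payload)
dead_letters: "queue.Queue[str]" = queue.Queue()

def parse_log_line(line: str) -> dict:
    level, _, msg = line.partition(":")
    if not msg:
        raise ValueError(f"malformed log entry: {line!r}")
    return {"level": level, "message": msg}

def consume() -> None:
    while not main_q.empty():
        attempts, line = main_q.get()
        try:
            print("stored:", parse_log_line(line))
        except ValueError:
            if attempts + 1 >= MAX_RETRIES:
                dead_letters.put(line)  # park it for manual inspection
            else:
                main_q.put((attempts + 1, line))  # bounded requeue

for raw in ["INFO:started", "garbage-entry", "WARN:low disk"]:
    main_q.put((0, raw))
consume()
print("dead-lettered:", list(dead_letters.queue))
```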

Operational Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Queue monitoring gaps | **S:** SaaS platform with backend processing<br>**T:** Ensure timely processing of user requests<br>**A:** Queue depth grew silently without alerting<br>**R:** Substantial processing backlog discovered only after customer complaints | Lack of instrumentation; monitoring rates but not absolute queue depth; no predictive monitoring; missing alert thresholds; incomplete observability | **Queue Telemetry Pattern** with predictive monitoring and SLO-based alerting | 2020 Datadog agent queue monitoring gaps during high volume; Heroku Redis queue monitoring blind spots reported by users |
| Complex failure recovery | **S:** Payment processing system<br>**T:** Recover from a datacenter outage<br>**A:** The replicated queue failed over, but replication lag caused inconsistent state<br>**R:** Manual intervention was required to reconcile transactions, delaying recovery | Ad-hoc recovery procedures; lack of recovery automation; inconsistent backups; undefined recovery point objectives; complex manual steps | **Queue Checkpointing Pattern** with automated recovery orchestration | 2019 CockroachDB recovery complexity detailed in a post-mortem; Elasticsearch queue recovery challenges during master failures |
| Queue configuration drift | **S:** Multi-environment microservice deployment<br>**T:** Maintain consistent behavior across environments<br>**A:** Queue timeout settings differed between staging and production<br>**R:** Tests passed in staging but failed in production with timeout errors | Manual configuration; environment-specific settings; configuration stored outside version control; lack of validation; ad-hoc changes | **Infrastructure as Code Pattern** with configuration validation and drift detection | Puppet Labs reports on RabbitMQ misconfiguration causing production issues; ActiveMQ configuration drift incidents discussed on Apache mailing lists |
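
As a sketch of the Queue Telemetry pattern's core idea, the illustrative Python below alerts on both absolute queue depth and its growth rate, the two signals the monitoring-gaps row says are often missed. The thresholds and sampling format are assumptions, not any vendor's API.

```python
import time

# Sketch of the Queue Telemetry pattern: alert on absolute queue depth and
# on its growth rate, not just on processing throughput. Thresholds and the
# metrics source are illustrative assumptions.
DEPTH_LIMIT = 10_000   # absolute backlog threshold
GROWTH_LIMIT = 50.0    # sustained growth, in messages per second

def check_queue_health(samples: list[tuple[float, int]]) -> list[str]:
    """samples: (unix_timestamp, queue_depth) pairs, oldest first."""
    alerts = []
    t0, d0 = samples[0]
    t1, d1 = samples[-1]
    if d1 > DEPTH_LIMIT:
        alerts.append(f"backlog: depth {d1} exceeds {DEPTH_LIMIT}")
    growth = (d1 - d0) / max(t1 - t0, 1e-9)
    if growth > GROWTH_LIMIT:
        alerts.append(f"trend: queue growing at {growth:.0f} msg/s")
    return alerts

now = time.time()
print(check_queue_health([(now - 60, 2_000), (now, 12_500)]))
```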

Design & Implementation Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Inappropriate queue topology | **S:** Media processing platform handling video conversions<br>**T:** Process large video files through multiple conversion steps<br>**A:** A single queue used for all processing stages created bottlenecks<br>**R:** Processing throughput collapsed under load | One-size-fits-all queue design; lack of domain-driven queue boundaries; monolithic queue usage; missing specialized queues; insufficient workload analysis | **Staged Event-Driven Architecture (SEDA) Pattern** with specialized queues per processing stage | Netflix video processing pipeline redesign discussed on its tech blog; Spotify's event delivery system evolution on its engineering blog |
| Poor message prioritization | **S:** Customer support ticket system<br>**T:** Process urgent support requests quickly<br>**A:** All tickets shared one queue with FIFO processing<br>**R:** Critical issues waited behind minor requests and SLAs were breached | FIFO-only queuing; missing message attributes; lack of priority lanes; single consumer group; uniform processing strategy | **Priority Queue Pattern** with multiple consumer groups for different priorities | Zendesk scaling challenges with ticket prioritization; Jira issue processing prioritization failures discussed on the Atlassian blog |
| Inadequate retry policies | **S:** IoT device command processing<br>**T:** Ensure commands reach offline devices when they reconnect<br>**A:** A fixed retry window expired before devices came online<br>**R:** Commands were lost, requiring manual resends | Fixed retry counts; linear retry timing; missing persistent retry queues; memory-only retry tracking; uniform retry policies | **Exponential Backoff Pattern** with a persistent retry store and TTL policies | AWS IoT Core message delivery failures detailed in forums; MQTT broker retry-handling challenges in Eclipse Mosquitto |
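
The Exponential Backoff pattern in the retry-policies row can be sketched as follows, assuming an illustrative `deliver_command` stand-in for the real transport: each failed attempt waits roughly twice as long as the last, capped and jittered so reconnecting devices don't retry in lockstep.

```python
import random
import time

# Sketch of the Exponential Backoff pattern with full jitter: each retry
# waits up to twice as long as the last, capped, with randomness to avoid
# synchronized retry storms. deliver_command() is an illustrative stand-in.
def deliver_with_backoff(command: str, max_attempts: int = 6,
                         base: float = 0.5, cap: float = 30.0) -> bool:
    for attempt in range(max_attempts):
        if deliver_command(command):
            return True
        # full jitter: sleep a random time up to the exponential ceiling
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        time.sleep(delay)
    return False  # hand off to a persistent retry store / DLQ here

def deliver_command(command: str) -> bool:
    return random.random() < 0.3  # simulate a device that is often offline

print("delivered" if deliver_with_backoff("reboot") else "gave up")
```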

Security Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Unauthorized queue access | **S:** Internal analytics processing pipeline<br>**T:** Process sensitive business data securely<br>**A:** Default queue credentials allowed unintended access<br>**R:** A competitor obtained access to business intelligence data | Default credentials; shared access keys; overly permissive ACLs; lack of authentication; missing network isolation | **Valet Key Pattern** with short-lived, scoped access tokens | 2017 reports of insecure RabbitMQ installations exposed on the internet; Redis queue security vulnerabilities allowing unauthorized access |
| Message tampering | **S:** Supply chain management system<br>**T:** Process orders between partners securely<br>**A:** A man-in-the-middle attack modified order quantities<br>**R:** Excess inventory was ordered at the partner's expense | Unencrypted message transport; missing message signatures; plain-text sensitive data; lack of message integrity checking; insecure endpoints | **Message Envelope Encryption Pattern** with digital signatures and content integrity validation | 2019 AMQP plaintext credential exposure; Apache ActiveMQ CVE-2015-5254 allowing remote code execution |
| Denial-of-service attacks | **S:** Public API with a message queue backend<br>**T:** Process legitimate API requests<br>**A:** An attacker flooded the queue with large messages<br>**R:** Queue resources were exhausted, making the service unavailable | No rate limiting; unbounded message sizes; missing authentication; resource over-allocation; public queue endpoints | **Throttling Pattern** with client identification and resource quotas | RabbitMQ memory exhaustion from large messages; Kafka broker DoS vulnerabilities discussed in security advisories |
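
A common realization of the Throttling pattern is a per-client token bucket, sketched below in illustrative Python: each identified client accrues tokens at a fixed rate, and a flood from one client is rejected once its bucket empties. The rate and burst values are assumptions for the example.

```python
import time
from collections import defaultdict

# Sketch of the Throttling pattern: a per-client token bucket caps how fast
# each identified client may enqueue messages. Rates are illustrative.
RATE = 10.0    # tokens (messages) added per second per client
BURST = 20.0   # bucket capacity: the largest allowed burst

buckets: dict[str, tuple[float, float]] = defaultdict(
    lambda: (BURST, time.monotonic()))  # client -> (tokens, last_refill)

def allow(client_id: str, cost: float = 1.0) -> bool:
    tokens, last = buckets[client_id]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill
    if tokens >= cost:
        buckets[client_id] = (tokens - cost, now)
        return True   # enqueue the message
    buckets[client_id] = (tokens, now)
    return False      # reject: client is over its quota

accepted = sum(allow("attacker") for _ in range(100))
print(f"accepted {accepted} of 100 burst requests")  # roughly BURST
```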

Scalability Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Poor horizontal scaling | **S:** Real-time analytics platform<br>**T:** Scale to handle a 5x increase in event volume<br>**A:** A single queue broker became the bottleneck despite consumer scaling<br>**R:** Analytics lagged and dashboards showed stale data | Single-broker design; shared queue infrastructure; non-partitioned workloads; fixed cluster sizing; monolithic message store | **Competing Consumers Pattern** with sharded queues and dynamic consumer scaling | Twitter's migration from a single to a distributed queue architecture; Reddit comment processing delays during traffic spikes |
| Queue rebalancing problems | **S:** Streaming data processing platform<br>**T:** Add new processing nodes to handle increased load<br>**A:** Rebalancing paused all processing for minutes<br>**R:** Data processing SLA violations and delayed insights | Stop-the-world rebalancing; lack of incremental scaling; large partition counts; non-elastic architecture; stateful consumers | **Partition Assignment Strategy Pattern** with incremental cooperative rebalancing | Kafka rebalancing storms in large clusters; LinkedIn's struggles with consumer rebalancing detailed on its engineering blog |
| Uneven load distribution | **S:** Log aggregation service<br>**T:** Process logs from thousands of sources evenly<br>**A:** Hash-based partitioning created hotspots on specific partitions<br>**R:** Some partitions were overwhelmed while others sat idle, reducing effective throughput | Naive hash partitioning; static partition assignment; skewed workloads; uniform resource allocation; key-based routing without analysis | **Consistent Hashing Pattern** with load-aware dynamic partitioning | Elasticsearch hot-shard issues in production clusters; Amazon Kinesis partition key design challenges discussed on the AWS blog |
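
Finally, a minimal sketch of the Consistent Hashing pattern referenced in the uneven-load row: virtual nodes spread each partition around a hash ring so keys distribute evenly, and adding a partition remaps only a small share of keys. The class and partition names are illustrative.

```python
import bisect
import hashlib
from collections import Counter

# Sketch of the Consistent Hashing pattern: each partition is placed at many
# virtual positions on a hash ring, smoothing out hotspots; a key maps to
# the first virtual node clockwise from its own hash.
class HashRing:
    def __init__(self, partitions: list[str], vnodes: int = 100):
        self._ring: list[tuple[int, str]] = []
        for p in partitions:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{p}#{v}"), p))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def partition_for(self, key: str) -> str:
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing([f"partition-{i}" for i in range(4)])
counts = Counter(ring.partition_for(f"source-{i}") for i in range(10_000))
print(counts)  # roughly even spread across the four partitions
```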
