Distributed Log Issues, Incidents, and Mitigation Strategies #

Data Consistency Issues #

Each entry below gives the issue, a STAR incident example (Situation, Task, Action, Result), the patterns that contribute to the issue, the canonical solution pattern, and real-world incidents where the problem has appeared.

Issue: Log divergence
STAR Incident Example:
S: Multi-region financial system tracking transactions
T: Maintain consistent transaction history across regions
A: Network partition caused logs to accept different writes in different regions
R: Transaction histories diverged, requiring complex reconciliation
Contributing Patterns:
- Weak consistency models
- Asynchronous replication
- Multi-master configuration
- Lack of consensus protocol
- Aggressive failover
Canonical Solution Pattern: Consistent Prefix Pattern with consensus protocol (Raft/Paxos)
Real-world Incidents:
- MongoDB replica set divergence incidents
- Elasticsearch split-brain scenarios before version 7.x
- CockroachDB post-mortems on consistency challenges

Issue: Phantom reads
STAR Incident Example:
S: Analytics platform processing event streams
T: Generate accurate daily reports
A: Some log segments were read twice due to improper offset management
R: Inflated metrics reported to stakeholders
Contributing Patterns:
- Improper checkpoint management
- Concurrent reader processes
- Manual offset commits
- Consumer restarts without state
- Race conditions
Canonical Solution Pattern: Read-Committed Consumer Pattern with transactional offset management (a minimal sketch follows this table)
Real-world Incidents:
- Kafka consumer group rebalances causing duplicate processing
- Kinesis replay incidents documented in AWS forums

Issue: Log truncation
STAR Incident Example:
S: Compliance logging system for financial trades
T: Retain all trade logs for 7 years
A: Automatic log rotation truncated old logs prematurely
R: Regulatory audit failed due to missing records
Contributing Patterns:
- Aggressive retention policies
- Limited storage capacity
- Improper backup procedures
- Missing immutability guarantees
- Volume-based truncation
Canonical Solution Pattern: Log Archival Pattern with immutable, tiered storage
Real-world Incidents:
- Elasticsearch log loss in production clusters
- Splunk data truncation incidents reported by users
- HDFS log truncation due to storage pressure
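
The Read-Committed Consumer Pattern referenced above hinges on committing the processed result and the consumer's offset atomically, so a restart can never re-count a log segment. Below is a minimal, library-free sketch of that idea using SQLite as the state store; the `process_batch` and `daily_totals` names are illustrative and not tied to any particular product.

```python
# Minimal sketch: commit processed results and the consumer offset in one
# transaction, so a crash or replay never double-counts a log segment.
import sqlite3

def setup(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS daily_totals (day TEXT PRIMARY KEY, total INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS consumer_offsets (consumer TEXT PRIMARY KEY, next_offset INTEGER)")
    conn.commit()

def last_committed_offset(conn: sqlite3.Connection, consumer: str) -> int:
    row = conn.execute("SELECT next_offset FROM consumer_offsets WHERE consumer = ?", (consumer,)).fetchone()
    return row[0] if row else 0

def process_batch(conn: sqlite3.Connection, consumer: str, log: list[dict]) -> None:
    """Read from the last committed offset; write results and offset atomically."""
    start = last_committed_offset(conn, consumer)
    batch = log[start:]
    if not batch:
        return
    with conn:  # single transaction: either both updates land, or neither does
        for event in batch:
            conn.execute(
                "INSERT INTO daily_totals (day, total) VALUES (?, ?) "
                "ON CONFLICT(day) DO UPDATE SET total = total + excluded.total",
                (event["day"], event["amount"]),
            )
        conn.execute(
            "INSERT INTO consumer_offsets (consumer, next_offset) VALUES (?, ?) "
            "ON CONFLICT(consumer) DO UPDATE SET next_offset = excluded.next_offset",
            (consumer, start + len(batch)),
        )

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    setup(conn)
    events = [{"day": "2024-01-01", "amount": 5}, {"day": "2024-01-01", "amount": 7}]
    process_batch(conn, "report-builder", events)
    process_batch(conn, "report-builder", events)  # replayed batch: no double counting
    print(conn.execute("SELECT * FROM daily_totals").fetchall())  # [('2024-01-01', 12)]
```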

Performance Issues #

Issue: Write amplification
STAR Incident Example:
S: IoT monitoring platform
T: Handle millions of sensor readings per minute
A: Each write caused excessive internal writes due to poor log structure
R: Storage subsystem overloaded, causing high latency and throttling
Contributing Patterns:
- Small record sizes
- Excessive indexing
- Frequent checkpoints
- High update frequency
- Journal-heavy design
Canonical Solution Pattern: Log Compaction Pattern with batched writes and efficient record layout (see the sketch after this table)
Real-world Incidents:
- Cassandra write amplification in time-series workloads
- InfluxDB performance degradation under high cardinality
- RocksDB amplification issues on certain storage types

Issue: Read hotspots
STAR Incident Example:
S: Real-time dashboard system
T: Display current system metrics from logs
A: Recent log segments repeatedly accessed by multiple readers
R: High disk IO, increased latency for all operations
Contributing Patterns:
- Tail-heavy read patterns
- Missing read caching
- Latest-only access patterns
- Read-after-write scenarios
- Polling-based consumption
Canonical Solution Pattern: Read Cache Pattern with materialized views for popular segments
Real-world Incidents:
- Kafka read hotspots on recent partitions
- Elasticsearch hot shard issues with time-based indices
- MongoDB slow queries on recent oplog entries

Issue: Log compaction bottlenecks
STAR Incident Example:
S: E-commerce site tracking user activity
T: Maintain manageable log size while preserving latest values
A: Compaction process consumed excessive resources during peak hours
R: Overall system slowdown affecting customer experience
Contributing Patterns:
- Aggressive compaction scheduling
- Monolithic compaction jobs
- Missing resource limits
- Synchronous compaction
- Large log-to-space ratio
Canonical Solution Pattern: Tiered Compaction Pattern with resource-aware scheduling
Real-world Incidents:
- RocksDB compaction stalls in production
- Cassandra compaction strategy failures
- Kafka log compaction performance problems at scale

Issue: Log fragmentation
STAR Incident Example:
S: Log analytics platform
T: Process logs efficiently for rapid query results
A: Frequent small appends created thousands of small files
R: Query performance degraded due to excessive file operations
Contributing Patterns:
- Small write batches
- High-frequency ingestion
- Fixed file rolling policies
- Missing consolidation
- Append-only operations
Canonical Solution Pattern: Log Consolidation Pattern with adaptive segment management
Real-world Incidents:
- HDFS small files problem in production clusters
- Elasticsearch shard fragmentation issues
- Splunk indexer fragmentation challenges
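
To make the Log Compaction Pattern concrete, here is a minimal sketch of key-based compaction: a segment is rewritten so that only the latest record per key survives, and a `None` value acts as a tombstone. Real systems (Kafka, RocksDB) schedule this work incrementally and retain tombstones for a configurable period; this sketch drops them immediately for brevity.

```python
# Minimal sketch of key-based log compaction: a segment of (offset, key, value)
# records is rewritten so only the most recent value per key survives.
# A value of None is treated as a tombstone and drops the key entirely.
from typing import Optional

Record = tuple[int, str, Optional[str]]  # (offset, key, value)

def compact(segment: list[Record]) -> list[Record]:
    latest: dict[str, Record] = {}
    for record in segment:            # later offsets overwrite earlier ones
        _, key, value = record
        if value is None:
            latest.pop(key, None)     # tombstone: remove the key
        else:
            latest[key] = record
    # keep the surviving records in their original offset order
    return sorted(latest.values(), key=lambda r: r[0])

if __name__ == "__main__":
    segment = [
        (0, "user:1", "logged_in"),
        (1, "user:2", "logged_in"),
        (2, "user:1", "checked_out"),
        (3, "user:2", None),          # tombstone
    ]
    print(compact(segment))           # [(2, 'user:1', 'checked_out')]
```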

Operational Issues #

Issue: Replication lag
STAR Incident Example:
S: Global gaming leaderboard system
T: Maintain consistent player rankings across regions
A: Replication lag grew to minutes during peak play periods
R: Players saw inconsistent rankings, leading to confusion and complaints
Contributing Patterns:
- Asynchronous replication
- Network bandwidth constraints
- Write-heavy workloads
- Single-threaded apply
- Large transaction sizes
Canonical Solution Pattern: Parallel Apply Pattern with lag monitoring and adaptive throttling
Real-world Incidents:
- MySQL replication lag incidents
- Kafka MirrorMaker replication delays
- MongoDB secondary lag during high write loads

Issue: Failed node recovery
STAR Incident Example:
S: Critical infrastructure monitoring system
T: Recover failed logging node without data loss
A: Recovery process required full data copy from leader, taking hours
R: Monitoring gaps during extended recovery period
Contributing Patterns:
- Full sync recovery model
- Large log sizes
- Lack of incremental recovery
- Missing change tracking
- Monolithic log structure
Canonical Solution Pattern: Segmented Recovery Pattern with incremental catch-up
Real-world Incidents:
- Elasticsearch recovery incidents
- Kafka broker recovery challenges
- HDFS NameNode recovery time problems

Issue: Log corruption
STAR Incident Example:
S: Payment processing audit system
T: Ensure all payment logs are valid and readable
A: Storage subsystem errors caused silent log corruption
R: Audit trails contained gaps, compliance issues reported
Contributing Patterns:
- Missing integrity checks
- No end-to-end verification
- Hardware failures
- Filesystem corruption
- Power failures during writes
Canonical Solution Pattern: Forward Error Correction Pattern with checksums and repair mechanisms (the checksum half is sketched after this table)
Real-world Incidents:
- ZooKeeper data corruption incidents
- Kafka log corruption due to disk failures
- ELK stack corrupted indices after improper shutdown

Issue: Quota exhaustion
STAR Incident Example:
S: Customer analytics platform
T: Store all customer interaction logs
A: Unexpected traffic spike exhausted storage quota
R: New log entries rejected, losing valuable data
Contributing Patterns:
- Fixed quota allocation
- Missing monitoring alerts
- Inadequate headroom
- Growth projection failures
- Monolithic storage pools
Canonical Solution Pattern: Dynamic Resource Allocation Pattern with predictive scaling and multi-tier storage
Real-world Incidents:
- Google Cloud Logging quota exceeded errors
- AWS CloudWatch Logs throttling incidents
- Splunk indexer volume cap issues in SaaS deployments
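
The log corruption row's solution pattern starts with detection. The sketch below shows only that half: a CRC32 stored alongside each record at append time and verified on read. Repair in production systems comes from replicas or erasure coding, which is out of scope here; the record format is illustrative.

```python
# Minimal sketch of per-record integrity checking: each appended record carries
# a CRC32 of its payload, and readers verify it before trusting the entry.
# Detection only -- repair in real systems comes from replicas or erasure codes.
import json
import zlib

def append(log: list[str], payload: dict) -> None:
    body = json.dumps(payload, sort_keys=True)
    crc = zlib.crc32(body.encode("utf-8"))
    log.append(f"{crc:08x} {body}")

def read_all(log: list[str]) -> list[dict]:
    records = []
    for position, line in enumerate(log):
        stored_crc, body = line.split(" ", 1)
        if int(stored_crc, 16) != zlib.crc32(body.encode("utf-8")):
            raise ValueError(f"corrupt record at position {position}")
        records.append(json.loads(body))
    return records

if __name__ == "__main__":
    log: list[str] = []
    append(log, {"txn": 1, "amount": 100})
    append(log, {"txn": 2, "amount": 250})
    # simulate silent corruption of the second record's payload
    log[1] = log[1].replace('"amount": 250', '"amount": 999')
    try:
        read_all(log)
    except ValueError as err:
        print(err)   # corrupt record at position 1
```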

Scalability Issues #

Issue: Shard imbalance
STAR Incident Example:
S: Social media event logging system
T: Evenly distribute logging load across infrastructure
A: Popular accounts routed a disproportionate volume of logs to specific shards
R: Some shards overloaded while others were underutilized
Contributing Patterns:
- Key-based sharding
- Static shard assignment
- Skewed workload patterns
- Unchanging partition strategy
- Missing load balancing
Canonical Solution Pattern: Dynamic Sharding Pattern with workload-aware rebalancing (a hash-ring sketch follows this table)
Real-world Incidents:
- Elasticsearch hot shards in production
- MongoDB chunk migration challenges
- Kafka partition skew with certain key distributions

Issue: Partition growth limits
STAR Incident Example:
S: Enterprise logging infrastructure
T: Scale to accommodate company growth
A: Individual partitions reached size limits while the cluster had available space
R: Log ingestion failed despite adequate total capacity
Contributing Patterns:
- Fixed partition sizing
- Single-dimension scaling
- Limited partition counts
- Vertical-only growth model
- Pre-allocated partitions
Canonical Solution Pattern: Hierarchical Partitioning Pattern with recursive splitting strategies
Real-world Incidents:
- HDFS individual file size limitations
- Elasticsearch single shard size recommendations
- Cassandra partition size limitations

Issue: Throughput ceiling
STAR Incident Example:
S: Ad impression tracking system
T: Handle holiday season traffic spike
A: Log system throughput plateaued despite adding nodes
R: Log sampling required, reducing analytics accuracy
Contributing Patterns:
- Single-writer designs
- Coordinator bottlenecks
- Lock contention
- Sequential consistency requirements
- Centralized metadata
Canonical Solution Pattern: Multi-Leader Pattern with conflict resolution strategies
Real-world Incidents:
- ZooKeeper write throughput limitations
- Single Kafka controller bottlenecks
- Centralized Elasticsearch master node constraints

Issue: Fan-in bottlenecks
STAR Incident Example:
S: Distributed security monitoring system
T: Aggregate logs from thousands of endpoints
A: Central collectors became a bottleneck for log processing
R: Security event detection delayed, increasing the vulnerability window
Contributing Patterns:
- Centralized aggregation
- Direct endpoint-to-central logging
- Missing intermediate aggregation
- Star topology design
- Push-based architectures
Canonical Solution Pattern: Hierarchical Collection Pattern with edge aggregation and processing
Real-world Incidents:
- Fluentd buffer overflow issues
- Logstash scaling challenges in large deployments
- Splunk forwarder-to-indexer bottlenecks
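
One common ingredient of the Dynamic Sharding Pattern is a consistent-hash ring with virtual nodes, so keys spread evenly and adding a shard remaps only a small fraction of them. The sketch below shows hash placement alone and is only one piece of the pattern; a workload-aware rebalancer would additionally track per-shard load before moving virtual nodes.

```python
# Minimal sketch of a consistent-hash ring with virtual nodes: each shard owns
# many points on the ring, so keys spread evenly and adding a shard only
# remaps a small fraction of them.
import bisect
import hashlib

class HashRing:
    def __init__(self, shards: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []
        for shard in shards:
            self.add_shard(shard, vnodes)

    def _hash(self, key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add_shard(self, shard: str, vnodes: int = 100) -> None:
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

if __name__ == "__main__":
    ring = HashRing(["shard-a", "shard-b", "shard-c"])
    before = {k: ring.shard_for(k) for k in (f"account-{i}" for i in range(10_000))}
    ring.add_shard("shard-d")
    moved = sum(1 for k, s in before.items() if ring.shard_for(k) != s)
    print(f"{moved / len(before):.1%} of keys moved")  # roughly a quarter, not all
```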

Security Issues #

Issue: Log injection
STAR Incident Example:
S: Web application security logging
T: Record user activities for security monitoring
A: Attacker injected formatted log entries via the username field
R: Log parsers misinterpreted injected entries, creating false events
Contributing Patterns:
- Unvalidated user input in logs
- Raw string concatenation
- Plain text log formats
- Missing encoding/escaping
- Format string vulnerabilities
Canonical Solution Pattern: Structured Logging Pattern with strict schema validation (see the sketch after this table)
Real-world Incidents:
- Elasticsearch Logstash SSRF vulnerabilities
- Log4j/Log4Shell vulnerabilities
- Splunk injection attacks via forged events

Issue: Sensitive data exposure
STAR Incident Example:
S: Healthcare system audit logging
T: Maintain HIPAA-compliant activity records
A: PHI accidentally included in plaintext logs
R: Compliance violation, potential data breach reported
Contributing Patterns:
- Overly verbose logging
- Missing data filtering
- Inadequate masking rules
- Catch-all logging practices
- Debug logging in production
Canonical Solution Pattern: Sanitized Logging Pattern with PII detection and redaction
Real-world Incidents:
- Various GDPR violations from log exposure
- Exposed access tokens in GitHub logs
- Cloud provider logging exposing sensitive credentials

Issue: Unauthorized access
STAR Incident Example:
S: Financial transaction logging system
T: Restrict log access to authorized personnel
A: Weak ACLs allowed broad internal access to log data
R: Sensitive financial data accessed by unauthorized employees
Contributing Patterns:
- Coarse-grained permissions
- Shared service accounts
- Missing access auditing
- Homogeneous security policy
- Default configurations
Canonical Solution Pattern: Log Access Control Pattern with attribute-based access control
Real-world Incidents:
- Elasticsearch open instance discoveries
- Kibana unauthorized access incidents
- MongoDB exposed log collections

Issue: Tamper resistance failures
STAR Incident Example:
S: Banking regulatory compliance system
T: Maintain tamper-evident audit logs
A: Administrator modified logs to hide policy violations
R: Regulatory audit failed due to detected inconsistencies
Contributing Patterns:
- Mutable log storage
- Missing cryptographic proofs
- Administrator super-access
- Lack of external verification
- Single system of record
Canonical Solution Pattern: Immutable Append-Only Pattern with cryptographic verification chains
Real-world Incidents:
- MongoDB oplog manipulation incidents
- Manual Elasticsearch document alteration
- Log deletion to hide intrusion evidence
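
A minimal sketch of the Structured Logging and Sanitized Logging ideas together: each event is emitted as a single JSON object, so newline injection cannot forge extra entries, and fields whose names appear in an illustrative `REDACT_KEYS` set are masked before the record is written. Real deployments would pair this with schema validation at ingestion.

```python
# Minimal sketch of structured, sanitized logging: every event is one JSON
# object (so newline/control-character injection cannot forge extra entries)
# and fields with sensitive names are redacted before the record is written.
import json
import sys
from datetime import datetime, timezone

REDACT_KEYS = {"password", "ssn", "token", "credit_card"}  # illustrative list

def _sanitize(value):
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in REDACT_KEYS else _sanitize(v)
                for k, v in value.items()}
    return value

def log_event(event: str, **fields) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **_sanitize(fields),
    }
    # json.dumps escapes newlines and control characters, so the record stays
    # exactly one line regardless of what user input contains
    sys.stdout.write(json.dumps(record, ensure_ascii=True) + "\n")

if __name__ == "__main__":
    hostile_username = 'alice\n2024-01-01T00:00:00Z event="admin_login" user="mallory"'
    log_event("login_failed", username=hostile_username, password="hunter2")
```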

Query & Analytics Issues #

Issue: Query timeouts
STAR Incident Example:
S: E-commerce analytics platform
T: Generate business insights from customer activity logs
A: Complex queries across large log spans timed out
R: Business decisions made with incomplete data
Contributing Patterns:
- Full-scan query patterns
- Missing query optimizations
- Inadequate indexing
- Large result sets
- Long time-range queries
Canonical Solution Pattern: Materialized View Pattern with pre-aggregation strategies
Real-world Incidents:
- Elasticsearch query timeout issues
- Splunk search performance problems
- InfluxDB timeout incidents with high-cardinality data

Issue: Cardinality explosion
STAR Incident Example:
S: Infrastructure monitoring system
T: Track performance across a microservice ecosystem
A: High dimensionality of tags created billions of unique series
R: Query performance collapsed, dashboards unusable
Contributing Patterns:
- Unbounded dimensions
- High-cardinality fields
- Excessive tag combinations
- Missing cardinality limits
- Auto-generated identifiers as dimensions
Canonical Solution Pattern: Dimensional Reduction Pattern with cardinality limits and bucketing (a minimal sketch follows this table)
Real-world Incidents:
- Prometheus cardinality limits in production
- InfluxDB high-cardinality field issues
- Datadog overhead from high-cardinality metrics

Issue: Schema evolution problems
STAR Incident Example:
S: Long-running analytics platform
T: Add new log fields while maintaining historical query capability
A: Schema changes broke queries against historical data
R: Business reports showed inconsistent results across time periods
Contributing Patterns:
- Rigid schema definitions
- Schema-on-write approaches
- Breaking field type changes
- Missing schema versioning
- Inconsistent field naming
Canonical Solution Pattern: Schema-on-Read Pattern with backward-compatible evolution
Real-world Incidents:
- Elasticsearch mapping explosion issues
- Avro schema evolution challenges
- Parquet schema enforcement problems

Issue: Incomplete query results
STAR Incident Example:
S: Security information and event management system
T: Detect security breaches across the environment
A: Query federation returned partial results when some nodes timed out
R: Security incidents missed due to incomplete data analysis
Contributing Patterns:
- Timeout-based query termination
- Missing partial result handling
- Synchronous query patterns
- All-or-nothing error handling
- Poor node health detection
Canonical Solution Pattern: Partial Results Pattern with result quality indicators
Real-world Incidents:
- Elasticsearch partial search results
- Distributed Splunk search inconsistencies
- Federated GraphQL query timeout issues
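
The Dimensional Reduction Pattern can be sketched as two small guards applied before log-derived series are written: cap the number of distinct values a label may take (overflow collapses into `__other__`) and bucket numeric dimensions. The caps and bucket boundaries below are illustrative.

```python
# Minimal sketch of dimensional reduction: cap how many distinct values each
# label may contribute (overflow collapses into "__other__") and bucket
# numeric dimensions, so tag combinations stay bounded.
from collections import defaultdict

MAX_VALUES_PER_LABEL = 100              # illustrative cap
LATENCY_BUCKETS_MS = [10, 50, 100, 500, 1000]

class LabelLimiter:
    def __init__(self, max_values: int = MAX_VALUES_PER_LABEL):
        self.max_values = max_values
        self.seen: dict[str, set[str]] = defaultdict(set)

    def limit(self, label: str, value: str) -> str:
        values = self.seen[label]
        if value in values or len(values) < self.max_values:
            values.add(value)
            return value
        return "__other__"              # new value past the cap: collapse it

def bucket_latency(ms: float) -> str:
    for upper in LATENCY_BUCKETS_MS:
        if ms <= upper:
            return f"<={upper}ms"
    return f">{LATENCY_BUCKETS_MS[-1]}ms"

if __name__ == "__main__":
    limiter = LabelLimiter(max_values=2)
    series = []
    for request_id, latency in [("req-1", 12), ("req-2", 480), ("req-3", 3)]:
        series.append({
            # raw request IDs would explode cardinality; limit and bucket instead
            "request_id": limiter.limit("request_id", request_id),
            "latency": bucket_latency(latency),
        })
    print(series)
    # [{'request_id': 'req-1', 'latency': '<=50ms'},
    #  {'request_id': 'req-2', 'latency': '<=500ms'},
    #  {'request_id': '__other__', 'latency': '<=10ms'}]
```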

Integration & Ingestion Issues #

Issue: Log format incompatibility
STAR Incident Example:
S: Multi-vendor security log aggregation
T: Consolidate logs from diverse security products
A: Format variations caused parser failures for certain sources
R: Security blind spots in monitoring, increasing vulnerability
Contributing Patterns:
- Tight format coupling
- Brittle parsing logic
- Vendor-specific formats
- Missing format normalization
- Hardcoded field mappings
Canonical Solution Pattern: Canonical Log Format Pattern with adaptive parsing strategies
Real-world Incidents:
- ELK stack Logstash parsing failures
- Graylog extractor configuration issues
- Splunk Universal Forwarder field extraction problems

Issue: Rate limiting
STAR Incident Example:
S: Microservice observability platform
T: Collect traces from hundreds of services
A: Bursty log traffic triggered rate limiting
R: Gaps in observability during critical incidents
Contributing Patterns:
- Fixed ingestion capacity
- Missing backpressure mechanisms
- Traffic spikes during incidents
- Synchronous log transmission
- No buffering strategies
Canonical Solution Pattern: Adaptive Rate Limiting Pattern with client-side buffering
Real-world Incidents:
- Datadog API rate limiting during incidents
- AWS CloudWatch throttling errors
- Sumo Logic ingestion quota exceeded issues

Issue: Clock skew
STAR Incident Example:
S: Distributed transaction monitoring
T: Track operation timing across system components
A: Server clock differences caused confusion about event ordering
R: Root cause analysis failed due to apparent causality violations
Contributing Patterns:
- Reliance on local timestamps
- Missing time synchronization
- Distributed deployment
- Cross-datacenter logging
- Different time sources
Canonical Solution Pattern: Logical Clock Pattern with vector timestamps
Real-world Incidents:
- Google Cloud operations ordering problems
- Kubernetes log timestamp inconsistencies
- Distributed tracing causality issues in OpenTelemetry

Issue: Duplicate log entries
STAR Incident Example:
S: Business intelligence data pipeline
T: Process business events exactly once
A: Log replay after failure created duplicate entries
R: Inflated metrics reported to executives
Contributing Patterns:
- At-least-once delivery
- Missing deduplication
- Retry logic without idempotency
- Failover reprocessing
- Recovery from checkpoints
Canonical Solution Pattern: Idempotent Consumer Pattern with unique event identifiers (see the sketch after this table)
Real-world Incidents:
- Kafka consumer rebalancing duplicate processing
- Kinesis replay duplication issues
- Fluentd buffer flush duplications
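
A minimal sketch of the Idempotent Consumer Pattern: every event carries a unique identifier, and the consumer keeps a bounded window of recently seen IDs so at-least-once redelivery and post-failover replays are skipped rather than double-applied. The window size and the revenue counter are illustrative; production systems usually back the dedup window with a durable store.

```python
# Minimal sketch of an idempotent consumer: every event carries a unique ID,
# and a bounded window of recently seen IDs lets replays and at-least-once
# redelivery be skipped instead of double-applied.
from collections import OrderedDict

class IdempotentConsumer:
    def __init__(self, window_size: int = 100_000):
        self.window_size = window_size            # illustrative bound
        self.seen: "OrderedDict[str, None]" = OrderedDict()
        self.total_revenue = 0

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self.seen:
            return False                          # duplicate: ignore
        self.seen[event_id] = None
        if len(self.seen) > self.window_size:     # evict the oldest ID
            self.seen.popitem(last=False)
        self.total_revenue += event["amount"]     # the actual apply step
        return True

if __name__ == "__main__":
    consumer = IdempotentConsumer()
    events = [
        {"event_id": "order-42", "amount": 30},
        {"event_id": "order-43", "amount": 15},
        {"event_id": "order-42", "amount": 30},   # replayed after a failover
    ]
    for e in events:
        consumer.handle(e)
    print(consumer.total_revenue)                 # 45, not 75
```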

Monitoring & Observability Issues #

Issue: Log observability gaps
STAR Incident Example:
S: Critical payment processing system
T: Ensure complete visibility into transaction processing
A: Certain error paths lacked logging, creating blind spots
R: Customer-impacting failures went undetected for hours
Contributing Patterns:
- Inconsistent logging practices
- Missing error path instrumentation
- Insufficient context in logs
- Binary success/failure logging
- Logging as an afterthought
Canonical Solution Pattern: Structured Event Pattern with consistent context propagation
Real-world Incidents:
- PayPal transaction monitoring blind spots
- Stripe API error visibility challenges
- Netflix missing error logs in fallback paths

Issue: Alert fatigue
STAR Incident Example:
S: Infrastructure monitoring system
T: Notify operators of actual problems
A: Excessive log-based alerts created noise
R: Real outage alerts missed among false positives
Contributing Patterns:
- Threshold-based alerting
- Context-free log monitoring
- Alert-on-everything approach
- Missing alert prioritization
- Static alert configurations
Canonical Solution Pattern: Anomaly Detection Pattern with dynamic baselines and correlation
Real-world Incidents:
- PagerDuty incident reports on alert fatigue
- Datadog alert storm incidents
- New Relic alert tuning challenges

Issue: Needle-in-a-haystack problem
STAR Incident Example:
S: E-commerce troubleshooting scenario
T: Find the cause of specific order failures
A: Relevant error buried among billions of log entries
R: Extended resolution time impacting customer satisfaction
Contributing Patterns:
- Excessive debug logging
- Insufficient log structure
- Missing correlation IDs
- Flat log hierarchy
- Verbose third-party logs
Canonical Solution Pattern: Distributed Tracing Pattern with correlated request IDs (a minimal sketch follows this table)
Real-world Incidents:
- Zipkin trace correlation challenges
- Jaeger trace data volume management
- AWS X-Ray sampling configuration issues

Issue: Log cost explosion
STAR Incident Example:
S: SaaS platform with growing customer base
T: Maintain comprehensive logging while managing costs
A: Log volume grew exponentially with customer growth
R: Logging costs exceeded revenue for some customer tiers
Contributing Patterns:
- Uniform log verbosity
- Missing cost controls
- Log-everything approach
- Unoptimized retention
- Fixed sampling strategies
Canonical Solution Pattern: Dynamic Sampling Pattern with context-aware verbosity control
Real-world Incidents:
- Datadog billing surprises from log volume
- AWS CloudWatch Logs cost optimization challenges
- Splunk license capacity issues
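
For the needle-in-a-haystack row, the core of the Distributed Tracing Pattern is a request-scoped correlation ID stamped onto every log line. The sketch below does this within a single process using `contextvars` and a `logging.Filter`; propagating the ID across service boundaries (for example via trace headers) is assumed but not shown.

```python
# Minimal sketch of correlation-ID propagation: a request-scoped ID is stored
# in a contextvar and stamped onto every log record by a logging.Filter, so a
# single grep for the ID reconstructs one request's path through the code.
import contextvars
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s",
)
logger = logging.getLogger("orders")
logger.addFilter(CorrelationFilter())

def handle_order(order_id: str) -> None:
    # one ID per request; a real service would accept an inbound trace header
    correlation_id.set(str(uuid.uuid4()))
    logger.info("received order %s", order_id)
    charge_card(order_id)

def charge_card(order_id: str) -> None:
    logger.info("charging card for order %s", order_id)  # same ID, no plumbing

if __name__ == "__main__":
    handle_order("A-1001")
    handle_order("A-1002")   # a different correlation ID appears in these lines
```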

Archival & Compliance Issues #

Issue: Retention policy conflicts
STAR Incident Example:
S: Multinational corporation log management
T: Comply with conflicting regional regulations
A: Global retention policy violated GDPR right-to-be-forgotten requirements
R: Regulatory fine issued for compliance failure
Contributing Patterns:
- Single global policy
- Missing data sovereignty controls
- Monolithic log storage
- Immutable-only approach
- All-or-nothing retention
Canonical Solution Pattern: Segmented Compliance Pattern with regional policy enforcement
Real-world Incidents:
- GDPR compliance failures in global systems
- Cloud provider multi-region compliance challenges
- Financial industry retention requirement conflicts

Issue: Incomplete log capture
STAR Incident Example:
S: Healthcare system HIPAA compliance
T: Maintain a complete audit trail of PHI access
A: Some access paths were missed in the logging implementation
R: Failed compliance audit, requiring a remediation plan
Contributing Patterns:
- Inconsistent instrumentation
- Developer-dependent logging
- Architectural blind spots
- Retrofit logging approaches
- Missing log coverage testing
Canonical Solution Pattern: Aspect-Oriented Logging Pattern with automated coverage verification
Real-world Incidents:
- HIPAA violation cases from audit gaps
- SOX compliance failures in financial systems
- PCI-DSS audit failures from incomplete logging

Issue: Archival retrieval delays
STAR Incident Example:
S: Legal discovery process for litigation
T: Retrieve specific historical logs for court-ordered discovery
A: Archived logs took weeks to restore and search
R: Court sanctions for delayed evidence production
Contributing Patterns:
- Offline archival strategy
- Format changes in archives
- Missing archive indexing
- Tape-based sequential access
- Cold storage without search
Canonical Solution Pattern: Searchable Archive Pattern with tiered storage and indices
Real-world Incidents:
- Legal cases involving delayed e-discovery
- GDPR subject access request timing failures
- Financial audit retrieval compliance issues

Issue: Chain of custody breaks
STAR Incident Example:
S: Security breach investigation
T: Provide forensically sound evidence for the investigation
A: Log transfer process broke chain-of-custody validation
R: Evidence deemed inadmissible in legal proceedings
Contributing Patterns:
- Missing cryptographic verification
- Log transformation without validation
- Multiple handling steps
- Inadequate metadata preservation
- Format conversions
Canonical Solution Pattern: Forensic Logging Pattern with cryptographic proof chains (a hash-chain sketch follows this table)
Real-world Incidents:
- Court cases rejecting log evidence
- Internal investigation integrity challenges
- Regulatory findings on log manipulation potential
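
A minimal sketch of the hash-chain idea behind the Forensic Logging and Immutable Append-Only patterns: each entry records the hash of its predecessor, so modifying or deleting any record breaks verification from that point forward. A real chain-of-custody system would also anchor the latest hash somewhere the log's administrators cannot rewrite, for example a signed external timestamping service.

```python
# Minimal sketch of a hash-chained, tamper-evident log: each entry stores the
# hash of the previous entry, so editing or deleting any record breaks
# verification of everything after it.
import hashlib
import json

def _entry_hash(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append(chain: list[dict], payload: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(payload, sort_keys=True)
    chain.append({"payload": body, "prev": prev_hash, "hash": _entry_hash(prev_hash, body)})

def verify(chain: list[dict]) -> bool:
    prev_hash = "0" * 64
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True

if __name__ == "__main__":
    chain: list[dict] = []
    append(chain, {"actor": "admin", "action": "export_report"})
    append(chain, {"actor": "admin", "action": "delete_records"})
    print(verify(chain))                                    # True
    # an administrator rewrites the second entry to hide the deletion
    chain[1]["payload"] = json.dumps({"actor": "admin", "action": "export_report"})
    print(verify(chain))                                    # False: tampering detected
```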
