
Load Balancer Issues, Incidents, and Mitigation Strategies #

Performance & Scalability Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Throughput saturation | S: E-commerce platform during flash sale<br>T: Handle 20x normal traffic volume<br>A: Load balancer reached packet processing limits<br>R: Dropped connections and failed purchases despite backend capacity | - Fixed capacity planning<br>- Single-instance design<br>- Undersized hardware/instances<br>- CPU-bound processing<br>- Limited network capacity | Scalable Load Balancing Pattern with horizontal scaling and capacity headroom | - Black Friday e-commerce outages<br>- AWS ELB scaling limitations during spikes<br>- Gaming platform launch failures |
| Connection exhaustion | S: Video streaming service during major event<br>T: Support millions of concurrent connections<br>A: Load balancer exhausted connection table capacity<br>R: New connection failures while existing streams continued | - Fixed connection table limits<br>- Long-lived connections<br>- Incomplete connection lifecycle management<br>- Missing connection limits per client<br>- Conservative resource allocation | Connection Management Pattern with dynamic table sizing and client limiting | - Sporting event streaming failures<br>- F5 connection table exhaustion<br>- Netflix connection limit incidents |
| SSL/TLS processing bottlenecks | S: Banking platform with high security requirements<br>T: Maintain performance while using strong encryption<br>A: SSL/TLS handshakes consumed excessive CPU during traffic spike<br>R: High latency and connection timeouts | - Software-based TLS processing<br>- Missing session caching<br>- Expensive cipher suites<br>- Full handshakes for all connections<br>- Insufficient crypto acceleration | Optimized TLS Pattern with hardware acceleration and session reuse | - HTTPS performance degradation during peaks<br>- TLS 1.3 migration performance impacts<br>- Financial service encryption bottlenecks |
| Uneven scaling across tiers | S: Multi-layer load balancing architecture<br>T: Scale frontend and backend load balancers proportionally<br>A: Frontend scaled but backend load balancers did not<br>R: Internal bottlenecks despite external capacity | - Independent scaling mechanisms<br>- Tier-specific monitoring<br>- Manual scaling processes<br>- Inconsistent scaling metrics<br>- Mismatched capacity planning | Coordinated Scaling Pattern with cross-tier metrics and proportional scaling | - Internal API gateway bottlenecks<br>- AWS ALB-to-NLB scaling mismatches<br>- Service mesh proxy scaling inconsistencies |

Health Checking & Backend Management Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Inadequate health checks | S: Payment processing service<br>T: Ensure traffic routed only to healthy backends<br>A: Simple TCP checks missed application-level failures<br>R: Transactions routed to malfunctioning servers causing failures | - Connection-only health checking<br>- Missing functional verification<br>- Shallow health indicators<br>- Infrequent checking<br>- Binary (healthy/unhealthy) view | Deep Health Checking Pattern with application-level verification | - Payment gateway routing to failed servers<br>- HAProxy backend failures despite “UP” status<br>- NGINX basic check limitations |
| Thundering herd on recovery | S: Web service after backend maintenance<br>T: Smoothly restore traffic to recovered backends<br>A: All traffic immediately routed to new backends<br>R: Recovered servers overwhelmed, causing secondary outage | - Binary backend status<br>- Immediate full restoration<br>- Missing warm-up periods<br>- Connection-count unawareness<br>- Traffic flood on status change | Gradual Recovery Pattern with controlled traffic ramping | - Database connection pool exhaustion after maintenance<br>- Cache server overload after restarts<br>- Application server CPU spikes after deployment |
| Flapping detection failures | S: Microservice platform with unstable component<br>T: Prevent traffic to repeatedly failing backends<br>A: Backend rapidly cycled between healthy/unhealthy<br>R: Connection errors and latency spikes from constant rebalancing | - Simplistic health thresholds<br>- Missing flap detection<br>- Immediate state changes<br>- Short health check memory<br>- Aggressive health checking | Hysteresis Detection Pattern with state dampening and stabilization periods | - Service mesh circuit breaker flapping<br>- F5 pool member oscillation<br>- Kubernetes readiness probe flapping |
| Slow backend detection | S: Database-backed web application<br>T: Identify and avoid underperforming backends<br>A: Load balancer failed to detect gradually slowing servers<br>R: Overall service degradation as traffic continued to slow instances | - Binary health model<br>- Missing performance metrics<br>- Response time blindness<br>- Availability-only focus<br>- Aggregate-only monitoring | Performance-Aware Balancing Pattern with latency-based routing and ejection | - Database read replica performance degradation<br>- AWS ELB sending traffic to overwhelmed instances<br>- Content delivery performance inconsistencies |
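
The patterns in this table map fairly directly onto proxy configuration. As a minimal, illustrative HAProxy sketch (the backend addresses and the /health endpoint are assumptions, not a prescribed setup), deep health checks, flap dampening, and gradual recovery can all be expressed on the backend and server lines:

    backend app
        # Deep health check: verify an application-level endpoint, not just a TCP connect
        option httpchk GET /health
        http-check expect status 200
        # rise/fall require several consecutive results before changing state (hysteresis),
        # and slowstart ramps traffic back to a recovered server instead of flooding it
        server web1 10.0.0.11:8080 check inter 5s fall 3 rise 3 slowstart 60s
        server web2 10.0.0.12:8080 check inter 5s fall 3 rise 3 slowstart 60s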

Traffic Distribution & Routing Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Session persistence failures | S: E-commerce shopping cart application<br>T: Maintain user session on same backend<br>A: Session affinity mechanism failed during scaling event<br>R: Cart abandonment due to lost sessions and frustrated users | - Cookie-based persistence only<br>- Missing fallback mechanisms<br>- Persistence table limitations<br>- Aggressive timeout policies<br>- Backend-stored session state | Multi-mechanism Persistence Pattern with layered affinity strategies | - Shopping cart session loss incidents<br>- Banking application authentication loops<br>- Gaming platform inventory inconsistencies |
| Suboptimal load distribution | S: Content delivery application<br>T: Distribute load evenly across backend capacity<br>A: Round-robin distribution led to backend hotspots<br>R: Some servers overloaded while others underutilized | - Simplistic distribution algorithms<br>- Missing backend telemetry<br>- Request cost unawareness<br>- Uniform request assumption<br>- Connection-focused balancing | Weighted Response Balancing Pattern with cost-aware distribution | - API backend hotspots despite balancing<br>- Database read replica load imbalance<br>- Distributed cache node utilization skew |
| Geographical routing errors | S: Global application with regional deployments<br>T: Route users to nearest geographical deployment<br>A: IP-based geolocation routed users to distant regions<br>R: Unnecessary latency and poor user experience | - IP-only geolocation<br>- Static routing tables<br>- Missing client feedback<br>- Regional capacity ignorance<br>- Outdated geolocation data | Dynamic Geolocation Pattern with performance-based decisions and client input | - CDN regional routing issues<br>- Global DNS load balancing inaccuracies<br>- Mobile application regional routing problems |
| Layer 7 routing regression | S: Microservice platform with path-based routing<br>T: Route requests to correct service based on URL path<br>A: Regular expression rule regression caused misrouting<br>R: Requests sent to wrong backends, causing errors | - Complex routing rules<br>- Manual rule management<br>- Missing rule verification<br>- Order-dependent rules<br>- Pattern overlap conflicts | Verified Routing Pattern with automated rule validation and testing | - Kubernetes Ingress routing conflicts<br>- NGINX location block precedence issues<br>- HAProxy ACL rule conflicts |
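
As one concrete illustration of the persistence and distribution patterns above, a minimal HAProxy sketch (server names, addresses, and the cookie name are assumptions) combining cookie-based affinity with a least-connections fallback when no valid cookie is present:

    backend app
        # Least-connections reduces hotspots when requests have uneven cost
        balance leastconn
        # Cookie-based session affinity; requests without a valid cookie fall back to the balance algorithm
        cookie SERVERID insert indirect nocache
        server web1 10.0.0.11:8080 check cookie w1
        server web2 10.0.0.12:8080 check cookie w2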

Availability & Resilience Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Single point of failure | S: Enterprise application infrastructure<br>T: Maintain continuous availability<br>A: Primary load balancer failed with no automatic failover<br>R: Complete service outage despite redundant backends | - Single load balancer instance<br>- Manual failover procedures<br>- Active-passive only design<br>- Shared fate with management plane<br>- Configuration synchronization issues | Distributed Load Balancing Pattern with active-active deployment and autonomous operation | - Single load balancer failure outages<br>- F5 active-standby failover issues<br>- AWS NLB availability zone isolation failures |
| Failover mechanism failures | S: Financial trading platform<br>T: Automatically recover from zone failure<br>A: Failover mechanism failed to activate properly<br>R: Extended outage requiring manual intervention | - Untested failover paths<br>- Complex failover conditions<br>- Missing failover monitoring<br>- Partial state transfer<br>- Split-brain possibilities | Practiced Failover Pattern with regular testing and simplified mechanisms | - DNS failover propagation delays<br>- Global load balancer failover complexity<br>- Float IP takeover failures |
| Configuration synchronization issues | S: High-availability load balancer pair<br>T: Maintain consistent configuration across HA pair<br>A: Configuration drift between primary and secondary<br>R: Service disruption during failover due to inconsistent state | - Asynchronous config updates<br>- Manual configuration changes<br>- Missing sync verification<br>- Config push failures<br>- Incomplete state replication | Atomic Configuration Pattern with verified synchronization and consistency checks | - F5 configuration sync failures<br>- NGINX Plus state synchronization issues<br>- HAProxy configuration inconsistencies |
| Degraded mode operation failures | S: E-commerce infrastructure partial outage<br>T: Maintain core functions during component failures<br>A: Load balancer removed all instances during partial backend failure<br>R: Complete outage instead of degraded service | - All-or-nothing availability model<br>- Missing graceful degradation<br>- Strict health requirements<br>- Backend interdependencies<br>- Cascading failure patterns | Graceful Degradation Pattern with tiered availability and functionality reduction | - E-commerce complete outages during partial failures<br>- Content site unavailability during backend degradation<br>- Banking service hard failures instead of reduced functionality |
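
A minimal NGINX sketch of the Graceful Degradation idea (addresses, the backup server, and the static fallback page are assumptions): keep a last-resort backend and serve a reduced-functionality page rather than failing hard when the primaries are unhealthy:

    upstream app {
        server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
        server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
        server 10.0.0.20:8080 backup;   # only used when all primary servers are marked failed
    }

    server {
        listen 80;
        location / {
            proxy_pass http://app;
            # Retry another upstream on errors, then fall back to a static "degraded mode" page
            proxy_next_upstream error timeout http_502 http_503 http_504;
            proxy_intercept_errors on;
            error_page 502 503 504 = /degraded.html;
        }
        location = /degraded.html {
            root /usr/share/nginx/html;
        }
    }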

Security Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| TLS configuration vulnerabilities | S: Financial services web application<br>T: Secure sensitive customer transactions<br>A: Load balancer configured with weak cipher suites<br>R: Vulnerability to TLS downgrade attacks | - Legacy protocol support<br>- Weak cipher suites<br>- Missing security headers<br>- Outdated TLS versions<br>- Default configurations | TLS Hardening Pattern with regular security assessment and baseline enforcement | - BEAST and POODLE vulnerability exposure<br>- PCI compliance failures from TLS misconfigurations<br>- Healthcare data protection regulation violations |
| DDoS protection deficiencies | S: Public-facing web service<br>T: Maintain availability during attack<br>A: Volumetric attack overwhelmed load balancer capacity<br>R: Service unavailability affecting legitimate users | - Limited capacity planning<br>- Missing traffic filtering<br>- Lack of rate limiting<br>- Inadequate traffic analysis<br>- Single-tier protection | Defense-in-Depth DDoS Pattern with multi-level protections and traffic scrubbing | - Major gaming platform DDoS outages<br>- Financial services availability SLA violations<br>- Media site unavailability during attacks |
| SSL/TLS termination data exposure | S: Healthcare patient portal<br>T: Protect sensitive data in transit<br>A: Unencrypted traffic after load balancer exposed PHI<br>R: Compliance violation and data protection failure | - TLS termination without re-encryption<br>- Clear-text internal traffic<br>- Security boundary confusion<br>- Missing internal controls<br>- End-to-end encryption gaps | End-to-End Encryption Pattern with internal traffic protection and sensitive data awareness | - Healthcare data exposure incidents<br>- Financial transaction cleartext exposure<br>- Personal data protection regulation violations |
| Load balancer access controls | S: Enterprise management infrastructure<br>T: Restrict load balancer administration<br>A: Weak admin credentials led to unauthorized configuration changes<br>R: Service disruption from malicious rule modifications | - Default/weak credentials<br>- Unnecessary access exposure<br>- Missing MFA<br>- Excessive permissions<br>- Inadequate access logging | Least Privilege Management Pattern with strong authentication and authorization controls | - Network device compromise via management interfaces<br>- Unauthorized load balancer rule changes<br>- Admin interface exposure on public networks |
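
For the End-to-End Encryption Pattern in particular, a minimal NGINX sketch (certificate paths, the internal CA, and the backend address are assumptions) of terminating TLS at the load balancer and re-encrypting toward the backend instead of forwarding cleartext:

    server {
        listen 443 ssl;
        ssl_certificate     /etc/nginx/tls/frontend.crt;
        ssl_certificate_key /etc/nginx/tls/frontend.key;

        location / {
            # Re-encrypt to the backend and verify its certificate against an internal CA
            proxy_pass https://10.0.0.11:8443;
            proxy_ssl_verify on;
            proxy_ssl_trusted_certificate /etc/nginx/tls/internal-ca.crt;
            proxy_ssl_server_name on;
        }
    }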

Configuration & Operational Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Configuration complexity | S: Large enterprise load balancer infrastructure<br>T: Manage complex routing rules across application portfolio<br>A: Rule complexity led to conflicting configurations<br>R: Unexpected routing behavior and service disruptions | - Accumulated configuration cruft<br>- Multiple administrator changes<br>- Missing documentation<br>- Configuration sprawl<br>- Lack of modular design | Configuration as Code Pattern with version control and automated validation | - F5 iRule complexity causing outages<br>- NGINX configuration conflicts<br>- HAProxy ACL rule interaction bugs |
| Certificate management failures | S: E-commerce platform handling customer payments<br>T: Maintain valid TLS certificates<br>A: TLS certificate expiration went unnoticed<br>R: Complete payment processing outage | - Manual certificate processes<br>- Missing expiration monitoring<br>- Certificate sprawl<br>- Siloed responsibility<br>- Inadequate renewal automation | Automated Certificate Pattern with lifecycle management and monitoring | - Major website outages from certificate expirations<br>- Load balancer TLS errors during expired certificate usage<br>- Certificate chain validation failures |
| Change management incidents | S: Business-critical application infrastructure<br>T: Update load balancer configuration safely<br>A: Untested configuration change caused routing errors<br>R: Application unavailability during business hours | - Direct production changes<br>- Missing staging environment<br>- Inadequate testing<br>- Manual change processes<br>- Insufficient rollback planning | Progressive Deployment Pattern with canary testing and automated rollback | - Global outages from load balancer changes<br>- Banking service disruption during planned updates<br>- E-commerce unavailability from routing rule changes |
| Capacity planning failures | S: Retail platform during seasonal peak<br>T: Scale load balancing capacity for holiday shopping<br>A: Insufficient capacity planning led to resource exhaustion<br>R: Throttled connections and degraded user experience | - Historical-only capacity planning<br>- Missing headroom buffers<br>- Reactive scaling processes<br>- Inadequate load testing<br>- Fixed capacity assumptions | Predictive Capacity Pattern with trend analysis and proactive scaling | - Black Friday retail platform failures<br>- Streaming service degradation during popular events<br>- Ticketing system failures during high-demand sales |
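
Two small, commonly used building blocks for the Configuration as Code and Automated Certificate patterns above (paths and the reload hook are placeholders, assuming NGINX/HAProxy and certbot are in use):

    # Validate a candidate configuration in CI or a pre-deploy hook before it reaches production
    nginx -t -c /srv/candidate/nginx.conf
    haproxy -c -f /srv/candidate/haproxy.cfg

    # Automate certificate renewal and reload the load balancer when a certificate changes
    certbot renew --deploy-hook "systemctl reload nginx"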

Monitoring & Troubleshooting Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Limited visibility | S: Complex application performance issue<br>T: Determine source of increased latency<br>A: Insufficient load balancer metrics obscured bottleneck location<br>R: Extended troubleshooting time and prolonged performance impact | - Basic-only metrics<br>- Missing detailed logging<br>- Aggregate-only statistics<br>- Limited historical data<br>- Insufficient granularity | Comprehensive Telemetry Pattern with detailed metrics and transaction sampling | - Performance root cause analysis delays<br>- Load balancer versus application blame attribution<br>- Network versus application troubleshooting challenges |
| Log management deficiencies | S: Security incident investigation<br>T: Determine source and scope of suspicious activity<br>A: Insufficient load balancer logging hampered investigation<br>R: Incomplete understanding of attack vector and impact | - Minimal logging configuration<br>- Short log retention<br>- Missing security event logging<br>- Storage constraint concerns<br>- Performance impact fears | Security-Focused Logging Pattern with comprehensive audit trails and sufficient retention | - Incomplete security investigation data<br>- Compliance finding for inadequate access logs<br>- Attack forensics limitations from log gaps |
| Alert configuration problems | S: Off-hours infrastructure incident<br>T: Promptly notify team of service degradation<br>A: Missing or misconfigured alerts delayed response<br>R: Extended outage duration and increased business impact | - Missing critical alerts<br>- Alert thresholds too lenient<br>- Alert fatigue from noise<br>- Uncorrelated alert flood<br>- Incomplete alerting coverage | Hierarchical Alerting Pattern with severity-based routing and correlation | - Delayed incident response from missing alerts<br>- Alert storms during partial outages<br>- Missed critical conditions despite monitoring |
| Misleading metrics | S: Customer-reported application slowness<br>T: Validate performance using monitoring systems<br>A: Load balancer metrics showed normal despite issues<br>R: Delayed troubleshooting and resolution | - Aggregate-only metrics<br>- Misleading averages<br>- Missing percentile measurements<br>- Backend-unaware metrics<br>- Client experience blindness | User-Centric Monitoring Pattern with end-to-end measurements and percentile metrics | - “Green” dashboards during user-experienced outages<br>- Load balancer metrics missing critical performance indicators<br>- Delayed incident detection despite monitoring |
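
Much of the visibility problem can be addressed in the access log itself. A minimal NGINX sketch (log path, format name, and allowed network are assumptions) that records per-request and per-upstream timings so load balancer latency can be separated from backend latency:

    http {
        log_format timed '$remote_addr "$request" $status '
                         'req_time=$request_time upstream_time=$upstream_response_time '
                         'upstream=$upstream_addr';
        access_log /var/log/nginx/access.log timed;

        server {
            # Basic connection counters for a metrics scraper (requires the stub_status module)
            location = /nginx_status {
                stub_status;
                allow 10.0.0.0/8;
                deny all;
            }
        }
    }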

Protocol & Traffic Handling Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| WebSocket handling problems | S: Real-time collaboration application<br>T: Support persistent WebSocket connections<br>A: Load balancer terminated WebSockets prematurely<br>R: Frequent disconnections disrupting user collaboration | - Short connection timeouts<br>- HTTP-focused configuration<br>- Missing protocol awareness<br>- Inadequate idle detection<br>- Connection resource constraints | Long-lived Connection Pattern with protocol-aware configuration and management | - Chat application disconnection issues<br>- Real-time collaboration tool instability<br>- WebSocket timeout problems in gaming applications |
| HTTP/2 and HTTP/3 compatibility | S: Web application using modern protocols<br>T: Leverage HTTP/2 for performance benefits<br>A: Load balancer protocol handling created subtle issues<br>R: Degraded performance instead of expected improvements | - Protocol translation problems<br>- Stream prioritization issues<br>- Header compression inefficiencies<br>- Multiplexing limitations<br>- Backward compatibility gaps | Protocol Optimization Pattern with end-to-end protocol preservation | - HTTP/2 performance degradation through load balancers<br>- HTTP/3 deployment challenges<br>- Performance regression from protocol translation |
| Large request handling | S: File upload functionality in web application<br>T: Support large file transfers through load balancer<br>A: Request size limits caused upload failures<br>R: Failed uploads and poor user experience | - Default size limitations<br>- Fixed buffer allocations<br>- Timeout misconfigurations<br>- Missing streaming support<br>- In-memory request processing | Streaming Transfer Pattern with buffer tuning and timeout adjustments | - File upload failures through load balancers<br>- Media transfer timeouts<br>- Large API request failures |
| Header manipulation issues | S: Application requiring client IP preservation<br>T: Maintain original client IP information<br>A: Improper header handling lost source IP data<br>R: Security features and logging relying on IP data failed | - Missing X-Forwarded-For headers<br>- Incorrect header handling<br>- Proxy protocol disabled<br>- Header rewriting issues<br>- Trust boundary confusion | Client Identity Preservation Pattern with consistent header management | - Web application firewall bypasses<br>- Geographic restriction failures<br>- IP-based security control ineffectiveness |
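
A minimal NGINX sketch (upstream name, path, and timeout values are assumptions) showing the two most common fixes from this table: protocol-aware handling of WebSocket upgrades and preservation of the original client IP:

    location /ws/ {
        proxy_pass http://app;
        # Required for WebSocket upgrades through the proxy
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Allow idle WebSocket connections to live longer than the HTTP defaults
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        # Preserve client identity for backends, logging, and IP-based security controls
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }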

Advanced Features & Integration Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Content-based routing failures | S: Multi-tenant SaaS application<br>T: Route requests based on subdomain and content<br>A: Complex content inspection rules had edge case failures<br>R: Requests routed to incorrect tenant environments | - Overly complex inspection rules<br>- Unhandled edge cases<br>- Performance vs. precision tradeoffs<br>- Missing rule validation<br>- Pattern matching limitations | Layered Routing Pattern with progressive refinement and validation | - Multi-tenant environment cross-talk<br>- Subdomain-based routing failures<br>- Application path routing inconsistencies |
| WAF integration issues | S: E-commerce site with integrated WAF<br>T: Protect application from attacks while allowing legitimate traffic<br>A: WAF rule false positives blocked valid transactions<br>R: Lost sales and customer frustration | - Overly aggressive WAF rules<br>- Missing tuning processes<br>- Rule deployment without testing<br>- Inadequate monitoring<br>- All-or-nothing WAF activation | Progressive Security Pattern with phased rule deployment and monitoring | - WAF blocking legitimate customer traffic<br>- False positive security blocks during promotions<br>- Shopping cart abandonment from security interference |
| Service mesh integration | S: Kubernetes platform with service mesh<br>T: Integrate external and mesh load balancing<br>A: Conflicting load balancing decisions between layers<br>R: Inconsistent routing and unpredictable performance | - Overlapping responsibility domains<br>- Multiple balancing layers<br>- Inconsistent algorithms<br>- Session tracking conflicts<br>- Observability gaps between layers | Complementary Balancing Pattern with clear responsibility separation | - Istio and external load balancer conflicts<br>- Linkerd traffic splitting inconsistencies<br>- Kubernetes Ingress and service mesh interaction problems |
| Global server load balancing complexity | S: Multi-region application deployment<br>T: Route users to optimal global region<br>A: GSLB decisions conflicted with CDN and DNS load balancing<br>R: Suboptimal user routing and inconsistent experience | - Multiple routing decision points<br>- Inconsistent health checking<br>- Different routing algorithms<br>- Independent configuration<br>- Lack of coordinated approach | Hierarchical Traffic Management Pattern with coordinated decision making | - Global/local load balancing decision conflicts<br>- CDN and origin load balancer inconsistencies<br>- DNS and HTTP routing layer conflicts |
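
For the Layered Routing Pattern, one way to avoid a single brittle rule set is to resolve the tenant first and keep per-tenant routing trivial. A minimal NGINX sketch (hostnames, upstream names, and addresses are assumptions):

    map $host $tenant_upstream {
        default               shared_pool;
        tenant-a.example.com  tenant_a_pool;
        tenant-b.example.com  tenant_b_pool;
    }

    upstream tenant_a_pool { server 10.0.1.11:8080; }
    upstream tenant_b_pool { server 10.0.2.11:8080; }
    upstream shared_pool   { server 10.0.0.11:8080; }

    server {
        listen 443 ssl;
        location / {
            # The routing decision is a simple lookup, validated with "nginx -t" before deployment
            proxy_pass http://$tenant_upstream;
        }
    }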

Infrastructure & Hardware Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Hardware failure handling | S: Physical load balancer appliance deployment<br>T: Maintain availability despite hardware issues<br>A: Degraded hardware performance not detected until failure<br>R: Sudden outage without graceful degradation | - Binary health view<br>- Limited hardware monitoring<br>- Missing early warning detection<br>- Inadequate spare capacity<br>- Optimistic failure planning | Proactive Replacement Pattern with predictive monitoring and scheduled rotation | - Hardware load balancer complete failures<br>- Network interface degradation affecting throughput<br>- Power subsystem partial failures |
| Network infrastructure dependencies | S: Load balancer deployment in cloud environment<br>T: Ensure network capacity for peak traffic<br>A: Underlying network infrastructure limits reached<br>R: Load balancer performance degraded despite available capacity | - Unclear infrastructure limitations<br>- Missing capacity testing<br>- Inadequate network monitoring<br>- Abstracted dependency visibility<br>- Fixed network allocations | Infrastructure-Aware Scaling Pattern with comprehensive dependency mapping | - Cloud load balancer network throughput limitations<br>- Virtual network bottlenecks<br>- Bandwidth constraint issues during peaks |
| Asymmetric routing | S: Multi-path network environment<br>T: Balance traffic across redundant links<br>A: Return traffic took different path than request<br>R: Connection failures and performance issues | - Inconsistent routing tables<br>- Equal-cost multi-path issues<br>- Source routing limitations<br>- Connection tracking gaps<br>- Stateful inspection problems | Symmetric Flow Pattern with flow pinning and consistent routing | - Stateful firewall connection failures<br>- Multi-homed server connection issues<br>- Load balancer cluster state inconsistencies |
| Virtualization layer limitations | S: Virtualized load balancer deployment<br>T: Achieve physical performance in virtual environment<br>A: Hypervisor contention limited performance<br>R: Inconsistent performance despite adequate virtual resources | - Shared resource contention<br>- Noisy neighbor problems<br>- Missing resource guarantees<br>- Hypervisor overhead<br>- Network I/O limitations | Resource Isolation Pattern with guaranteed resources and performance testing | - Virtual load balancer performance inconsistency<br>- Virtual appliance throughput limitations<br>- Hypervisor scheduling impact on packet processing |

Cloud & Distributed Architecture Issues #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Multi-cloud consistency | S: Application deployed across cloud providers<br>T: Maintain consistent load balancing across environments<br>A: Different load balancer capabilities created inconsistencies<br>R: Different user experience depending on routing destination | - Provider-specific implementations<br>- Feature disparity between clouds<br>- Configuration drift<br>- Independent management<br>- Missing cross-cloud visibility | Abstraction Layer Pattern with normalized configuration and unified management | - Hybrid cloud deployment inconsistencies<br>- Multi-cloud failover problems<br>- Configuration synchronization challenges |
| Autoscaling integration | S: Cloud-native application with elastic scaling<br>T: Coordinate backend scaling with load balancer configuration<br>A: Load balancer unaware of autoscaling group changes<br>R: Traffic not distributed to new instances, wasting capacity | - Manual scaling notification<br>- Delayed registration<br>- Missing scaling hooks<br>- Inconsistent health checking<br>- Registration race conditions | Integrated Scaling Pattern with coordinated scaling events and readiness checks | - AWS ELB/ASG registration delays<br>- Kubernetes service endpoint update latency<br>- Azure VMSS and load balancer coordination issues |
| Container environment challenges | S: Containerized application deployment<br>T: Load balance ephemeral container instances<br>A: Rapid container churn overwhelmed load balancer updates<br>R: Routing errors and connection interruptions | - Slow configuration updates<br>- Instance registration overhead<br>- IP address reuse issues<br>- Short-lived backend challenges<br>- Update propagation delays | Ephemeral Endpoint Pattern with optimized registration and fast updates | - Kubernetes pod churn affecting load balancers<br>- Docker container IP reuse problems<br>- Service mesh data plane update storms |
| Cost optimization balance | S: Cloud infrastructure cost management<br>T: Optimize load balancer costs while maintaining capacity<br>A: Aggressive cost optimization led to insufficient capacity<br>R: Performance degradation during traffic spikes | - Cost-focused right-sizing<br>- Inadequate peak planning<br>- Missing cost-performance | | |

Solutions #

Let’s break down how to implement the canonical solution patterns for the Performance & Scalability issues on common load balancing platforms (cloud provider load balancers, NGINX, HAProxy).

Remember, load balancer configuration is typically declarative (YAML, text files) rather than procedural code like Java or Rust. However, understanding these concepts is crucial for building robust systems.


Scalable Load Balancing Pattern #

  • Problem Solved: Throughput saturation (Packet Processing, Bandwidth, CPU limits on the LB itself).

  • Core Idea: Don’t rely on a single, fixed-size load balancer. Distribute the load balancing function itself horizontally and ensure ample capacity.

  • Implementation:

    • Cloud Providers (AWS ELB/ALB/NLB, Azure Load Balancer, GCP Cloud Load Balancing):

      • How: These services are designed for this pattern. They are managed services that automatically scale their underlying capacity based on traffic demand. You don’t manage individual LB instances.
      • Configuration/APIs:
        • Choose the Right Type:
          • AWS: Use Network Load Balancer (NLB) for highest network throughput (millions of PPS), TCP/TLS traffic, and static IPs. Use Application Load Balancer (ALB) for HTTP/S routing, path-based routing, WAF integration, etc. (scales very well but has different capacity characteristics than NLB).
          • Azure: Standard Load Balancer scales automatically. Choose based on L4 vs L7 needs (Application Gateway for L7).
          • GCP: Choose the appropriate type (e.g., Global External HTTP(S) LB, Regional Network LB); scaling is handled by the managed service.
        • Backend Autoscaling: Ensure your backend instances (EC2, VMs, Containers, Functions) are managed by an Autoscaling Group (ASG) or equivalent (Managed Instance Group, VM Scale Set). The load balancer automatically adds/removes targets as the group scales.
        • Monitoring: Monitor the load balancer’s own metrics (e.g., AWS ConsumedLCUs for ALB, ActiveFlowCount for NLB, Azure SNAT Connection Count, GCP LB metrics) to understand usage, though scaling is generally automatic up to service limits. Request limit increases if needed.
        • Headroom: Configure backend ASG policies to scale proactively based on metrics like CPU, request count per instance, or queue depth, ensuring there’s always spare backend capacity before the load balancer needs to throttle.
    • Self-Managed (NGINX, HAProxy):

      • How: You need to run multiple instances of NGINX/HAProxy and distribute traffic to them.
      • Configuration/APIs:
        • Distribution Layer:
          • DNS Round Robin: Simplest but uneven distribution and slow failover.
          • Cloud Network LB: Place an AWS NLB, Azure Standard LB (L4), or GCP Network LB in front of your NGINX/HAProxy instances. The cloud LB handles scaling and distribution to your LB tier.
          • BGP/ECMP (On-Prem/Advanced): Use routing protocols to advertise the same IP from multiple LB instances, letting routers distribute traffic (requires network team involvement).
        • High Availability (HA): Use tools like keepalived (VRRP) for active/passive or active/active setups if not using a higher-level distribution method like a cloud NLB.
        • Instance Scaling: Run your NGINX/HAProxy instances within an Autoscaling Group (if in the cloud) so the LB tier itself can scale.
        • Configuration Example (Conceptual - using Cloud NLB):
          [ Internet ] -> [ AWS NLB / Azure LB / GCP Network LB ] -> [ ASG of NGINX/HAProxy Instances ] -> [ Backend App Servers ]
          
        • NGINX/HAProxy Config: The individual configs don’t change much for this pattern; the focus is on deploying multiple copies effectively.
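        • Example (keepalived): A minimal active/passive sketch for the keepalived (VRRP) approach mentioned above (interface name, virtual IP, priorities, and password are assumptions, not a prescribed setup):
          # /etc/keepalived/keepalived.conf on the primary NGINX/HAProxy node
          vrrp_instance VI_1 {
              state MASTER            # BACKUP on the standby node
              interface eth0
              virtual_router_id 51
              priority 150            # use a lower priority (e.g., 100) on the standby
              advert_int 1
              authentication {
                  auth_type PASS
                  auth_pass s3cret
              }
              virtual_ipaddress {
                  10.0.0.100/24       # shared VIP that clients connect to
              }
          }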

Connection Management Pattern #

  • Problem Solved: Connection exhaustion (running out of connection table entries, ephemeral ports, or file descriptors).

  • Core Idea: Tune operating system and load balancer limits, manage timeouts effectively, and potentially limit connections per client.

  • Implementation:

    • Cloud Providers:

      • How: Limits are often high but exist. Focus on timeouts and choosing the right LB type.
      • Configuration/APIs:
        • Check Service Quotas: Understand the documented connection limits for your chosen LB type (e.g., AWS ALB/NLB limits, Azure LB limits).
        • Idle Timeouts: Configure appropriate idle timeouts. Longer timeouts hold connections open; shorter timeouts reclaim resources faster but might break long-polling/WebSockets. (e.g., AWS ALB idle_timeout.timeout_seconds, Azure LB Idle Timeout). NLBs are generally better for very long-lived connections.
        • Client IP Preservation: Ensure backend servers see the actual client IP (using Proxy Protocol or X-Forwarded-For) if they need to manage connections based on client.
    • Self-Managed (NGINX, HAProxy):

      • How: Direct control over OS and LB parameters.
      • Configuration/APIs:
        • Operating System Tuning (sysctl.conf, limits.conf):
          • net.core.somaxconn: Max backlog of pending connections (e.g., 65535).
          • net.ipv4.tcp_max_syn_backlog: Max SYN queue size (e.g., 65535).
          • fs.file-max: System-wide max file descriptors (e.g., 200000 or more).
          • nofile (via ulimit or /etc/security/limits.conf or systemd unit file): Max file descriptors per process (e.g., 65535 or 1048576). Connections consume file descriptors.
        • NGINX (nginx.conf):
          • worker_processes auto; (Usually set to number of CPU cores).
          • worker_rlimit_nofile 65535; (Set worker process file descriptor limit).
          • events { worker_connections 16384; } (Max connections per worker. Total ≈ worker_processes * worker_connections, limited by nofile).
          • keepalive_timeout 65; (Adjust idle connection timeout).
          • keepalive_requests 1000; (Max requests per keepalive connection).
          • client_header_timeout, client_body_timeout, send_timeout: Tune client timeouts.
          • Client Limiting:
            http {
                limit_conn_zone $binary_remote_addr zone=addr:10m; # Define zone based on client IP
                server {
                    ...
                    location / {
                        limit_conn addr 10; # Limit to 10 concurrent connections per IP
                    }
                }
            }
            
        • HAProxy (haproxy.cfg):
          • global:
            • maxconn 100000 (Global max concurrent connections).
            • tune.maxaccept 100 (How many connections to accept at once).
            • ulimit-n 200050 (Set nofile limit directly).
          • defaults or frontend:
            • maxconn 50000 (Per-frontend limit).
            • timeout client 30s (Adjust client idle timeout).
            • timeout server 30s (Adjust server idle timeout).
            • timeout connect 5s (Backend connection attempt timeout).
          • Client Limiting:
            backend app
                stick-table type ip size 1m expire 30s store conn_cur # Store current connections per IP
                tcp-request connection track-sc1 src
                tcp-request connection reject if { sc1_conn_cur ge 10 } # Reject if IP has >= 10 connections
            

Optimized TLS Pattern #

  • Problem Solved: SSL/TLS processing bottlenecks (high CPU usage during handshakes).

  • Core Idea: Use hardware acceleration where possible, optimize cipher selection, and enable session reuse mechanisms.

  • Implementation:

    • Cloud Providers:

      • How: Leverage built-in optimizations and hardware acceleration.
      • Configuration/APIs:
        • Hardware Acceleration: Usually handled automatically by the managed service.
        • Security Policies: Select predefined TLS Security Policies (e.g., AWS ELBSecurityPolicy-TLS-1-2-Ext-2018-06, Azure App Gateway SSL Policies) that balance security and performance. Avoid legacy policies unless absolutely required. Check cipher suites included in the policy.
        • Session Resumption: Typically enabled by default (Session IDs/Tickets). Verify in documentation if specific controls exist.
        • Protocols: Ensure modern protocols like TLS 1.2 and TLS 1.3 are enabled in the policy.
    • Self-Managed (NGINX, HAProxy):

      • How: Explicit configuration of ciphers, protocols, session caching, and potentially linking against hardware acceleration libraries.
      • Configuration/APIs:
        • Hardware Acceleration: If your hardware supports it (e.g., Intel QAT, specialized cards), you may need specific builds of NGINX/HAProxy or OpenSSL compiled with engine support. Consult hardware/software documentation.
        • NGINX (nginx.conf - within server block listen ... ssl):
          ssl_protocols TLSv1.2 TLSv1.3;
          ssl_prefer_server_ciphers on;
          # Example: Modern cipher suite prioritizing AEAD ciphers
          ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384';
          ssl_ecdh_curve prime256v1:secp384r1; # Efficient curves
          
          # Session Cache (Session IDs)
          ssl_session_cache shared:SSL:10m; # 10MB cache (~40,000 sessions)
          ssl_session_timeout 10m;
          
          # Session Tickets (Alternative/complementary, keys need rotation)
          ssl_session_tickets on;
          # ssl_session_ticket_key /path/to/ticket.key; # Needs secure generation and rotation
          
        • HAProxy (haproxy.cfg - within bind line):
          # global section (per-frontend overrides can be set directly on the bind line)
          tune.ssl.default-dh-param 2048
          ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305
          ssl-default-bind-options ssl-min-ver TLSv1.2 # Enforce minimum TLS version
          
          frontend myfrontend
             bind *:443 ssl crt /path/to/cert.pem alpn h2,http/1.1 # Add ciphers/options here if not default
             # Session resumption enabled by default, session cache automatic
          

Coordinated Scaling Pattern #

  • Problem Solved: Uneven scaling across tiers in a multi-layer LB architecture (e.g., external LB scales but internal LB doesn’t, creating a bottleneck).

  • Core Idea: Link or coordinate the scaling actions across different tiers based on relevant, potentially cross-tier, metrics.

  • Implementation: (This is more architectural and automation-focused than specific LB config)

    • Cloud Providers:

      • How: Use cloud monitoring and autoscaling features, potentially with custom logic.
      • Configuration/APIs:
        • Tiered Monitoring: Set up monitoring (CloudWatch, Azure Monitor, Google Cloud Monitoring) for key metrics at each tier (e.g., Edge LB connections/throughput, Internal LB active connections/latency, App Server CPU/Memory/Queue).
        • Independent Scaling: Ensure each tier (Edge LB Target Group ASG, Internal LB Target Group ASG, App Tier ASG) has its own autoscaling policies based on its most relevant metrics.
        • Cross-Tier Triggers (Advanced):
          • Use custom metrics or Lambda/Functions triggered by alarms. Example: If Edge LB ActiveFlowCount > Threshold AND Internal LB TargetResponseTime > Threshold, trigger scaling of the Internal LB’s ASG or even the App Tier’s ASG.
          • AWS: CloudWatch Metric Math can create combined metrics for alarms. Step Functions or Lambda can implement more complex scaling logic.
          • Azure: Azure Monitor action groups can trigger Azure Functions or Logic Apps.
          • GCP: Cloud Monitoring alerting can trigger Cloud Functions or Pub/Sub.
        • Capacity Headroom: Over-provision slightly or set scaling thresholds lower at downstream tiers so they can react before becoming bottlenecks when an upstream tier scales up traffic.
    • Self-Managed (NGINX, HAProxy) + Orchestration (Kubernetes, etc.):

      • How: Combine LB metrics with orchestration platform scaling.
      • Configuration/APIs:
        • Metrics Exposure: Ensure your LBs expose detailed metrics (e.g., using nginx-prometheus-exporter, HAProxy’s built-in Prometheus endpoint, or stats socket).
        • Monitoring System: Scrape metrics using Prometheus or similar.
        • Kubernetes HPA: Use Horizontal Pod Autoscalers.
          • Scale LB deployments based on their own CPU/Memory or custom metrics (e.g., active connections scraped by Prometheus Adapter).
          • Scale backend application deployments based on their CPU/Memory, custom metrics (e.g., RPS per pod), or external metrics (like queue length).
        • Custom Controllers/Operators: For complex coordination logic (similar to the cloud Lambda approach), a custom Kubernetes operator might observe metrics across different deployments/services and adjust HPA targets or replica counts directly.
        • Service Mesh: If using a service mesh (Istio, Linkerd), leverage its observability and potentially traffic splitting capabilities to manage scaling and load distribution internally. The external LB still needs to scale appropriately.
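        • Example (Kubernetes HPA): A minimal HPA sketch for scaling a self-managed LB tier (the Deployment name, replica bounds, and CPU target are assumptions; custom or external metrics would require a metrics adapter):
          apiVersion: autoscaling/v2
          kind: HorizontalPodAutoscaler
          metadata:
            name: nginx-lb
          spec:
            scaleTargetRef:
              apiVersion: apps/v1
              kind: Deployment
              name: nginx-lb        # the Deployment running the NGINX/HAProxy tier
            minReplicas: 3
            maxReplicas: 20
            metrics:
            - type: Resource
              resource:
                name: cpu
                target:
                  type: Utilization
                  averageUtilization: 60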
