Load Balancer Issues, Incidents, and Mitigation Strategies #
Performance & Scalability Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Throughput saturation | S: E-commerce platform during flash sale T: Handle 20x normal traffic volume A: Load balancer reached packet processing limits R: Dropped connections and failed purchases despite backend capacity | - Fixed capacity planning - Single-instance design - Undersized hardware/instances - CPU-bound processing - Limited network capacity | Scalable Load Balancing Pattern with horizontal scaling and capacity headroom | - Black Friday e-commerce outages - AWS ELB scaling limitations during spikes - Gaming platform launch failures |
Connection exhaustion | S: Video streaming service during major event T: Support millions of concurrent connections A: Load balancer exhausted connection table capacity R: New connection failures while existing streams continued | - Fixed connection table limits - Long-lived connections - Incomplete connection lifecycle management - Missing connection limits per client - Conservative resource allocation | Connection Management Pattern with dynamic table sizing and client limiting | - Sporting event streaming failures - F5 connection table exhaustion - Netflix connection limit incidents |
SSL/TLS processing bottlenecks | S: Banking platform with high security requirements T: Maintain performance while using strong encryption A: SSL/TLS handshakes consumed excessive CPU during traffic spike R: High latency and connection timeouts | - Software-based TLS processing - Missing session caching - Expensive cipher suites - Full handshakes for all connections - Insufficient crypto acceleration | Optimized TLS Pattern with hardware acceleration and session reuse | - HTTPS performance degradation during peaks - TLS 1.3 migration performance impacts - Financial service encryption bottlenecks |
Uneven scaling across tiers | S: Multi-layer load balancing architecture T: Scale frontend and backend load balancers proportionally A: Frontend scaled but backend load balancers did not R: Internal bottlenecks despite external capacity | - Independent scaling mechanisms - Tier-specific monitoring - Manual scaling processes - Inconsistent scaling metrics - Mismatched capacity planning | Coordinated Scaling Pattern with cross-tier metrics and proportional scaling | - Internal API gateway bottlenecks - AWS ALB-to-NLB scaling mismatches - Service mesh proxy scaling inconsistencies |
Health Checking & Backend Management Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Inadequate health checks | S: Payment processing service T: Ensure traffic routed only to healthy backends A: Simple TCP checks missed application-level failures R: Transactions routed to malfunctioning servers causing failures | - Connection-only health checking - Missing functional verification - Shallow health indicators - Infrequent checking - Binary (healthy/unhealthy) view | Deep Health Checking Pattern with application-level verification | - Payment gateway routing to failed servers - HAProxy backend failures despite “UP” status - NGINX basic check limitations |
Thundering herd on recovery | S: Web service after backend maintenance T: Smoothly restore traffic to recovered backends A: All traffic immediately routed to new backends R: Recovered servers overwhelmed, causing secondary outage | - Binary backend status - Immediate full restoration - Missing warm-up periods - Connection-count unawareness - Traffic flood on status change | Gradual Recovery Pattern with controlled traffic ramping | - Database connection pool exhaustion after maintenance - Cache server overload after restarts - Application server CPU spikes after deployment |
Flapping detection failures | S: Microservice platform with unstable component T: Prevent traffic to repeatedly failing backends A: Backend rapidly cycled between healthy/unhealthy R: Connection errors and latency spikes from constant rebalancing | - Simplistic health thresholds - Missing flap detection - Immediate state changes - Short health check memory - Aggressive health checking | Hysteresis Detection Pattern with state dampening and stabilization periods | - Service mesh circuit breaker flapping - F5 pool member oscillation - Kubernetes readiness probe flapping |
Slow backend detection | S: Database-backed web application T: Identify and avoid underperforming backends A: Load balancer failed to detect gradually slowing servers R: Overall service degradation as traffic continued to slow instances | - Binary health model - Missing performance metrics - Response time blindness - Availability-only focus - Aggregate-only monitoring | Performance-Aware Balancing Pattern with latency-based routing and ejection | - Database read replica performance degradation - AWS ELB sending traffic to overwhelmed instances - Content delivery performance inconsistencies |
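Several of the patterns in this table (deep health checks, flap dampening, gradual recovery) can be sketched together in HAProxy terms; the backend name, server addresses, and the `/healthz` endpoint are illustrative assumptions, not from the source:

```haproxy
backend payments
    option httpchk GET /healthz          # application-level check, not just a TCP connect
    http-check expect status 200
    # fall/rise add hysteresis against flapping; slowstart ramps traffic to a
    # newly healthy server over 60s, avoiding a thundering herd on recovery
    default-server inter 5s fall 3 rise 2 slowstart 60s
    server app1 10.0.1.11:8080 check
    server app2 10.0.1.12:8080 check
```

The `fall`/`rise` counters require several consecutive failed or passed checks before a state change, which dampens the oscillation described in the flapping row above.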
Traffic Distribution & Routing Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Session persistence failures | S: E-commerce shopping cart application T: Maintain user session on same backend A: Session affinity mechanism failed during scaling event R: Cart abandonment due to lost sessions and frustrated users | - Cookie-based persistence only - Missing fallback mechanisms - Persistence table limitations - Aggressive timeout policies - Backend-stored session state | Multi-mechanism Persistence Pattern with layered affinity strategies | - Shopping cart session loss incidents - Banking application authentication loops - Gaming platform inventory inconsistencies |
Suboptimal load distribution | S: Content delivery application T: Distribute load evenly across backend capacity A: Round-robin distribution led to backend hotspots R: Some servers overloaded while others underutilized | - Simplistic distribution algorithms - Missing backend telemetry - Request cost unawareness - Uniform request assumption - Connection-focused balancing | Weighted Response Balancing Pattern with cost-aware distribution | - API backend hotspots despite balancing - Database read replica load imbalance - Distributed cache node utilization skew |
Geographical routing errors | S: Global application with regional deployments T: Route users to nearest geographical deployment A: IP-based geolocation routed users to distant regions R: Unnecessary latency and poor user experience | - IP-only geolocation - Static routing tables - Missing client feedback - Regional capacity ignorance - Outdated geolocation data | Dynamic Geolocation Pattern with performance-based decisions and client input | - CDN regional routing issues - Global DNS load balancing inaccuracies - Mobile application regional routing problems |
Layer 7 routing regression | S: Microservice platform with path-based routing T: Route requests to correct service based on URL path A: Regular expression rule regression caused misrouting R: Requests sent to wrong backends, causing errors | - Complex routing rules - Manual rule management - Missing rule verification - Order-dependent rules - Pattern overlap conflicts | Verified Routing Pattern with automated rule validation and testing | - Kubernetes Ingress routing conflicts - NGINX location block precedence issues - HAProxy ACL rule conflicts |
Availability & Resilience Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Single point of failure | S: Enterprise application infrastructure T: Maintain continuous availability A: Primary load balancer failed with no automatic failover R: Complete service outage despite redundant backends | - Single load balancer instance - Manual failover procedures - Active-passive only design - Shared fate with management plane - Configuration synchronization issues | Distributed Load Balancing Pattern with active-active deployment and autonomous operation | - Single load balancer failure outages - F5 active-standby failover issues - AWS NLB availability zone isolation failures |
Failover mechanism failures | S: Financial trading platform T: Automatically recover from zone failure A: Failover mechanism failed to activate properly R: Extended outage requiring manual intervention | - Untested failover paths - Complex failover conditions - Missing failover monitoring - Partial state transfer - Split-brain possibilities | Practiced Failover Pattern with regular testing and simplified mechanisms | - DNS failover propagation delays - Global load balancer failover complexity - Float IP takeover failures |
Configuration synchronization issues | S: High-availability load balancer pair T: Maintain consistent configuration across HA pair A: Configuration drift between primary and secondary R: Service disruption during failover due to inconsistent state | - Asynchronous config updates - Manual configuration changes - Missing sync verification - Config push failures - Incomplete state replication | Atomic Configuration Pattern with verified synchronization and consistency checks | - F5 configuration sync failures - NGINX Plus state synchronization issues - HAProxy configuration inconsistencies |
Degraded mode operation failures | S: E-commerce infrastructure partial outage T: Maintain core functions during component failures A: Load balancer removed all instances during partial backend failure R: Complete outage instead of degraded service | - All-or-nothing availability model - Missing graceful degradation - Strict health requirements - Backend interdependencies - Cascading failure patterns | Graceful Degradation Pattern with tiered availability and functionality reduction | - E-commerce complete outages during partial failures - Content site unavailability during backend degradation - Banking service hard failures instead of reduced functionality |
Security Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
TLS configuration vulnerabilities | S: Financial services web application T: Secure sensitive customer transactions A: Load balancer configured with weak cipher suites R: Vulnerability to TLS downgrade attacks | - Legacy protocol support - Weak cipher suites - Missing security headers - Outdated TLS versions - Default configurations | TLS Hardening Pattern with regular security assessment and baseline enforcement | - BEAST and POODLE vulnerability exposure - PCI compliance failures from TLS misconfigurations - Healthcare data protection regulation violations |
DDoS protection deficiencies | S: Public-facing web service T: Maintain availability during attack A: Volumetric attack overwhelmed load balancer capacity R: Service unavailability affecting legitimate users | - Limited capacity planning - Missing traffic filtering - Lack of rate limiting - Inadequate traffic analysis - Single-tier protection | Defense-in-Depth DDoS Pattern with multi-level protections and traffic scrubbing | - Major gaming platform DDoS outages - Financial services availability SLA violations - Media site unavailability during attacks |
SSL/TLS termination data exposure | S: Healthcare patient portal T: Protect sensitive data in transit A: Unencrypted traffic after load balancer exposed PHI R: Compliance violation and data protection failure | - TLS termination without re-encryption - Clear-text internal traffic - Security boundary confusion - Missing internal controls - End-to-end encryption gaps | End-to-End Encryption Pattern with internal traffic protection and sensitive data awareness | - Healthcare data exposure incidents - Financial transaction cleartext exposure - Personal data protection regulation violations |
Load balancer access controls | S: Enterprise management infrastructure T: Restrict load balancer administration A: Weak admin credentials led to unauthorized configuration changes R: Service disruption from malicious rule modifications | - Default/weak credentials - Unnecessary access exposure - Missing MFA - Excessive permissions - Inadequate access logging | Least Privilege Management Pattern with strong authentication and authorization controls | - Network device compromise via management interfaces - Unauthorized load balancer rule changes - Admin interface exposure on public networks |
Configuration & Operational Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Configuration complexity | S: Large enterprise load balancer infrastructure T: Manage complex routing rules across application portfolio A: Rule complexity led to conflicting configurations R: Unexpected routing behavior and service disruptions | - Accumulated configuration cruft - Multiple administrator changes - Missing documentation - Configuration sprawl - Lack of modular design | Configuration as Code Pattern with version control and automated validation | - F5 iRule complexity causing outages - NGINX configuration conflicts - HAProxy ACL rule interaction bugs |
Certificate management failures | S: E-commerce platform handling customer payments T: Maintain valid TLS certificates A: TLS certificate expiration went unnoticed R: Complete payment processing outage | - Manual certificate processes - Missing expiration monitoring - Certificate sprawl - Siloed responsibility - Inadequate renewal automation | Automated Certificate Pattern with lifecycle management and monitoring | - Major website outages from certificate expirations - Load balancer TLS errors during expired certificate usage - Certificate chain validation failures |
Change management incidents | S: Business-critical application infrastructure T: Update load balancer configuration safely A: Untested configuration change caused routing errors R: Application unavailability during business hours | - Direct production changes - Missing staging environment - Inadequate testing - Manual change processes - Insufficient rollback planning | Progressive Deployment Pattern with canary testing and automated rollback | - Global outages from load balancer changes - Banking service disruption during planned updates - E-commerce unavailability from routing rule changes |
Capacity planning failures | S: Retail platform during seasonal peak T: Scale load balancing capacity for holiday shopping A: Insufficient capacity planning led to resource exhaustion R: Throttled connections and degraded user experience | - Historical-only capacity planning - Missing headroom buffers - Reactive scaling processes - Inadequate load testing - Fixed capacity assumptions | Predictive Capacity Pattern with trend analysis and proactive scaling | - Black Friday retail platform failures - Streaming service degradation during popular events - Ticketing system failures during high-demand sales |
Monitoring & Troubleshooting Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Limited visibility | S: Complex application performance issue T: Determine source of increased latency A: Insufficient load balancer metrics obscured bottleneck location R: Extended troubleshooting time and prolonged performance impact | - Basic-only metrics - Missing detailed logging - Aggregate-only statistics - Limited historical data - Insufficient granularity | Comprehensive Telemetry Pattern with detailed metrics and transaction sampling | - Performance root cause analysis delays - Load balancer versus application blame attribution - Network versus application troubleshooting challenges |
Log management deficiencies | S: Security incident investigation T: Determine source and scope of suspicious activity A: Insufficient load balancer logging hampered investigation R: Incomplete understanding of attack vector and impact | - Minimal logging configuration - Short log retention - Missing security event logging - Storage constraint concerns - Performance impact fears | Security-Focused Logging Pattern with comprehensive audit trails and sufficient retention | - Incomplete security investigation data - Compliance finding for inadequate access logs - Attack forensics limitations from log gaps |
Alert configuration problems | S: Off-hours infrastructure incident T: Promptly notify team of service degradation A: Missing or misconfigured alerts delayed response R: Extended outage duration and increased business impact | - Missing critical alerts - Alert thresholds too lenient - Alert fatigue from noise - Uncorrelated alert flood - Incomplete alerting coverage | Hierarchical Alerting Pattern with severity-based routing and correlation | - Delayed incident response from missing alerts - Alert storms during partial outages - Missed critical conditions despite monitoring |
Misleading metrics | S: Customer-reported application slowness T: Validate performance using monitoring systems A: Load balancer metrics showed normal despite issues R: Delayed troubleshooting and resolution | - Aggregate-only metrics - Misleading averages - Missing percentile measurements - Backend-unaware metrics - Client experience blindness | User-Centric Monitoring Pattern with end-to-end measurements and percentile metrics | - “Green” dashboards during user-experienced outages - Load balancer metrics missing critical performance indicators - Delayed incident detection despite monitoring |
Protocol & Traffic Handling Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
WebSocket handling problems | S: Real-time collaboration application T: Support persistent WebSocket connections A: Load balancer terminated WebSockets prematurely R: Frequent disconnections disrupting user collaboration | - Short connection timeouts - HTTP-focused configuration - Missing protocol awareness - Inadequate idle detection - Connection resource constraints | Long-lived Connection Pattern with protocol-aware configuration and management | - Chat application disconnection issues - Real-time collaboration tool instability - WebSocket timeout problems in gaming applications |
HTTP/2 and HTTP/3 compatibility | S: Web application using modern protocols T: Leverage HTTP/2 for performance benefits A: Load balancer protocol handling created subtle issues R: Degraded performance instead of expected improvements | - Protocol translation problems - Stream prioritization issues - Header compression inefficiencies - Multiplexing limitations - Backward compatibility gaps | Protocol Optimization Pattern with end-to-end protocol preservation | - HTTP/2 performance degradation through load balancers - HTTP/3 deployment challenges - Performance regression from protocol translation |
Large request handling | S: File upload functionality in web application T: Support large file transfers through load balancer A: Request size limits caused upload failures R: Failed uploads and poor user experience | - Default size limitations - Fixed buffer allocations - Timeout misconfigurations - Missing streaming support - In-memory request processing | Streaming Transfer Pattern with buffer tuning and timeout adjustments | - File upload failures through load balancers - Media transfer timeouts - Large API request failures |
Header manipulation issues | S: Application requiring client IP preservation T: Maintain original client IP information A: Improper header handling lost source IP data R: Security features and logging relying on IP data failed | - Missing X-Forwarded-For headers - Incorrect header handling - Proxy protocol disabled - Header rewriting issues - Trust boundary confusion | Client Identity Preservation Pattern with consistent header management | - Web application firewall bypasses - Geographic restriction failures - IP-based security control ineffectiveness |
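The client-IP preservation and WebSocket rows above can be sketched as a single NGINX proxy block; the upstream name and timeout values are illustrative assumptions:

```nginx
# Sketch: forward client identity and support long-lived upgrades
location / {
    proxy_pass http://app_backend;
    proxy_set_header Host              $host;
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    # WebSocket support so upgraded connections aren't dropped
    proxy_http_version 1.1;
    proxy_set_header Upgrade    $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;   # keep idle WebSockets open longer than the HTTP default
}
```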
Advanced Features & Integration Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Content-based routing failures | S: Multi-tenant SaaS application T: Route requests based on subdomain and content A: Complex content inspection rules had edge case failures R: Requests routed to incorrect tenant environments | - Overly complex inspection rules - Unhandled edge cases - Performance vs. precision tradeoffs - Missing rule validation - Pattern matching limitations | Layered Routing Pattern with progressive refinement and validation | - Multi-tenant environment cross-talk - Subdomain-based routing failures - Application path routing inconsistencies |
WAF integration issues | S: E-commerce site with integrated WAF T: Protect application from attacks while allowing legitimate traffic A: WAF rule false positives blocked valid transactions R: Lost sales and customer frustration | - Overly aggressive WAF rules - Missing tuning processes - Rule deployment without testing - Inadequate monitoring - All-or-nothing WAF activation | Progressive Security Pattern with phased rule deployment and monitoring | - WAF blocking legitimate customer traffic - False positive security blocks during promotions - Shopping cart abandonment from security interference |
Service mesh integration | S: Kubernetes platform with service mesh T: Integrate external and mesh load balancing A: Conflicting load balancing decisions between layers R: Inconsistent routing and unpredictable performance | - Overlapping responsibility domains - Multiple balancing layers - Inconsistent algorithms - Session tracking conflicts - Observability gaps between layers | Complementary Balancing Pattern with clear responsibility separation | - Istio and external load balancer conflicts - Linkerd traffic splitting inconsistencies - Kubernetes Ingress and service mesh interaction problems |
Global server load balancing complexity | S: Multi-region application deployment T: Route users to optimal global region A: GSLB decisions conflicted with CDN and DNS load balancing R: Suboptimal user routing and inconsistent experience | - Multiple routing decision points - Inconsistent health checking - Different routing algorithms - Independent configuration - Lack of coordinated approach | Hierarchical Traffic Management Pattern with coordinated decision making | - Global/local load balancing decision conflicts - CDN and origin load balancer inconsistencies - DNS and HTTP routing layer conflicts |
Infrastructure & Hardware Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Hardware failure handling | S: Physical load balancer appliance deployment T: Maintain availability despite hardware issues A: Degraded hardware performance not detected until failure R: Sudden outage without graceful degradation | - Binary health view - Limited hardware monitoring - Missing early warning detection - Inadequate spare capacity - Optimistic failure planning | Proactive Replacement Pattern with predictive monitoring and scheduled rotation | - Hardware load balancer complete failures - Network interface degradation affecting throughput - Power subsystem partial failures |
Network infrastructure dependencies | S: Load balancer deployment in cloud environment T: Ensure network capacity for peak traffic A: Underlying network infrastructure limits reached R: Load balancer performance degraded despite available capacity | - Unclear infrastructure limitations - Missing capacity testing - Inadequate network monitoring - Abstracted dependency visibility - Fixed network allocations | Infrastructure-Aware Scaling Pattern with comprehensive dependency mapping | - Cloud load balancer network throughput limitations - Virtual network bottlenecks - Bandwidth constraint issues during peaks |
Asymmetric routing | S: Multi-path network environment T: Balance traffic across redundant links A: Return traffic took different path than request R: Connection failures and performance issues | - Inconsistent routing tables - Equal-cost multi-path issues - Source routing limitations - Connection tracking gaps - Stateful inspection problems | Symmetric Flow Pattern with flow pinning and consistent routing | - Stateful firewall connection failures - Multi-homed server connection issues - Load balancer cluster state inconsistencies |
Virtualization layer limitations | S: Virtualized load balancer deployment T: Achieve physical performance in virtual environment A: Hypervisor contention limited performance R: Inconsistent performance despite adequate virtual resources | - Shared resource contention - Noisy neighbor problems - Missing resource guarantees - Hypervisor overhead - Network I/O limitations | Resource Isolation Pattern with guaranteed resources and performance testing | - Virtual load balancer performance inconsistency - Virtual appliance throughput limitations - Hypervisor scheduling impact on packet processing |
Cloud & Distributed Architecture Issues #
Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
---|---|---|---|---|
Multi-cloud consistency | S: Application deployed across cloud providers T: Maintain consistent load balancing across environments A: Different load balancer capabilities created inconsistencies R: Different user experience depending on routing destination | - Provider-specific implementations - Feature disparity between clouds - Configuration drift - Independent management - Missing cross-cloud visibility | Abstraction Layer Pattern with normalized configuration and unified management | - Hybrid cloud deployment inconsistencies - Multi-cloud failover problems - Configuration synchronization challenges |
Autoscaling integration | S: Cloud-native application with elastic scaling T: Coordinate backend scaling with load balancer configuration A: Load balancer unaware of autoscaling group changes R: Traffic not distributed to new instances, wasting capacity | - Manual scaling notification - Delayed registration - Missing scaling hooks - Inconsistent health checking - Registration race conditions | Integrated Scaling Pattern with coordinated scaling events and readiness checks | - AWS ELB/ASG registration delays - Kubernetes service endpoint update latency - Azure VMSS and load balancer coordination issues |
Container environment challenges | S: Containerized application deployment T: Load balance ephemeral container instances A: Rapid container churn overwhelmed load balancer updates R: Routing errors and connection interruptions | - Slow configuration updates - Instance registration overhead - IP address reuse issues - Short-lived backend challenges - Update propagation delays | Ephemeral Endpoint Pattern with optimized registration and fast updates | - Kubernetes pod churn affecting load balancers - Docker container IP reuse problems - Service mesh data plane update storms |
Cost optimization balance | S: Cloud infrastructure cost management T: Optimize load balancer costs while maintaining capacity A: Aggressive cost optimization led to insufficient capacity R: Performance degradation during traffic spikes | - Cost-focused right-sizing - Inadequate peak planning - Missing cost-performance |
Solutions #
Okay, let’s break down how to implement the canonical solution patterns for the Performance & Scalability issues using common load balancer configurations (like Cloud Providers, NGINX, HAProxy).
Remember, load balancer configuration is typically declarative (YAML, text files) rather than procedural code like Java or Rust. However, understanding these concepts is crucial for building robust systems.
Scalable Load Balancing Pattern #
Problem Solved: Throughput saturation (packet processing, bandwidth, or CPU limits on the load balancer itself).
Core Idea: Don’t rely on a single, fixed-size load balancer. Distribute the load balancing function itself horizontally and ensure ample capacity.
Implementation:
Cloud Providers (AWS ELB/ALB/NLB, Azure Load Balancer, GCP Cloud Load Balancing):
- How: These services are designed for this pattern. They are managed services that automatically scale their underlying capacity based on traffic demand. You don’t manage individual LB instances.
- Configuration/APIs:
- Choose the Right Type:
- AWS: Use Network Load Balancer (NLB) for highest network throughput (millions of PPS), TCP/TLS traffic, and static IPs. Use Application Load Balancer (ALB) for HTTP/S routing, path-based routing, WAF integration, etc. (scales very well but has different capacity characteristics than NLB).
- Azure: Standard Load Balancer scales automatically. Choose based on L4 vs L7 needs (Application Gateway for L7).
- GCP: Choose the appropriate type (e.g., Global External HTTP(S) LB, Regional Network LB); these managed services scale automatically.
- Backend Autoscaling: Ensure your backend instances (EC2, VMs, Containers, Functions) are managed by an Autoscaling Group (ASG) or equivalent (Managed Instance Group, VM Scale Set). The load balancer automatically adds/removes targets as the group scales.
- Monitoring: Monitor the load balancer’s own metrics (e.g., AWS `ConsumedLCUs` for ALB, `ActiveFlowCount` for NLB, Azure `SNAT Connection Count`, GCP LB metrics) to understand usage, though scaling is generally automatic up to service limits. Request limit increases if needed.
- Headroom: Configure backend ASG policies to scale proactively based on metrics like CPU, request count per instance, or queue depth, ensuring there’s always spare backend capacity before the load balancer needs to throttle.
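The headroom advice for backend autoscaling can be sketched as a target-tracking scaling policy. This Terraform fragment is a hedged illustration; the resource names, the 50% CPU target, and the referenced `aws_autoscaling_group.app` are assumptions, not from the source:

```hcl
# Scale the backend ASG proactively so spare capacity exists before
# the load balancer has to throttle.
resource "aws_autoscaling_policy" "keep_headroom" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50   # keep average CPU at ~50%, leaving ~50% headroom for spikes
  }
}
```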
Self-Managed (NGINX, HAProxy):
- How: You need to run multiple instances of NGINX/HAProxy and distribute traffic to them.
- Configuration/APIs:
- Distribution Layer:
- DNS Round Robin: The simplest option, but it distributes traffic unevenly and fails over slowly.
- Cloud Network LB: Place an AWS NLB, Azure Standard LB (L4), or GCP Network LB in front of your NGINX/HAProxy instances. The cloud LB handles scaling and distribution to your LB tier.
- BGP/ECMP (On-Prem/Advanced): Use routing protocols to advertise the same IP from multiple LB instances, letting routers distribute traffic (requires network team involvement).
- High Availability (HA): Use tools like `keepalived` (VRRP) for active/passive or active/active setups if not using a higher-level distribution method like a cloud NLB.
- Instance Scaling: Run your NGINX/HAProxy instances within an Autoscaling Group (if in the cloud) so the LB tier itself can scale.
- Configuration Example (Conceptual - using Cloud NLB):
[ Internet ] -> [ AWS NLB / Azure LB / GCP Network LB ] -> [ ASG of NGINX/HAProxy Instances ] -> [ Backend App Servers ]
- NGINX/HAProxy Config: The individual configs don’t change much for this pattern; the point is to deploy multiple copies effectively.
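For the `keepalived` (VRRP) option mentioned above, a minimal active/passive sketch might look like the following; the interface name, router ID, password, and VIP are illustrative assumptions:

```text
# /etc/keepalived/keepalived.conf — minimal VRRP sketch
vrrp_instance VI_1 {
    state MASTER            # the peer node uses BACKUP
    interface eth0
    virtual_router_id 51
    priority 100            # the peer uses a lower value, e.g. 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cret
    }
    virtual_ipaddress {
        203.0.113.10/24     # floating VIP that clients connect to
    }
}
```

If the MASTER stops sending VRRP advertisements, the BACKUP node claims the VIP, giving fast failover without client-side changes.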
Connection Management Pattern #
Problem Solved: Connection exhaustion (running out of connection table entries, ephemeral ports, or file descriptors).
Core Idea: Tune operating system and load balancer limits, manage timeouts effectively, and potentially limit connections per client.
Implementation:
Cloud Providers:
- How: Limits are often high but exist. Focus on timeouts and choosing the right LB type.
- Configuration/APIs:
- Check Service Quotas: Understand the documented connection limits for your chosen LB type (e.g., AWS ALB/NLB limits, Azure LB limits).
- Idle Timeouts: Configure appropriate idle timeouts. Longer timeouts hold connections open; shorter timeouts reclaim resources faster but might break long-polling/WebSockets (e.g., AWS ALB `idle_timeout.timeout_seconds`, Azure LB Idle Timeout). NLBs are generally better for very long-lived connections.
- Client IP Preservation: Ensure backend servers see the actual client IP (using Proxy Protocol or X-Forwarded-For) if they need to manage connections per client.
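As one concrete, hedged example of setting the idle timeout, the Terraform AWS provider exposes it as an `aws_lb` argument; the resource name and subnet variable are illustrative assumptions:

```hcl
# Sketch: raise the ALB idle timeout for long-polling clients.
resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  subnets            = var.public_subnets
  idle_timeout       = 120   # seconds; the AWS default is 60
}
```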
Self-Managed (NGINX, HAProxy):
- How: Direct control over OS and LB parameters.
- Configuration/APIs:
- Operating System Tuning (`sysctl.conf`, `limits.conf`):
  - `net.core.somaxconn`: Max backlog of pending connections (e.g., `65535`).
  - `net.ipv4.tcp_max_syn_backlog`: Max SYN queue size (e.g., `65535`).
  - `fs.file-max`: System-wide max file descriptors (e.g., `200000` or more).
  - `nofile` (via `ulimit`, `/etc/security/limits.conf`, or a systemd unit file): Max file descriptors per process (e.g., `65535` or `1048576`). Connections consume file descriptors.
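To make these limits persistent, they would typically live in the files below; the values mirror the examples above, and the `nginx` user name is illustrative:

```
# /etc/sysctl.conf — apply with `sysctl -p`
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
fs.file-max = 200000

# /etc/security/limits.conf — per-process file descriptor limit
# for the LB process's user (user name is illustrative)
nginx  soft  nofile  65535
nginx  hard  nofile  65535
```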
- NGINX (`nginx.conf`):
  - `worker_processes auto;` (usually set to the number of CPU cores).
  - `worker_rlimit_nofile 65535;` (sets the worker process file descriptor limit).
  - `events { worker_connections 16384; }` (max connections per worker; total ≈ `worker_processes * worker_connections`, limited by `nofile`).
  - `keepalive_timeout 65;` (adjust idle connection timeout).
  - `keepalive_requests 1000;` (max requests per keepalive connection).
  - `client_header_timeout`, `client_body_timeout`, `send_timeout`: tune client timeouts.
  - Client Limiting:
```nginx
http {
    limit_conn_zone $binary_remote_addr zone=addr:10m;  # Define zone keyed on client IP
    server {
        ...
        location / {
            limit_conn addr 10;  # Limit to 10 concurrent connections per IP
        }
    }
}
```
- HAProxy (`haproxy.cfg`):
  - `global`:
    - `maxconn 100000` (global max concurrent connections).
    - `tune.maxaccept 100` (how many connections to accept at once).
    - `ulimit-n 200050` (sets the nofile limit directly).
  - `defaults` or `frontend`:
    - `maxconn 50000` (per-frontend limit).
    - `timeout client 30s` (adjust client idle timeout).
    - `timeout server 30s` (adjust server idle timeout).
    - `timeout connect 5s` (backend connection attempt timeout).
- Client Limiting:

```haproxy
backend app
    stick-table type ip size 1m expire 30s store conn_cur   # Track current connections per IP
    tcp-request connection track-sc1 src
    tcp-request connection reject if { sc1_conn_cur ge 10 } # Reject if IP has >= 10 connections
```
Optimized TLS Pattern #
Problem Solved: SSL/TLS processing bottlenecks (high CPU usage during handshakes).
Core Idea: Use hardware acceleration where possible, optimize cipher selection, and enable session reuse mechanisms.
Implementation:
Cloud Providers:
- How: Leverage built-in optimizations and hardware acceleration.
- Configuration/APIs:
- Hardware Acceleration: Usually handled automatically by the managed service.
- Security Policies: Select predefined TLS security policies (e.g., AWS `ELBSecurityPolicy-TLS-1-2-Ext-2018-06`, Azure App Gateway SSL Policies) that balance security and performance. Avoid legacy policies unless absolutely required, and check which cipher suites each policy includes.
- Session Resumption: Typically enabled by default (Session IDs/Tickets). Verify in the provider documentation whether specific controls exist.
- Protocols: Ensure modern protocols like TLS 1.2 and TLS 1.3 are enabled in the policy.
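For example, on AWS a predefined security policy can be attached to an existing HTTPS listener via the CLI; the listener ARN below is a placeholder:

```shell
# Attach a TLS 1.2/1.3 security policy to an HTTPS listener (ARN is illustrative)
aws elbv2 modify-listener \
  --listener-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc123/def456 \
  --ssl-policy ELBSecurityPolicy-TLS13-1-2-2021-06
```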
Self-Managed (NGINX, HAProxy):
- How: Explicit configuration of ciphers, protocols, session caching, and potentially linking against hardware acceleration libraries.
- Configuration/APIs:
- Hardware Acceleration: If your hardware supports it (e.g., Intel QAT, specialized cards), you may need specific builds of NGINX/HAProxy or OpenSSL compiled with engine support. Consult hardware/software documentation.
- NGINX (`nginx.conf`, within a `server` block that has `listen ... ssl`):

```nginx
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;

# Example: modern cipher suite prioritizing AEAD ciphers
ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384';
ssl_ecdh_curve prime256v1:secp384r1;  # Efficient curves

# Session cache (Session IDs)
ssl_session_cache shared:SSL:10m;  # 10MB cache (~40,000 sessions)
ssl_session_timeout 10m;

# Session tickets (alternative/complementary; keys need rotation)
ssl_session_tickets on;
# ssl_session_ticket_key /path/to/ticket.key;  # Needs secure generation and rotation
```
- HAProxy (`haproxy.cfg`, global defaults plus the `bind` line):

```haproxy
# ssl-default-bind-* directives live in the global section
global
    tune.ssl.default-dh-param 2048
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305
    ssl-default-bind-options ssl-min-ver TLSv1.2  # Enforce minimum TLS version

frontend myfrontend
    bind *:443 ssl crt /path/to/cert.pem alpn h2,http/1.1
    # Add ciphers/options on the bind line if not using the defaults
    # Session resumption is enabled by default; the session cache is automatic
```
Coordinated Scaling Pattern #
Problem Solved: Uneven scaling across tiers in a multi-layer LB architecture (e.g., external LB scales but internal LB doesn’t, creating a bottleneck).
Core Idea: Link or coordinate the scaling actions across different tiers based on relevant, potentially cross-tier, metrics.
Implementation: (This is more architectural and automation-focused than specific LB config)
Cloud Providers:
- How: Use cloud monitoring and autoscaling features, potentially with custom logic.
- Configuration/APIs:
- Tiered Monitoring: Set up monitoring (CloudWatch, Azure Monitor, Google Cloud Monitoring) for key metrics at each tier (e.g., Edge LB connections/throughput, Internal LB active connections/latency, App Server CPU/Memory/Queue).
- Independent Scaling: Ensure each tier (Edge LB Target Group ASG, Internal LB Target Group ASG, App Tier ASG) has its own autoscaling policies based on its most relevant metrics.
- Cross-Tier Triggers (Advanced):
- Use custom metrics or Lambda/Functions triggered by alarms. Example: if Edge LB `ActiveFlowCount` > threshold AND Internal LB `TargetResponseTime` > threshold, trigger scaling of the Internal LB’s ASG or even the App Tier’s ASG.
- AWS: CloudWatch Metric Math can create combined metrics for alarms; Step Functions or Lambda can implement more complex scaling logic.
- Azure: Azure Monitor action groups can trigger Azure Functions or Logic Apps.
- GCP: Cloud Monitoring alerting can trigger Cloud Functions or Pub/Sub.
- Capacity Headroom: Over-provision slightly or set scaling thresholds lower at downstream tiers so they can react before becoming bottlenecks when an upstream tier scales up traffic.
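The combined-trigger logic above can be sketched as a small decision function, e.g. as the body of a Lambda fired by a CloudWatch alarm. The metric names, thresholds, and action strings here are illustrative, not tied to any specific cloud API:

```python
from dataclasses import dataclass

@dataclass
class TierMetrics:
    edge_active_flows: int            # e.g. edge NLB ActiveFlowCount
    internal_response_time_ms: float  # e.g. internal LB TargetResponseTime

def scale_decision(m: TierMetrics,
                   flow_threshold: int = 50_000,
                   latency_threshold_ms: float = 250.0) -> str:
    """Pick which tier to scale, if any, from combined cross-tier signals."""
    edge_hot = m.edge_active_flows >= flow_threshold
    internal_slow = m.internal_response_time_ms >= latency_threshold_ms
    if edge_hot and internal_slow:
        # Upstream pressure plus downstream latency means the internal tier
        # is the bottleneck: scale it before the edge amplifies the load.
        return "scale-internal-tier"
    if edge_hot:
        return "scale-edge-tier"
    return "no-action"
```

The key design point is that neither signal alone identifies the internal tier as the bottleneck; only the conjunction does, which is what single-tier autoscaling policies miss.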
Self-Managed (NGINX, HAProxy) + Orchestration (Kubernetes, etc.):
- How: Combine LB metrics with orchestration platform scaling.
- Configuration/APIs:
- Metrics Exposure: Ensure your LBs expose detailed metrics (e.g., using `nginx-prometheus-exporter`, HAProxy’s built-in Prometheus endpoint, or the stats socket).
- Monitoring System: Scrape these metrics using Prometheus or similar.
- Kubernetes HPA: Use Horizontal Pod Autoscalers.
- Scale LB deployments based on their own CPU/Memory or custom metrics (e.g., active connections scraped by Prometheus Adapter).
- Scale backend application deployments based on their CPU/Memory, custom metrics (e.g., RPS per pod), or external metrics (like queue length).
- Custom Controllers/Operators: For complex coordination logic (similar to the cloud Lambda approach), a custom Kubernetes operator might observe metrics across different deployments/services and adjust HPA targets or replica counts directly.
- Service Mesh: If using a service mesh (Istio, Linkerd), leverage its observability and potentially traffic splitting capabilities to manage scaling and load distribution internally. The external LB still needs to scale appropriately.
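As a sketch of the HPA approach, the manifest below scales an NGINX LB deployment on a per-pod custom metric; it assumes a Prometheus Adapter exposes `nginx_active_connections`, and all names and targets are illustrative:

```yaml
# HPA scaling an LB deployment on a custom per-pod connections metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-lb
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-lb          # The LB-tier deployment (illustrative name)
  minReplicas: 3            # Keep headroom even at low traffic
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: nginx_active_connections   # Served via Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "10000"            # Scale out above ~10k conns/pod
```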