Metrics from a TCP Listener
SampleCount
- SampleCount of the RequestCount and Latency metrics is exactly the number of requests.
- SampleCount of SurgeQueueLength is roughly twice the number of requests.
- SampleCount of EstimatedProcessedBytes and the EstimatedALB* metrics is the number of load balancer nodes at 1-minute periods.
- SampleCount of HealthyHostCount and UnHealthyHostCount is a multiple of the number of load balancer nodes at 1-minute periods. The multiplier could be 18, 60, 66, or some other integer.
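The relationships above can be turned into simple arithmetic. The following is a minimal sketch, assuming you have already fetched the SampleCount statistics from CloudWatch yourself; the function names and the example numbers are hypothetical and only illustrate the ratios described in the list.

```python
def estimate_node_count(processed_bytes_samples: float, period_minutes: float) -> float:
    """Estimate load balancer nodes from SampleCount of EstimatedProcessedBytes.

    Assumes one sample per node per minute, as observed in the list above.
    """
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return processed_bytes_samples / period_minutes


def host_count_multiplier(host_count_samples: float, processed_bytes_samples: float) -> float:
    """Ratio of HealthyHostCount samples to node samples over the same period."""
    if processed_bytes_samples <= 0:
        raise ValueError("processed_bytes_samples must be positive")
    return host_count_samples / processed_bytes_samples


# Hypothetical numbers for a 5-minute window.
nodes = estimate_node_count(processed_bytes_samples=20, period_minutes=5)
multiplier = host_count_multiplier(host_count_samples=1200, processed_bytes_samples=20)
print(nodes, multiplier)
```

Both statistics must cover the same time window, otherwise the multiplier is meaningless.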
Connection Rate
When the TCP request rate is stable and no requests fail, the sum of EstimatedALBNewConnectionCount should be twice the sum of RequestCount, because connections established with both clients and targets are counted towards EstimatedALBNewConnectionCount.
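A quick way to check this relationship on your own data is to compare the two sums. This is a minimal sketch: the helper names and the 10% tolerance are assumptions for illustration, not values taken from the measurements.

```python
def connection_to_request_ratio(new_connection_sum: float, request_count_sum: float) -> float:
    """Sum(EstimatedALBNewConnectionCount) divided by Sum(RequestCount)."""
    if request_count_sum <= 0:
        raise ValueError("request_count_sum must be positive")
    return new_connection_sum / request_count_sum


def looks_like_stable_tcp(ratio: float, tolerance: float = 0.1) -> bool:
    """True when the ratio is within the (assumed) relative tolerance of 2."""
    return abs(ratio - 2.0) <= tolerance * 2.0


# Hypothetical sums over the same period.
ratio = connection_to_request_ratio(new_connection_sum=2000, request_count_sum=1000)
print(ratio, looks_like_stable_tcp(ratio))
```

A ratio well below 2 would suggest failed requests or an unstable request rate in the period.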
Average Waiting Time in Surge Queue
Warning
This is a thought experiment with GPT-4 Chat Demo. The resulting formula is based on an assumption that is not true.
The twofold relationship between SurgeQueueLength samples and the number of requests implies that each request triggers 2 samples.
Given that the maximum SurgeQueueLength we have observed is the per-node limit of 1024, we can assume that each sample only takes the surge queue length of that specific node.
Therefore, the total SurgeQueueLength of the load balancer should be Avg(SurgeQueueLength) multiplied by the number of nodes, assuming that each node handles the same amount of requests in each period.
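Applying Little's law (L = λW) to the surge queue, the average waiting time W is the total SurgeQueueLength L divided by the request arrival rate λ.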
From the CloudWatch data we have analyzed, we can assume that the ratio of nodes across Availability Zones (AZs) is the same as the ratio of HealthyHostCount in each AZ.
Therefore, to calculate the average latency per AZ, we could use per-AZ metrics and SampleCount of EstimatedProcessedBytes (which is not split by AZ) multiplied by that AZ's share of HealthyHostCount.
The resulting formula assumes that requests are evenly distributed across nodes.
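The reasoning in this section can be sketched in code. This is a reconstruction under stated assumptions, not the original formula: it applies Little's law with the total queue length taken as Avg(SurgeQueueLength) times the node count, and the arrival rate taken as Sum(RequestCount) divided by the period length. All function and parameter names are hypothetical, and the result inherits the even-distribution assumption.

```python
def per_az_node_count(total_nodes: float, healthy_hosts_in_az: float,
                      healthy_hosts_total: float) -> float:
    """Split the node count by the AZ's share of HealthyHostCount (assumed ratio)."""
    if healthy_hosts_total <= 0:
        raise ValueError("healthy_hosts_total must be positive")
    return total_nodes * healthy_hosts_in_az / healthy_hosts_total


def estimated_wait_seconds(avg_surge_queue_length: float, node_count: float,
                           request_count_sum: float, period_seconds: float = 60.0) -> float:
    """Little's law: W = L / lambda.

    L      = Avg(SurgeQueueLength) * node_count  (assumes evenly loaded nodes)
    lambda = Sum(RequestCount) / period_seconds
    """
    if request_count_sum <= 0 or period_seconds <= 0:
        raise ValueError("request_count_sum and period_seconds must be positive")
    total_queue_length = avg_surge_queue_length * node_count
    arrival_rate = request_count_sum / period_seconds
    return total_queue_length / arrival_rate


# Hypothetical one-minute data point for a single AZ.
az_nodes = per_az_node_count(total_nodes=4, healthy_hosts_in_az=3, healthy_hosts_total=6)
wait = estimated_wait_seconds(avg_surge_queue_length=10, node_count=az_nodes,
                              request_count_sum=600)
print(az_nodes, wait)
```

Because the estimate scales linearly with the node count, any error in the assumed per-AZ node ratio carries straight through to the waiting time.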
The Real Situation
Regarding the assumption in the above section, percentile statistics of EstimatedALBNewConnectionCount show that the distribution of connections across nodes is neither even nor stable. To get the total SurgeQueueLength, each node's samples must be normalized to the same weight, which is not possible from summary metrics.
To get accurate numbers, you need to enable access logs for your CLB and crunch the *_processing_time numbers.
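As a starting point for crunching the logs, the sketch below averages request_processing_time over a set of log lines. It assumes the documented Classic Load Balancer access log layout, where the three *_processing_time fields are the fifth to seventh space-separated fields, and it skips entries with negative values, which the load balancer writes when a time could not be measured. The function names and sample lines are hypothetical.

```python
import statistics
from typing import Iterable, Optional, Tuple


def parse_processing_times(line: str) -> Optional[Tuple[float, float, float]]:
    """Return (request, backend, response) processing times, or None if unparsable."""
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        request_t, backend_t, response_t = (float(v) for v in fields[4:7])
    except ValueError:
        return None
    return request_t, backend_t, response_t


def mean_request_processing_time(lines: Iterable[str]) -> Optional[float]:
    """Average request_processing_time in seconds, ignoring negative (unmeasured) values."""
    values = []
    for line in lines:
        times = parse_processing_times(line)
        if times is not None and times[0] >= 0:
            values.append(times[0])
    return statistics.mean(values) if values else None


# Hypothetical TCP listener entries.
sample_lines = [
    '2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 '
    '0.001 0.000028 0.000041 - - 82 305 "- - - " "-" - -',
    '2015-05-13T23:39:44.945958Z my-loadbalancer 192.168.131.39:2818 10.0.0.1:80 '
    '0.003 0.000030 0.000040 - - 82 305 "- - - " "-" - -',
    '2015-05-13T23:39:45.945958Z my-loadbalancer 192.168.131.39:2819 - '
    '-1 -1 -1 - - 0 0 "- - - " "-" - -',
]
print(mean_request_processing_time(sample_lines))
```

In practice you would read the lines from the log files delivered to your S3 bucket and may prefer percentiles over the mean.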