As traffic volumes delivered by our CDN continue to increase, we continue to expand the footprint of our CDN both in terms of number and size of Points of Presence (PoP). As the sizes of PoPs grow, it is important to distribute load across their servers, in order to maintain their health and stability and to guarantee resilience to ephemeral traffic spikes.
Current load balancing system
Requests arriving at a PoP are load-balanced to different servers based on the request URI.
When a request arrives at a PoP, a front-end server processes the request and maps the request to a cache server, based on the requested URI. Consistent hashing (or, to be precise, Rendezvous Hashing achieved using CARP) guarantees that the same resource will always be mapped to the same cache server, as long as that server is healthy. This consistent mapping is crucial for performance, as directing a request to a server that does not have the requested resource in its cache will require fetching the resource from another server, increasing response latency. Importantly, consistent hashing also achieves a fairly uniform distribution of resources across all cache servers in the PoP, helping distribute the workload. However, different resources vary in popularity, and a server that serves a very popular resource can end up serving much more traffic than others in the PoP. For that reason, when a resource rapidly becomes very popular and starts using up a considerable fraction of a server’s network or CPU resources, a mechanism called Hot Filing kicks in. Hot Filing quickly replicates this resource on one or more additional servers and notifies the front-ends about which servers the replicas are on, resulting in a wider distribution of the load generated by the resource.
Potential room for improvement
The logic that triggers Hot Filing today is based on fixed thresholds which guarantee that no single resource is responsible for more than a predetermined rate of requests or bytes served by a server. If a resource load exceeds that threshold, it is distributed further. However, a challenge with fixed thresholds is that if a considerable fraction of a server’s load on a server is generated by resources that are just below that threshold, those resources will not be Hot Filed, and so their respective request load will not be distributed. Because of this, we have observed that even with the combination of resource-based load balancing and Hot Filing, the load distribution across servers in a PoP can often be uneven, with some servers delivering more than 2-3x the median load.
Uneven network load distribution is common in many PoPs at any given time, with some servers delivering up to 2x more traffic than the median.
This unevenness is not always a problem because servers can remain performant as long as they are not reaching their capacity. However, in extreme cases, the impact can be very apparent. The following figure illustrates a real example of this effect.
Two servers reaching or exceeding their network capacity, while the rest of the PoP is at delivering lower volumes of traffic. The outlier servers suffer obvious inflation on health check (response times).
In this snapshot of network load distribution at a particular PoP, while most of the servers are sending low volumes of traffic, there are two servers that are serving more than 2x the median load, reaching their capacity, due to one or a few very popular resources. This has an observable impact on their health check metric, which is a proxy of their minimum response time. Healthy values for this metric are typically below 1ms, and alerts are raised when they exceed 10ms. In this example, the health check increased to values above 100ms and persisted for at least one hour, during which the overloaded servers were likely performing poorly.
We have also observed cases where a few servers are persistently more loaded than the rest of the PoP for up to several days at a time. During such periods, these loaded servers are generally less resilient to incoming traffic spikes than other servers in the PoP. This is exacerbated during peak hours, as their load can reach or exceed their capacity, although there is available capacity in the rest of the PoP.
Adaptive request load balancing system
Based on these observations, we have been researching the idea of Hot Filing with dynamic thresholds. This approach considers the distribution of load across servers at any given time, as well as where each server lies in that distribution. Based on these conditions, a server-specific threshold is calculated as a function of the server’s place in the load distribution: a server with a load higher than the median is assigned a lower threshold than a server with a lower load, favoring more offloading for servers in the tail of the distribution.
Server thresholds are generated based on each server’s current place in the load distribution. A server with a load higher than the median is assigned a threshold lower than a server with a lower load.
More specifically, we define two parameters that control the level of Hot Filing:
- BaseThresh controls the baseline value for each server’s threshold. A server-specific threshold is derived from this value, adjusted for the server according to its current load.
- α ∈ (0, 1), controls how aggressively the algorithm will adjust weights for servers that are in higher need of offloading.
Then, for each server in the PoP we generate a weight W(s) ∈ (0, 2), which is inversely proportional to the server’s current load, using the formula:
- α ∈ (0, 1)
- BW(s) is the current server load
- BWmin is the lowest load server’s load
- BWmin is the highest load server’s load
Then, each server’s threshold T(s) is calculated as:
In our implementation, we configure BaseThresh to a value fitting our workloads, and we let the algorithm choose the value for α dynamically, such that more aggressive offloading is enforced on outlier (more loaded) servers if those servers are very far from the median in terms of load.
Evaluation using CDN production workloads
We evaluate our approach in simulation, using snapshots of production workloads. In order to measure the (change of) skewness in the distribution of load across the servers in a PoP, we define the “skewness factor”:
In other words, S measures how far away the most loaded server is from the median. For example, S=2 means that the most loaded server is delivering 2x the median load. Ideally, we want all servers to be close to the median, so lower values for S are better (with S=1 being the theoretical optimal).
The figure below shows how S changes over multiple iterations of the Hot Filing process based on dynamic-thresholds, along with the number of new Hot Files generated in every iteration, and the value for α picked in each iteration.
Load distribution after several back-to-back iterations of the algorithm. Load on the most loaded servers is reduced from 2.73x the median value 1.42x the median value, and traffic within the PoP becomes more evenly distributed.
The blue line (“start”) shows the starting state, which represents a real snapshot of the load distribution at a PoP. We note that S is decreased after each iteration until a point is reached where no more Hot Files (HF) are generated. As the tail is trimmed in each iteration, the distribution becomes more even, with more servers closer to the median load Next, we repeat the same experiment 10 times for 6 different PoPs:
Skewness factor (S) change for 6 PoPs. S converges in 5 iterations, decreasing by 92% on average.
Each group of bars in the figure represents a different PoP, and each bar within a group represents a subsequent iteration. The first bar in each group represents the starting state, drawn from production work-load snapshots. Each bar represents the values for 10 runs. We observe that in all cases, the decrease in S is much more dramatic in the first iteration than in any of the following iterations in which S is reaching more acceptable values. Importantly, the spread of the distribution (illustrated by the whiskers), as well as the outliers (diamonds), are reduced similarly. We observe that the load balancing mechanism converges after a small number of iterations, and would only need to be triggered when S becomes high due to new unbalanced traffic.
Distributing load across servers in a PoP is important for performance. While some degree of unevenness is expected due to rapidly popular resources, persistently overloaded servers despite available capacity in the rest of the PoP could suffer an impact on their performance and their resilience to more incoming traffic spikes. In this work, we explored an enhancement to our existing load-balancing mechanism which can help mitigate such load unevenness.
Results from simulations have been promising, so we are in the process of baking this change into our existing mechanism and monitoring the results in production. Testing this method in production will allow us to both get more realistic results and to evaluate additional factors that are harder to quantify in simulation. Such factors can include resilience to highly dynamic workloads, time of day effects, change in the level of resource replication as well as the associated overhead.