Anycast is widely used by Content Delivery Networks (CDNs) and for the Domain Name System (DNS) to efficiently route service to clients from multiple physical points-of-presence (PoPs).
Anycast depends on Border Gateway Protocol (BGP) routing to map users to PoPs. Therefore, its efficiency depends on both the CDN operator and the routing policies of ISPs on the path. Such a distributed environment makes detecting and diagnosing inefficiency challenging.
To overcome this challenge, we at the University of Southern California / Information Sciences Institute, in collaboration with Verizon Media Platform, designed and have been testing Bidirectional Anycast/Unicast Probing (BAUP), a diagnostic technique for operators who use anycast infrastructure and want to improve the latency of their services.
In our study, we applied BAUP to the Verizon Media Platform CDN and, in doing so, cut its median latency by more than half (from 40ms to 16ms) for a set of regions with more than 100K users.
How BAUP works
BAUP involves probing in both forward and reverse directions between vantage points (VPs) and the CDN by using both the unicast and anycast address of the CDN. A simple illustration of applying BAUP is shown in Figure 1.
Figure 1 — BAUP launches probes from VPs to both the anycast and unicast address of the CDN, and launches probes from the anycast and unicast address of the CDN to the VPs.
There are two steps to using BAUP:
First, operators should use a group of VPs (simulating the client side) to probe both the anycast and unicast addresses of the CDN. Second, for VPs that see a large difference between anycast and unicast probing, operators can run traceroutes to locate where the high delay is introduced.
In the first step, the latency measured by anycast probing is the latency users currently experience, since those probes go to the anycast address of the CDN. If unicast probing shows a smaller latency, an alternative path exists with a lower round-trip time than the current one.
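The first step can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the study: the function name `flag_candidates`, the input layout, and both thresholds (`min_gap_ms`, `min_ratio`) are assumptions chosen for the example.

```python
from statistics import median


def flag_candidates(rtts, min_gap_ms=10.0, min_ratio=1.5):
    """Return VPs whose anycast RTT is much higher than their unicast RTT.

    rtts maps a VP identifier to a pair of lists:
    (anycast RTT samples in ms, unicast RTT samples in ms).
    Both thresholds are illustrative assumptions, not values from the study.
    """
    flagged = []
    for vp, (anycast, unicast) in rtts.items():
        if not anycast or not unicast:
            continue  # skip VPs with no replies on either address
        a = median(anycast)
        u = median(unicast)
        # Require both an absolute and a relative gap so that small,
        # noisy differences are not reported.
        if a - u >= min_gap_ms and a >= min_ratio * u:
            flagged.append((vp, a, u))
    # Largest anycast/unicast gap first.
    flagged.sort(key=lambda item: item[1] - item[2], reverse=True)
    return flagged
```

Using the median of several samples per VP, rather than a single probe, makes the comparison less sensitive to one-off spikes. The flagged VPs are the ones worth following up with traceroutes in the second step.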
In the second step, we ran traceroutes in both directions, covering the pairs (VP, CDN_anycast), (VP, CDN_unicast), and (CDN_unicast, VP). An abnormally high hop-to-hop delay in the anycast traceroute pointed us to the problematic segment of the path. Moreover, the unicast path, with its shorter delay, gives operators some visibility into how to fix the problem.
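One simple way to look for an abnormally high hop-to-hop delay is to find the largest RTT increase between consecutive responding hops. The sketch below assumes the traceroute has already been parsed into `(hop_label, rtt_ms)` pairs; the function name and the input format are hypothetical.

```python
def largest_hop_jump(hops):
    """Find the largest RTT increase between consecutive responding hops.

    hops is a list of (hop_label, rtt_ms) in path order, where rtt_ms is
    None for hops that did not reply. Returns (from_label, to_label, jump_ms),
    or None when fewer than two hops replied.
    """
    responding = [(label, rtt) for label, rtt in hops if rtt is not None]
    best = None
    for (prev_label, prev_rtt), (label, rtt) in zip(responding, responding[1:]):
        jump = rtt - prev_rtt
        if best is None or jump > best[2]:
            best = (prev_label, label, jump)
    return best
```

Per-hop RTTs from traceroute are noisy and include the return path, so a large jump is a hint rather than proof. Comparing the result against the unicast and reverse-direction traceroutes helps confirm which segment is responsible.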
BAUP reduces latency by half
While BAUP can help diagnose high latency in anycast infrastructure in general, we used it as a diagnostic method on the Verizon Media Platform CDN, with about 10K publicly available RIPE Atlas probes serving as the VPs.
We first identified latency differences between anycast probing and unicast probing, and found that 1.59% of VPs had much shorter latency in unicast probing — a point of interest for further study for potential latency improvements.
We then sought to identify the problematic hops and found three Autonomous Systems (ASes) that were adding delay between the VPs and CDN_anycast and were affecting tens of VPs. One of the three is adjacent to the CDN and can therefore be managed relatively easily; we call this AS "AS-H".
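To see which ASes matter most, the slow hops found per VP can be grouped by AS and ranked by how many distinct VPs each one affects. The sketch below assumes each slow hop has already been mapped to an AS by some other means; the function name and the `(vp_id, asn)` input format are hypothetical.

```python
from collections import defaultdict


def ases_by_affected_vps(slow_hops):
    """Rank ASes by the number of distinct VPs with a slow hop inside them.

    slow_hops is an iterable of (vp_id, asn) pairs, one per slow hop.
    Returns a list of (asn, vp_count), most-affected AS first.
    """
    vps_per_as = defaultdict(set)
    for vp, asn in slow_hops:
        vps_per_as[asn].add(vp)  # a set counts each VP once per AS
    return sorted(
        ((asn, len(vps)) for asn, vps in vps_per_as.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

An AS that appears for many VPs is a better target for a routing change than one that affects a single VP.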
Two primary routing options are available to the CDN operator in this case. First, it could withdraw its announcements to AS-H completely, but that risks leaving some clients of this AS with poor connectivity to the CDN. Alternatively, the CDN could use a community string to request that AS-H refrain from propagating the CDN's anycast route to its own peers, preventing traffic from more remote networks from using this path.
After applying the second option, we found that the vast majority of VPs that had previously passed through AS-H saw a significant reduction in latency: their median latency dropped from 40ms to 16ms. These VPs are spread across 91 ASes in 19 economies, covering more than 100,000 users. This large improvement was in tail latency: before the change, the median latency of our observers among these 100,000 users sat at the 86th percentile of all users.
We recommend BAUP to operators that use anycast infrastructure to deploy services. This study was a joint collaboration between myself, Lan Wei, and John Heidemann from the University of Southern California / Information Sciences Institute, and Marcel Flores and Harkeerat Bedi from Verizon Media Platform. Read our paper and watch a recording of our presentation at TMA 2020 below.
The original version of this post appeared on the APNIC blog. Special thanks to the APNIC team.