Normally when we talk about traffic at a CDN, we think about pushing data out to customers. While this seems like it should just involve straightforward transmissions, things can be a bit more complex, especially in systems with a global footprint, where the platform is made up of data centers that span the entire world. To understand exactly what we mean, we need to dig into some of the underlying behaviors and protocols that drive the CDN.
So first, let’s think about behaviors. Often, cache contents must be shifted between data centers to achieve the best client delivery. The files transferred can vary greatly in size, depending on the customer and the associated traffic. These transfers might also span the globe: data centers that are very far from each other may need to communicate regularly, incurring very high latencies.¹ Finally, there are almost always dependencies on these transfers: some other task (e.g. delivering the files to actual clients) can’t complete until they finish.
Data centers all over the world must frequently exchange data and messages.
What about the protocols? Since we need these connections to be reliable, we use TCP for almost all of this datacenter-to-datacenter communication. Unfortunately, TCP comes with a few overhead costs that make the scenarios above particularly tricky. Let’s dig into those a bit.
The first issue relates to how TCP establishes new connections. To begin a connection safely, TCP employs a technique called the three-way handshake: the client sends an initial message, called a SYN; the server responds with a SYN/ACK packet to acknowledge it; finally, the client sends a single ACK to finish the handshake. Either side can then begin sending data. While this method works extremely well for establishing a reliable transport link, it gets expensive when the client and server are very far apart. Indeed, performing the handshake across continents can take hundreds of milliseconds before any data flows.
The second challenge comes from how TCP introduces traffic into the network. To keep from overwhelming the underlying network, TCP’s fundamental design regulates the amount of traffic a sender injects with a process called slow start. With slow start, TCP only allows a sender to put a little bit of traffic into the connection before hearing a response back: usually about 15 KB (10 segments) in the first window on default Linux. This 10-segment limit is called the initial congestion window (initcwnd). While safe, this means that if you are trying to send a file that’s just a little bit bigger (say 16 KB), you’ll have to wait at least two whole round trip times (RTTs) to complete the transfer.
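To make that cost concrete, here is a small model of slow start (an illustrative sketch, not part of Riptide) that counts the round trips a transfer needs, assuming the window doubles each RTT and a typical 1,448-byte MSS:

```python
# Rough model of TCP slow start: each RTT the sender may transmit one
# congestion window's worth of segments, then the window doubles.
MSS = 1448  # assumed typical TCP segment payload, in bytes

def rtts_to_transfer(file_size_bytes, init_cwnd_segments=10):
    """Round trips needed to send a file, ignoring losses and
    assuming the window doubles every RTT (pure slow start)."""
    sent = 0
    cwnd = init_cwnd_segments
    rtts = 0
    while sent < file_size_bytes:
        sent += cwnd * MSS  # one window of data per round trip
        cwnd *= 2           # exponential growth during slow start
        rtts += 1
    return rtts

# A 14 KB file fits in the default 10-segment window: 1 RTT.
print(rtts_to_transfer(14_000))  # 1
# A 16 KB file just misses it and needs a second round trip.
print(rtts_to_transfer(16_384))  # 2
```

On a >150 ms path, that extra round trip alone adds more than 150 ms to the transfer, before even counting the handshake.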
A file that won’t fit in the initial window will require a second RTT, potentially adding significant delay.
Finally, you have to remember that all this lives inside a global network that is mostly focused on delivering traffic quickly to clients. That means that we don’t necessarily have the same configuration freedom that one would get when optimizing a dedicated set of machines. We therefore want to keep our solution as simple as possible, and avoid things like custom kernel modifications, or other difficult to manage configurations. In other words, we can’t easily just scrap TCP and start fresh.
So what can we do? Well, there are a few “simple” ideas that sound good at first pass but are ultimately untenable. For example, what if we just held all the connections open? After all, if we are communicating between data centers all the time, we have probably already opened a connection at some point. In theory this works great: existing connections avoid the need for handshakes and slow start. In the real world, however, this solution is not so great.
First, holding connections open indefinitely means we would eventually approach what we call a “full mesh” of connections: every server in every data center would have an open connection to all other data centers. While that avoids all the startup time penalties, it wastes a tremendous amount of resources, allocating buffers for connections that might go unused much of the time. Moreover, this solution isn’t even practically possible: regular application and network errors mean these connections often have to be closed, forcing us to suffer the original penalties anyway.
Another possibility is to effectively override slow start by allowing new connections to open with big initial congestion windows, say, big enough to let the majority of our transfers finish in one RTT. We would still pay the handshake penalty, but could save some time.²
The distribution of RTTs necessary to complete file transfers for different initial congestion window sizes using the distribution of files transferred in our network. Increasing the window to 50 segments allows 30% more files to complete in 1 RTT.
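The arithmetic behind that figure is simple (an illustrative calculation assuming a 1,448-byte MSS; the actual file-size distribution comes from our network and is not reproduced here):

```python
MSS = 1448  # assumed typical TCP segment payload, in bytes

def one_rtt_capacity(init_cwnd_segments):
    """Bytes that fit in the first window, i.e. the largest
    transfer that can complete in a single round trip."""
    return init_cwnd_segments * MSS

# The default Linux window of 10 segments covers roughly 14 KB;
# raising it to 50 segments covers transfers up to roughly 70 KB.
print(one_rtt_capacity(10))  # 14480
print(one_rtt_capacity(50))  # 72400
```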
But this comes with some serious risks: if we allow the servers to put too much traffic onto the network at once, they might overwhelm the underlying network (what slow-start is designed to avoid in the first place) during a big wave of traffic. While traditional connections might develop big congestion windows over time, they do so with feedback from the network. Here they would just dive off the deep-end, risking congestion in the network.
Based on the above constraints, we really want a solution that is going to be flexible and lightweight. Since we are still communicating over the wider internet and are committed to using TCP, we want to wisely use the resources available to our servers, but still be sensitive to the underlying network.
So what does Riptide do? The key is that it is based on observations. Riptide looks at the current conditions achieved by existing connections, and sets the initial congestion windows to match those values. Essentially, it subverts slow-start only as much as existing connections suggest is safe.
The new congestion windows are based on observations of existing connections to the same destination.
More specifically, Riptide first looks at the congestion windows achieved by a server’s existing connections to other data centers, which it observes by parsing the output of a standard Linux diagnostics tool.
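The original listing is not preserved above; as a hedged sketch, one way to collect these observations is with the Linux `ss` utility, whose `-ti` output reports a `cwnd:` value for each established TCP connection (the parsing details here are assumptions, not Riptide’s actual code):

```python
import re
import subprocess

# Matches the cwnd field in `ss -ti` style output, e.g. "cwnd:42".
CWND_RE = re.compile(r"\bcwnd:(\d+)")

def observed_cwnds(ss_output):
    """Extract every congestion window value (in segments) from
    `ss -ti` style output text."""
    return [int(m) for m in CWND_RE.findall(ss_output)]

def current_cwnds():
    """Run `ss -ti` and return the observed windows (requires Linux)."""
    out = subprocess.run(["ss", "-ti"], capture_output=True, text=True).stdout
    return observed_cwnds(out)

# Example against a hypothetical fragment of captured output:
sample = "cubic wscale:7,7 rto:204 rtt:0.5/0.25 mss:1448 cwnd:42"
print(observed_cwnds(sample))  # [42]
```

A real implementation would also group the observed windows by remote destination, since Riptide maintains a separate value per remote data center.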
It will then set new connections to each remote data center to use an initial congestion window that is the average of these current windows (using a moving average of historical values). It sets this value using Linux’s ip route tool and the initcwnd flag:

ip route add <remote_ip> dev eth0 proto static initcwnd <calculated window> via <default gateway>
In our implementation, we check and set these values once per second, and expire values that were set more than 10 minutes ago if we have seen no new connections to that destination.
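Putting the loop together, here is a simplified, hypothetical sketch of that bookkeeping (the EWMA weight, the class name, and the floor of 10 segments are illustrative assumptions, not Riptide’s actual parameters):

```python
import time

EXPIRY_SECONDS = 600   # drop destinations not observed for 10 minutes
ALPHA = 0.25           # EWMA weight for new observations (assumed value)

class WindowTracker:
    """Keeps a moving average of observed cwnds per remote destination."""

    def __init__(self):
        self._avg = {}        # destination -> smoothed cwnd (segments)
        self._last_seen = {}  # destination -> timestamp of last observation

    def observe(self, dest, cwnd, now=None):
        now = time.time() if now is None else now
        prev = self._avg.get(dest)
        self._avg[dest] = cwnd if prev is None else (1 - ALPHA) * prev + ALPHA * cwnd
        self._last_seen[dest] = now

    def expire(self, now=None):
        now = time.time() if now is None else now
        for dest, seen in list(self._last_seen.items()):
            if now - seen > EXPIRY_SECONDS:
                del self._avg[dest], self._last_seen[dest]

    def route_command(self, dest, gateway, dev="eth0"):
        """The `ip route` invocation that applies the computed window."""
        initcwnd = max(10, round(self._avg[dest]))  # never below the default
        return (f"ip route add {dest} dev {dev} proto static "
                f"initcwnd {initcwnd} via {gateway}")

tracker = WindowTracker()
tracker.observe("192.0.2.10", 40, now=0)
tracker.observe("192.0.2.10", 60, now=1)
# EWMA: 0.75 * 40 + 0.25 * 60 = 45 segments
print(tracker.route_command("192.0.2.10", "192.0.2.1"))
# ip route add 192.0.2.10 dev eth0 proto static initcwnd 45 via 192.0.2.1
```

In a deployment, the loop body would run once per second: collect observations, update the tracker, expire stale entries, and apply each route command via the shell.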
And that’s it! It all lives in user space, is wrapped up in a Python script, and can easily be run anywhere!
So how well does Riptide work?
To measure whether it was working, we wanted to see two things. First, we wanted to know that the congestion windows of our connections actually got bigger; without that, we wouldn’t expect any change. So the first thing we did was measure the windows before and after Riptide was running, while also varying the maximum initial window size Riptide was allowed to set. Here’s what we saw:
Observed congestion window sizes for different Riptide maximums. Here we see that even with a maximum of 50, the median window size increased by 100% to 50.
Here the label on each line gives the biggest window Riptide was allowed to set: the line labeled 50 means that Riptide could only set an initial window of 50 segments (though a window could organically grow larger than that if the traditional TCP processes saw fit). We see that the windows got quite a bit bigger, suggesting that Riptide was allowing more data to enter the network. We also saw diminishing returns in letting the system be more aggressive than initial windows of 100.
What about the transfer speeds? It’s certainly possible that the windows got bigger but didn’t ultimately help anything (i.e., the window size increase is a necessary but not sufficient condition). So next, we considered the completion time of probe traffic traversing these datacenter-to-datacenter connections. While we considered a number of conditions, let’s just take a look at the most interesting case: when the data centers are very far apart (>150 ms RTT in the network) and the probes are large (100 KB) relative to the default windows.
In the median case, Riptide was able to reduce transfer times by nearly 200ms, a human detectable difference.
Here we see that the transfers were often quite a bit faster, with the probe transfers regularly able to shave off an entire RTT. Anyone interested in more detail on our findings, as well as a more thorough description of the process and parameters, should take a look at the paper.
Overall, Riptide was able to address a notable operations challenge in our network, without having to drastically alter any system behavior, or underlying implementations. These kinds of optimizations allow us to keep our network flexible, while still improving many of our critical functionalities.
² As of 2014, large initial congestion windows were known to be popular with CDN providers, as seen here and more recently here. It’s important to note that those cases examine connections to end users; Riptide focuses only on datacenter-to-datacenter communication. ↩