How Verizon Digital Media Services fixed a recent backbone provider issue for our customers by routing around bad transit, then kept working after our sites were secure to help identify the root of the problem and restore the backbone of the Internet for everyone.
[Update: to clarify some discussion in threads about this post: the EdgeCast content delivery network is owned by Verizon but operates its own AS 15133, distinct from the Verizon backbone AS 701. EdgeCast connects with multiple Tier 1 transit providers (of which the Verizon backbone is just one) at each of our datacenters in order to provide the best availability and performance to our customers. The situation described in this post involves a link inside one of those Tier 1 providers (which happened not to be Verizon in this case) used by EdgeCast.]
This past week our platform flagged a pattern of increased errors.
Our internal metrics indicated a number of errors when attempting to connect to our customer origins, an alarming sign. We identified a set of affected customers and began launching tests to those affected origins from across our global platform to try to narrow down the problem. Our tests showed that the issue was affecting only 0.6% of our fleet and was most visible on SSL connections.
The servers that we identified as being affected were consistently failing to download from customer origins. As a first step of troubleshooting we reviewed our Change Management logs to ensure we had not made any changes that could have adversely affected our network. After reviewing our logs we came to the conclusion that none of the changes that had been made during that time period could have caused the behavior we were seeing.
Next we took the list of affected servers and dug into the problem with what we like to call the Swiss Army knife of network troubleshooting: tcpdump. Looking through the packet traces we noticed clear signs of corruption on the wire: the TCP checksums on some of the packets did not match their contents.
Here is some data showing a large number of packets with invalid checksums coming in on the network.
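The check that flags these packets is the standard Internet checksum from RFC 1071: a one's-complement sum of 16-bit words. A minimal sketch (run over a raw byte string, ignoring the TCP pseudo-header for brevity) shows how even a single flipped bit changes the result:

```python
def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:                      # pad odd-length data with a zero byte
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:                     # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = b"Machine"
corrupt = b"Macxine"                       # one bit flipped: 'h' -> 'x'
print(hex(inet_checksum(payload)))
print(hex(inet_checksum(corrupt)))         # differs, so the flip is detectable
```

A real verifier would also fold in the TCP pseudo-header (source and destination addresses, protocol, length), but the arithmetic is the same.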
Upon closer inspection of these packet captures we saw that some of the bits were getting flipped. In this case “Machine” became “Macxine”. On unencrypted HTTP sessions this sort of corruption can go unnoticed or cause strange behavior, since the 16-bit TCP checksum is inadequate to detect multiple bit errors; with SSL, however, it shows up as hard errors, because the payload’s integrity is guaranteed by the cryptographic layer.
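Comparing the sent and received bytes makes the flip easy to pin down: XORing the two strings isolates exactly which bit changed. A quick sketch using the example above:

```python
sent, received = b"Machine", b"Macxine"

for offset, (a, b) in enumerate(zip(sent, received)):
    diff = a ^ b                     # XOR leaves only the differing bits set
    if diff:
        bit = diff.bit_length() - 1
        print(f"offset {offset}: {a:#04x} -> {b:#04x} (bit {bit} flipped 0 -> 1)")
# offset 3: 0x68 -> 0x78 (bit 4 flipped 0 -> 1)
```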
In order to see if there was a pattern to this behavior, we ran some additional tests with a uniform payload. What we got back was a consistent changing of a bit from zero to one, always at byte offset 13 within some multiple of 16 bytes. The particular multiple of 16 bytes modified, as well as how many were corrupted, varied between packets. So we confirmed that we had packet corruption caused by bit flipping but we still had to track down the source of these bad packets.
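That pattern check can be automated. The sketch below (with a simulated corruption, since the real captures aren't reproduced here) collects the differing offsets from a uniform payload and tests whether they all fall at byte 13 of some 16-byte block:

```python
def corrupted_offsets(sent: bytes, received: bytes) -> list[int]:
    """Byte offsets at which the received copy differs from what was sent."""
    return [i for i, (a, b) in enumerate(zip(sent, received)) if a != b]

payload = bytes(64)                  # uniform all-zero test payload
mangled = bytearray(payload)
mangled[13] |= 0x10                  # simulate the flip in block 0 ...
mangled[45] |= 0x10                  # ... and in block 2 (32 + 13)

offsets = corrupted_offsets(payload, bytes(mangled))
print(offsets)                                  # [13, 45]
print(all(o % 16 == 13 for o in offsets))       # True
```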
The fact that we were seeing the flipped bits at the same offset was a hint that some component along the way was experiencing a hardware problem, most likely bad memory in some device on the path: our servers, a network card, a switch, or even a router.
To see if we were dealing with a host-level problem, we started testing from a secondary IP address on our servers. We observed something interesting: although packets were arriving at the primary IP address with failed checksums, the secondary IP address didn’t have any issues. While this was not concrete proof, it indicated to us that the culprit was likely not the host itself.
Once we ruled out the host, we turned our focus to our data centers. Could it be one of the routers or switches within our data center? We checked to see whether the affected servers shared any switches and found no commonality. So we then had to turn our attention to our routers.
There are many ways to determine whether or not a router is involved, and if you have enough traffic, or sufficient sampling, it is possible to use flow data to track down such an issue. Flow analysis showed us both the outbound and inbound interfaces taken by packets destined for the servers experiencing the bit flipping. Using these results we were able to see that the affected servers had a backbone provider in common for their return traffic.
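Conceptually, the analysis reduces to a set intersection: for each affected server, collect the providers seen on its return traffic, then intersect. A toy sketch with made-up flow records (real data would come from sFlow/NetFlow collectors, with ingress interfaces mapped to providers):

```python
from functools import reduce

# Hypothetical (server, inbound-provider) pairs, as a flow collector
# might report them after mapping ingress interfaces to providers.
flows = [
    ("server-a", "transit-1"), ("server-a", "transit-2"),
    ("server-b", "transit-1"), ("server-b", "transit-3"),
    ("server-c", "transit-1"), ("server-c", "transit-2"),
]
affected = {"server-a", "server-b", "server-c"}

providers_per_server = {
    s: {prov for srv, prov in flows if srv == s} for s in affected
}
common = reduce(set.intersection, providers_per_server.values())
print(common)  # {'transit-1'}
```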
With that finding we developed a new hypothesis: that somewhere, somehow these packets were getting mangled by the backbone provider on their way back into our network. In order to test this latest hypothesis we decided to run an experiment by shifting traffic to other providers. We ran the experiment and, as we did, the errors and bit flipping vanished, proving that it was a transit problem after all.
At this point we knew that we were not out of the woods yet. This issue was much larger than it initially seemed, and it was likely impacting a large number of clients on the Internet. We reached out to some of our impacted customers and asked them to send traceroutes from their networks to ours so we could attempt to identify where the problem might lie. Slowly the picture started to become clearer. We realized that the problematic network paths all seemed to pass through Ashburn, VA.
©Ruslan Enikeev, Internet-map.net, 2012
Now we had a location at which to drill down into the transit provider’s network with our own testing. We ran full-mesh tests, asking all of our servers in the DC area to reach out to our servers in Helsinki. This would let us find a pair of our own servers whose traffic traversed the same router our customers’ traceroutes had implicated, and reveal packet corruption along the transit provider’s path.
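The receiving side of such a mesh test only needs a known payload and a digest comparison. A simplified sketch of the idea (the payload and helper here are illustrative, not our actual tooling):

```python
import hashlib

PAYLOAD = bytes(range(256)) * 4              # 1 KiB of known bytes
EXPECTED = hashlib.sha256(PAYLOAD).hexdigest()

def payload_intact(data: bytes) -> bool:
    """Receiver side: did the payload survive the path unmodified?"""
    return hashlib.sha256(data).hexdigest() == EXPECTED

# A single flipped bit in transit is caught immediately:
mangled = bytearray(PAYLOAD)
mangled[13] ^= 0x10
print(payload_intact(PAYLOAD))          # True
print(payload_intact(bytes(mangled)))   # False
```

Each DC-area sender transmits `PAYLOAD` to each Helsinki receiver; any pair whose digests mismatch marks a path worth tracing.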
At this point we engaged the transit provider and started sharing the evidence. Given the volume of traffic that we serve on the Internet this transit provider took the issue very seriously. The urgency was immediately recognized and it was escalated to their top engineers, who ran some tests.
At our request, the transit provider statically routed the corrupted flow, forcing it to take specific paths through their network in Ashburn. On the first route, the packet captures still showed the problem:
Then they tried the second route, and packet captures showed the issue went away:
We were then confident that, somehow, “Router C” was involved in the packet corruption, but we could not be sure whether the issue was from A -> C or from C -> Cloud. We ran another trace (A -> C -> B) and noticed it was still failing:
And this last test told us the issue was from A -> C.
The senior engineer assigned to work with us was patient enough to keep troubleshooting, and eventually determined that an uplink between A and C was the cause. We started routing over different link bundles between the two routers until we identified which link bundle was causing the issue:
We even went one step further by disabling individual links inside that bundle, where we finally identified the culprit.
Having nailed the underlying issue down to a particular LAG, we were then able to isolate the problem to a single port in the transit provider’s backbone router.
As a temporary fix, the transit provider disabled the problematic interface and contacted their hardware vendor to determine the best permanent fix. This was most likely a case of bad hardware (or a failing memory unit) that could be resolved with a line card replacement.
Many of us working on Internet infrastructure have seen a bad LAG or a bad port. But a port that passed traffic through just like the others while silently altering packets was something new.
The lesson learned here is that it is important to monitor and trend checksum failures and other hardware health indicators across your infrastructure. This plays a critical role in being able to quickly detect and resolve these black swan network events that would otherwise go unnoticed.
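On Linux, for example, the kernel already counts TCP segments dropped for bad checksums; trending those counters per host is a cheap early-warning signal. A sketch that parses `/proc/net/snmp`-style output (the sample below is abbreviated for illustration; real kernels expose many more fields, and the `InCsumErrors` counter requires a reasonably recent kernel):

```python
def tcp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the Tcp: header/value lines from /proc/net/snmp-style text."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    header, values = rows
    return dict(zip(header, map(int, values)))

# Abbreviated sample; a real /proc/net/snmp has many more columns.
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin InSegs InErrs InCsumErrors
Tcp: 1 200 123456 42 7
"""
print(tcp_counters(SAMPLE)["InCsumErrors"])  # 7
```

On a live host you would read the real file on an interval and alert when the counter's rate of change rises above a baseline.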
So far there is no good way to detect packet corruption at the transit level, and these sorts of issues, while uncommon, can have a large area of effect, so this is a great place for vendors to focus on developing new detection mechanisms and products. If you’re with a network vendor, or your company develops tools to detect network problems, please consider including these sorts of use cases and scenarios in your product development. It’s good not only for your customers but for the whole of the Internet.
In conclusion, while only a small subset of our customers were impacted, so was anyone else whose traffic crossed this part of the Internet’s backbone in Ashburn, VA. We are happy we were able to step in and help resolve the issue.