Colin Rasor usually has crazy, interesting ideas lurking in the back of his head. Ask him, he’ll tell you some.
One of these was a wish for a concrete, reliable toolset that could implement the traffic management concepts he wanted to see most, something that could provide a visual representation of how traffic moved around the network during events, grooming, and outages, both internal and external. Large events present themselves much like the deluge of a rainstorm, a relentless demand for data that has to go somewhere. Being able to visualize both request and output flows gives us a method to manage that chaos, to understand the flow dynamics in detail and make better decisions, to anticipate, forestall, and mitigate failures and poor performance. The end result is lower latency, better reliability, and happy customers. It also means happier architects, engineers, and operations teams.
After pitching these ideas to Rob Peters, then CTO of Verizon Digital Media Services, Colin got the green light to assemble a team to make it happen. Collaborating with our R&D teams, he tapped Marcel Flores, a PhD computer scientist with a focus on networking, to turn theories into applied science. To build his toolset, he hired Aleksey Kim, a software developer with a performance engineering and analytics background, and Bill Nash, a network toolsmith with CDN and large-network experience. More recently, Prathyusha Ramanjaneyulu joined Verizon, bringing a Master's degree in Computer Systems Networking to the team. Using the data produced by Clownfish, she spearheads the sometimes complex interactions with operations teams to implement the changes needed to improve network performance.
Team assembled, where do we start?
Verizon’s Edgecast Content Delivery Network is awash in internal data platforms. Spider, one such in-house toolset, offered us API-driven access to network data, including the necessary ability to query routing information from the core and border routers in a non-disruptive manner. Stonefish, an in-house DDoS detection framework that samples incoming network requests to detect attacks, provides a custom-tailored analogue to industry-standard sFlow/IPFIX technologies. A mixture of other tools provided ancillary supporting data, like peering sessions, interface labels, and prefix labels.
For our database, we chose Elasticsearch, for a couple of principal reasons: it was fast, and it scaled sideways. This gave us a flexible key/value document store built atop a search engine, true multi-master I/O behavior, robust replication, multi-index search, and the ability to add compute and storage transparently while operating. All of that, and no SQL. What’s not to love? Given that we were consuming data from multiple different sources, a rigid relational model brought with it overhead that just didn’t make sense.
Needing an indeterminate amount of compute resources, we deployed on AWS. This gave us the ability to start small, with the critical ability to quickly add compute and storage resources as we grew. In a few instances, we needed to add resources quite expeditiously to maintain a stable environment as we created new aggregations of Stonefish data, the single largest data set we were ingesting. Mixing the same data in a few different ways led to staggeringly different levels of cardinality for each model.
Keeping within the ELK stack, we’re also leveraging Logstash, an incredibly flexible streaming data toolkit, capable of ingesting data from a number of sources and sending it back out to just as many. Already well adapted to Elasticsearch, the two combined can offer something of a poor man’s Splunk, even in their simplest implementation. With development and refinement, they become something of a Swiss Army chainsaw in the hands of a motivated developer.
Rounding things out, Kibana takes all of this and puts a pretty bow on it, bolting directly on to Elasticsearch and allowing inspection of collected data both in time-series and in aggregate. This jumpstarts visualization, turning hunches into useful visuals in very short order.
Our web front end is an Apache2/mod_perl Dancer environment (Plack/PSGI), leveraging Bootstrap and jQuery for quick, almost ridiculously painless implementation of user interfaces, and, in keeping with our data practices, a parallel RESTful JSON API. In many cases, we can simply query our native document structures from Elasticsearch and offer them directly to the user or downstream toolsets, without giving them direct access to the database.
Decisions we had to make (and the data that affected them)
We had to figure out how to query routing data, in a manner that didn’t impact the network. A full routing table is a hefty chunk of bits, and not all routing platforms make querying it simple. The potential for a negative impact on the network existed, so we had to exercise caution.
To do this, we first had to identify and classify BGP sessions based on their descriptions, so we could tailor our queries to return the smallest and most accurate data set possible. From a traffic management perspective, we also wanted to use our data along meaningful lines, like splitting up private peers from transit peers. We also needed to know whether we were looking at external BGP sessions, which may carry full routes, or internal sessions, which carry our own route announcements.
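As a sketch of that classification step: the description conventions below are hypothetical (our actual naming standards are internal), but the sorting logic looks something like this.

```python
import re

# Hypothetical session-description conventions; the real naming
# standards are internal, so these patterns are illustrative only.
CLASS_PATTERNS = [
    ("transit",  re.compile(r"^TRANSIT:", re.I)),
    ("private",  re.compile(r"^PNI:", re.I)),
    ("internal", re.compile(r"^IBGP:", re.I)),
]

def classify_session(description: str) -> str:
    """Map a BGP session description to a peering class."""
    for label, pattern in CLASS_PATTERNS:
        if pattern.search(description):
            return label
    return "unknown"

# Example sessions (addresses are documentation ranges, not real peers).
sessions = [
    {"peer": "203.0.113.1",  "description": "TRANSIT: ProviderA"},
    {"peer": "198.51.100.7", "description": "PNI: ExchangePeer"},
    {"peer": "10.0.0.2",     "description": "IBGP: core-rtr-02"},
]

# Bucket peers by class, so each class can be queried differently.
by_class = {}
for s in sessions:
    by_class.setdefault(classify_session(s["description"]), []).append(s["peer"])
```

Once sessions are bucketed this way, external sessions can be queried narrowly while internal sessions are asked only about our own announcements.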
For Stonefish, we focused on sampled SYN data, which offered immediate value when visualized. We wanted all of it, as soon as possible, while keeping the performance impact on the Stonefish cluster itself minimal. We query Stonefish once every minute for new raw data, keeping our query as simple and lightweight as possible and hitting data that’s still ‘hot’ in the indexes. Once we have the raw data, we apply a few different aggregations to it to generate views into TTLs, ASN footprints, and destination prefix summaries, which are then pushed into Elasticsearch to provide visual representations. Combined with Logstash’s native daily index curation, this lets us easily collect and visualize up to 30 days of historical data.
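Conceptually, each per-minute pass reduces the raw samples into those three views before they’re shipped to Elasticsearch. A minimal sketch, with field names that are assumptions rather than Stonefish’s actual schema:

```python
from collections import Counter

# Illustrative sampled-SYN records; field names are assumptions,
# not Stonefish's actual schema.
samples = [
    {"src_asn": 64500, "dst_prefix": "192.0.2.0/24",     "ttl": 57},
    {"src_asn": 64500, "dst_prefix": "192.0.2.0/24",     "ttl": 57},
    {"src_asn": 64501, "dst_prefix": "198.51.100.0/24",  "ttl": 118},
]

def aggregate(samples):
    """Reduce one minute of raw samples into the three summary views."""
    return {
        "ttls":          Counter(s["ttl"] for s in samples),
        "asn_footprint": Counter(s["src_asn"] for s in samples),
        "dst_prefixes":  Counter(s["dst_prefix"] for s in samples),
    }

views = aggregate(samples)
```

Each of the three counters becomes its own document stream, which is why the same raw data can produce staggeringly different cardinality per model.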
Stonefish ingest occurs quickly and runs constantly, leaving us barely a minute or two behind real time. For managing traffic during live events or peak periods, this permits a very reactive footing when changes are required. We get nearly immediate feedback as changes are applied. Routing data refreshes occur a handful of times a day, but with the addition of a reactive automation toolset listening to network events, we will soon move closer to a near-real-time footing and query for routing updates whenever changes to network configurations are detected.
Along the way, we also dipped into other internal toolsets and documentation for supplemental information, such as naming standards, device and interface inventories, geolocation markers, autonomous system registries, and the like. This pattern of relationships led to the name of the platform, as the clownfish is a symbiotic creature in a complex ecosystem, working to keep its host anemone clean and healthy.
No design survives first contact with the customer. Over time, iterative and event-based changes to routing policies, each a symptom of organic growth or a rapid response to a problem, introduced variations in announcements that were not always obvious to engineers, who often lacked the context of ‘why’ a configuration existed. As a result, the current configuration could conflict with the intended state of the designed function. The problem to solve in our first pass was exposing these variances between bandwidth providers and providing a way to observe the impacts as we corrected them.
In the first iteration, we focused purely on transit providers (though we’ve since expanded to our other interconnect classes). Clownfish’s web UI rendered two useful datasets: aggregated summaries of specific prefix announcements, organized by peering class (transit, private, internal services, etc.), and summaries of BGP sessions by peer, PoP (point of presence), class, or state. This allowed us to assess both peering health and the synchronicity of route announcements across all egress capacity available in each PoP.
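The variance check itself reduces to a set comparison: for each prefix, which transit peers in a PoP carry its announcement, and which are missing it. A minimal sketch, with hypothetical peer names (the real data comes from our routing queries):

```python
# Hypothetical announcement table for one PoP: prefix -> transit peers
# it is announced to. Peer names and prefixes are illustrative.
announcements = {
    "192.0.2.0/24":    {"transit-a", "transit-b", "transit-c"},
    "198.51.100.0/24": {"transit-a", "transit-c"},  # absent on transit-b
}

def find_variances(announcements):
    """Return prefixes whose announcements differ from the full peer set,
    mapped to the peers missing the announcement."""
    all_peers = set().union(*announcements.values())
    return {
        prefix: sorted(all_peers - peers)
        for prefix, peers in announcements.items()
        if peers != all_peers
    }

variances = find_variances(announcements)
```

Surfacing that delta per PoP is what turned previously invisible drift into a worklist the operations teams could act on.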
With Kibana, we crafted views into aggregated performance data around useful factors: PoP, autonomous system number, source prefix, and load-balanced virtual IPs. This provided a near-real-time set of performance gauges with which to evaluate the changes we recommended as we reviewed each PoP’s route announcements. Passing parameters from the UI supported the easy creation of dynamic dashboards and embedded graphs, building rich, informative interfaces with little effort.
Moving into our second iteration, we’re focusing on automation to drive alerting as soon as variance is detected, improving UI elements to provide more information, and refining visual elements like graphs and charts for speed and efficiency. There’s no small amount of tuning involved to make sure the data flows smoothly in and out of the cluster, but when multi-terabit-per-second events can run smoothly and resiliently as a result, it’s worth every ounce of effort. Additionally, we’ll begin considering cost-based metrics, adding another complex set of facets to the traffic management decision matrix.
Storage is a funny thing: you really want to plan out capacity, but it’s something you often don’t think about until it’s gone. That’s the kind of thing generally discovered at the most inopportune moment, like 2 a.m.
When we started, we had no idea how many gigabytes a day’s data would add up to, because we simply didn’t know what data we would have to work with. Knowing that we could rapidly scale if the need arose, we started with a basic two node Elasticsearch cluster, which rapidly grew to four. Then we added more storage to each. Then, we added two more nodes. Then two more non-datastore nodes, just for client access. All of this was achieved without (intentionally) taking the cluster offline. Anecdotally, a cluster crash is a good natural indicator of a lack of resources.
Data retention for time-series data currently sits at thirty days, enough time to see live event impacts compared to week-over-week trends. Routing data ages out after a mere seven days, given its relative non-volatility. We have the ability to model and store changes on a longer basis, but currently, there’s no requirement to do so. We’re also exploring platforms like OpenBMP to augment our toolset, giving us insight into events beyond our network edges. Our cluster is stable at this point, but we anticipate adding another two nodes and doing some tuning as we explore additional supplemental datasets and iterate on the design, making enhancements and applying feedback from other teams. This was a case where both Elasticsearch and AWS really outdid themselves, because we could solve resource issues at the first sign of trouble without stopping work or losing data.
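Because Logstash writes one index per day, the age-out logic is simple to reason about: parse the date stamp out of the index name and delete anything past the retention window. A minimal sketch, with illustrative index names and dates:

```python
from datetime import date, timedelta

def expired_indexes(indexes, prefix, retention_days, today):
    """Return daily indexes older than the retention window.

    Assumes Logstash-style daily naming, e.g. "stonefish-2016.05.10";
    the prefix and dates here are illustrative, not our actual names.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in indexes:
        if not name.startswith(prefix + "-"):
            continue
        try:
            # "YYYY.MM.DD" suffix -> date object
            day = date(*map(int, name[len(prefix) + 1:].split(".")))
        except ValueError:
            continue  # not a dated index for this prefix
        if day < cutoff:
            expired.append(name)
    return expired

old = expired_indexes(
    ["stonefish-2016.04.01", "stonefish-2016.05.10", "routes-2016.05.10"],
    "stonefish", retention_days=30, today=date(2016, 5, 15),
)
```

The names returned would then be handed to a deletion call against the cluster; keeping the decision logic separate from the delete makes the retention policy easy to test.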
The acid test of any toolkit is how quickly it lets you go from hunch to knowledge, uncertainty to fact, and surprise to solution. We definitely feel that we’ve hit that goal with this project. Anyone on the operations front can easily see changes in traffic patterns across locations, and correlate that information to impacts. Most notably, we can see the effects of our optimizations in not only our own toolsets, but third party RUM and synthetic testing systems. In short order, we made significant improvements to our global network latency and reliability, and people noticed.