Serializing to Parquet from Kafka with Exactly Once Guarantee

While building our new analytics pipeline, we had to implement a typical lambda architecture. In a lambda architecture, you implement one large processing flow for real-time data and another for batch data for the sake of performance, mainly because the optimization points for real-time and batch processing are almost entirely different. While designing and improving the batch portion of our new lambda architecture, we faced many challenges and learned helpful lessons. We hope a summary of these takeaways will offer insight to those implementing their own batch pipelines.

Riding the Riptide

The Problem

Normally when we talk about traffic at a CDN, we think about pushing data out to customers. While this seems like it should just involve straightforward transmissions, it turns out things can be a bit more complex.

Malloc Lives In Userspace (and why you care)

Python, Ruby, and Java handle memory management for us. Containerization and “serverless” operation let us do more and more on a single server and work at ever higher levels of abstraction. But they don’t actually make the server underneath go away, and the nitty-gritty details of kernels, memory allocation, and storage can still ruin your day. Learn a few lessons about memory based on our experience running a CDN at a scale large enough to serve Tbps of traffic.