Imagine a world where the rules of big data processing are being rewritten before our eyes. That’s what Pinterest is attempting with Moka, a platform that moves its data processing off aging Hadoop systems and onto Kubernetes. But here’s where it gets controversial: can Kubernetes truly replace Hadoop as the backbone of large-scale data processing? Pinterest thinks so, and its journey is both a roadmap and a cautionary tale for the industry.
In a two-part blog series, Pinterest’s Big Data Platform team (Soam Acharya, Rainie Li, William Tom, and Ang Zhang) details the transition from Monarch, the company’s legacy Hadoop platform, to Moka. This isn’t just a tech upgrade; it’s a strategic pivot toward a cloud-native, Kubernetes-driven architecture on Amazon EKS. Part one covers the overall design and the application layer, while part two covers the infrastructure, along with lessons learned and future directions. The move isn’t just about adopting Kubernetes; it’s about treating Kubernetes as the control plane for data, an approach that’s gaining traction across the tech industry.
But why Kubernetes? The team evaluated alternatives on scalability, security, cost, and multi-engine support, and settled on a Kubernetes-based design. Moka runs Apache Spark as its primary engine while laying the groundwork for other frameworks such as Flink and Ray. This isn’t just about modernizing; it’s about future-proofing. Pinterest’s approach shows how to evolve a Hadoop-era platform without abandoning existing investments in Spark. And this is the part most people miss: Moka isn’t a Spark-only platform; it’s a flexible foundation designed to accommodate different processing engines based on workload needs.
One of the standout features of Moka is its focus on observability. By integrating logging with Fluent Bit, metrics with OpenTelemetry, and job history services, Pinterest’s engineers created a unified view of system health. This allows teams to debug and optimize jobs without getting bogged down in cluster complexities. But here’s the bold question: is this level of abstraction too good to be true? While it simplifies operations, does it risk isolating engineers from the underlying infrastructure? Let’s discuss in the comments.
Pinterest also tackled the challenge of multi-architecture support, ensuring Moka runs efficiently on both Intel and ARM-based instances, including AWS Graviton. This aligns with broader industry goals of cutting infrastructure costs without compromising performance. As InfoQ editor Eran Stiller noted, Moka delivers container-level isolation, ARM support, YuniKorn scheduling, and significant cost savings—a testament to its role as a reference architecture for cloud-native data systems.
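To make the multi-architecture and scheduling ideas concrete, here is a minimal Python sketch of how a Spark-on-Kubernetes job could be pinned to ARM or x86 nodes and handed to YuniKorn. It is not Pinterest’s actual configuration. The helper names (`build_spark_conf`, `to_submit_args`), the queue name, and the image URL are hypothetical; the `spark.kubernetes.*` keys and the `kubernetes.io/arch` node label come from the public Spark and Kubernetes documentation.

```python
from typing import Dict, List

# Values of the standard Kubernetes node label `kubernetes.io/arch`.
SUPPORTED_ARCHS = ("amd64", "arm64")  # arm64 covers AWS Graviton instances


def build_spark_conf(arch: str, queue: str, image: str) -> Dict[str, str]:
    """Build Spark-on-Kubernetes settings for one job (hypothetical helper)."""
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(f"unsupported architecture: {arch}")
    return {
        # The image must be built for the chosen architecture (or be multi-arch).
        "spark.kubernetes.container.image": image,
        # Schedule driver and executor pods only on nodes of this architecture.
        "spark.kubernetes.node.selector.kubernetes.io/arch": arch,
        # Ask Kubernetes to place the pods with YuniKorn, not the default scheduler.
        "spark.kubernetes.scheduler.name": "yunikorn",
        # YuniKorn reads the target queue from a pod label.
        "spark.kubernetes.driver.label.queue": queue,
        "spark.kubernetes.executor.label.queue": queue,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten the settings into `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Assumed values for illustration only.
    conf = build_spark_conf("arm64", "root.default", "example.com/spark:latest")
    print(" ".join(to_submit_args(conf)))
```

Keeping the architecture as a single per-job parameter means the same job definition can be tried on Graviton and moved back to x86 without touching anything else, which is roughly the kind of flexibility a multi-architecture platform is after.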
But here’s where it gets even more intriguing: Moka’s success with Spark has paved the way for other engines like Flink Batch and Ray. This touches a long-running industry debate: Spark vs. Flink. While Spark excels at batch and interactive analytics, Flink is purpose-built for real-time, stateful stream processing. Pinterest’s approach? Don’t choose sides; build a platform that supports both. This flexibility is Moka’s superpower, but it also raises questions about complexity and resource management. What’s your take?
External reactions to Moka have been overwhelmingly positive, with the ML Engineer newsletter praising its use of EKS clusters, logging, metrics pipelines, and a custom UI. Yet Pinterest’s team emphasizes that this is an ongoing journey, not a finished project. Their phased migration from Hadoop to Kubernetes involved “working out the kinks” as they shifted real workloads. The real lesson here isn’t the tech stack; it’s the process. Decoupling from legacy systems, investing in observability, and embracing multi-engine support are the hard parts. For other organizations, this process might be the most daunting, and the most critical, aspect of modernization.
So, here’s the big question: As Kubernetes continues to reshape the data processing landscape, will your organization follow Pinterest’s lead, or will you stick with legacy systems? And if you’re already on this path, what challenges have you faced? Let’s spark a conversation in the comments—because the future of big data processing is being written right now, and your insights could shape the next chapter.