Guest Writer

July 2, 2024

The coming eBPF revolution and why Kubernetes monitoring will never be the same

Abstractions always require tradeoffs.

With Kubernetes, DevOps teams can run applications across a wide variety of scales without having to think as much about resource allocation and autoscaling. But there’s a tradeoff: Kubernetes and the level of abstraction it calls for introduce a lot of complexity.

Complexity is high-risk. If you can’t monitor your environment well or track issues effectively, small flaws can hide and fester, eventually turning into costly problems. With cloud-native environments and microservices architectures, it can be especially difficult to manage logs, metrics, and traces, making these flaws and issues almost invisible for far too long.

Increasing observability is the standard approach, but problems abound: Are you exposing the right metrics? Are the logs actionable? Have you instrumented the code so that it collects traces?

The problem intensifies when your team is small. There’s an awkward (but common!) middle ground between startup and enterprise, between not having Kubernetes at all and having a fully-fledged DevOps team, where this complexity can sometimes feel unsolvable. Companies of this size often benefit from using Kubernetes but only have one or two DevOps engineers, making Kubernetes complexity difficult to manage.

Many teams end up stuck or overwhelmed. You need Kubernetes, and you need to be able to monitor your environment, but you don’t have the resources to staff up. Your team, if there’s even a second DevOps engineer, has too much Kubernetes monitoring and management work to do – seemingly at all times.

The solution lies far, far upstream—not in a change to Kubernetes itself or a Kubernetes management tool glued to your stack. With the introduction of eBPF and the resulting ability to modify Linux at the kernel level, a new level of Kubernetes monitoring is now possible. With Anteon, built with eBPF as a first principle, a granular Kubernetes monitoring tool becomes accessible to even the smallest DevOps teams.

A brief history of eBPF

The tech industry often feels dominated by the newest, fastest players, but under the hood, decades-old technologies tend to rule.

More than three decades after its debut, Linux has become a dominant operating system, running relatively few end-user machines but millions of web servers, supercomputers, and smart devices.

This success has made Linux, especially at the kernel level, hard to update. Any change would affect untold millions of devices and all the devices and systems that depend on them, so Linux maintainers are rightfully guarded. As a result of this carefulness, however, many potential innovations have struggled to germinate and spread—until eBPF.

Origins of eBPF

eBPF stands for Extended Berkeley Packet Filter, and the cleverness of the idea is embedded right in the name.

Before eBPF, developers attempted to put virtual-machine-level solutions into the Linux kernel, but the maintainers rejected the efforts. These rejections weren’t about the quality of the efforts but about the disruption they might have caused. Without those solutions available, though, developers relied on lackluster workarounds, which created the incentives for a different kind of solution.

Eventually, two developers – Alexei Starovoitov and Daniel Borkmann – started working together to solve the problem in a more significant, more lasting way. Instead of inserting something brand new into Linux, they extended the capabilities of an existing component, the Berkeley Packet Filter, into eBPF. They did so slowly and carefully over time, ensuring stability the whole way through and eventually building the groundwork for a technology that everyone looking for improved open-source monitoring is now eyeing.

The core difference that eBPF offers is programmability without disruption. With eBPF, the Linux kernel becomes programmable, which means developers can modify how Linux behaves without asking the Linux maintainers to change the codebase. A technology that was once nearly untouchable has become endlessly customizable.

The spread of eBPF

Until 2021, eBPF was exclusive to Linux, and the ability to work with it was largely limited to people with deep kernel experience. Tech leaders recognized early on that an industry-wide transformation would happen, but eBPF needed to become more accessible before the momentum really took off.

Major companies, such as Meta, Google, and Netflix, all have eBPF running in their infrastructures, and in 2021, Microsoft created eBPF for Windows, paving the way toward industry-wide eBPF standardization.

As these efforts and the benefits they offer become more widely known, smaller and smaller companies will pursue eBPF and eBPF-based tools en masse. The demand for Linux customization has been present for decades, so these changes – as big as they have been – are only the beginning.

Three core use cases of eBPF

A technology shift like eBPF brings too many benefits to name: The scale of the change itself is significant, but each change will create a cascade of secondary changes. The shift is less like Facebook emerging and defeating Myspace and Friendster and more like Rust emerging and quickly becoming a new paradigm for bare-metal development.

As of now, the core benefits of eBPF fall into three categories: performance, visibility, and freedom.

Performance improvements for companies with large networking and cloud expenditures

With eBPF, companies can iterate more rapidly, improve performance dramatically, and save enormous resources as a result. These improvements tend to scale with networking and cloud expenditures, which is a big reason why so many large companies have been early eBPF adopters.

Netflix, for example, has been contributing to eBPF since 2014 and built a network observability sidecar using eBPF, which has granted it much greater and much more performant observability. The sidecar, called Flow Exporter, debuted in 2015. Thanks to eBPF tracepoints, it can, according to a blog post from Netflix, “capture TCP flows at near real-time.”

For a company as big as Netflix, the ability to capture more and better data with little impact on performance is massive – and likely impossible without eBPF. “At much less than 1% of CPU and memory on the instance,” the post continues, “this highly performant sidecar provides flow data at scale for network insight.”

Netflix isn’t alone. Amazon, as another example, offers a threat detection service called Amazon GuardDuty and improved upon it by adding runtime monitoring.

In a post on the subject, Amazon explained that the agent has upper limits of 1000m for CPU and 1GB for memory. Although runtime monitoring puts some pressure on the node, “the only observable activity is the eBPF agent collecting data and forwarding it to Amazon GuardDuty for analysis.”

Thanks to eBPF, Amazon, like Netflix, was able to build and offer a new feature while ensuring minimal effect on performance.
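For readers unfamiliar with the units, the 1000m CPU limit quoted above is Kubernetes notation for 1,000 millicores, or one full CPU core. The sketch below converts such quantities to base units. The helper names are hypothetical, and whether the "1GB" in Amazon's post corresponds to the decimal `1G` or the binary `1Gi` suffix is an assumption, so both are shown.

```python
# Illustrative helpers (hypothetical names) for Kubernetes-style resource quantities.

def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity to cores: '1000m' (millicores) is 1.0 core."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Binary (Ki, Mi, Gi) and decimal (k, M, G) suffixes used for memory quantities.
_MEMORY_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '1G' or '1Gi' to bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


print(cpu_to_cores("1000m"))   # CPU limit expressed in cores
print(memory_to_bytes("1G"))   # decimal gigabyte (assumed reading of "1GB")
print(memory_to_bytes("1Gi"))  # binary gibibyte, for comparison
```

In other words, the agent is capped at roughly one core and about a gigabyte of memory per node, and the post reports that it is the only observable activity within that budget.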

Visibility improvements for companies operating cloud-native workloads split across microservices architectures

With eBPF, developers can gain a new level of visibility into their stack because eBPF observes the system from the kernel level. That is a boon to any company, but especially to those using cloud-native environments and microservices architectures.

LinkedIn, for example, built an eBPF agent called Skyfall that has allowed the company to dramatically improve observability despite infrastructure sprawl.

In a post about Skyfall, the company lays out the context: “With LinkedIn's large infrastructure growth over the past few years, observability has become more critical to pinpoint the potential root causes for any infrastructure failure or anomaly.” This is where a lot of companies find themselves, even ones that aren’t quite as big. Infrastructure growth continues, but observability has, until recently, failed to keep up.

“We realized that the most reliable source for this would be to get this information from the servers themselves through a lightweight, widely deployed agent,” the post continues. “With eBPF, instead of tracing each packet, we are tracing syscalls at a layer closer to the application layer and summarizing the data in-kernel to have minimal tracing overhead.”

Once again, similar to the examples from Netflix and Amazon, eBPF has enabled a company to do more with less – to increase observability while causing little to no impact on performance.
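The idea of summarizing data at the source instead of exporting every event can be sketched in a few lines. The example below is a plain user-space Python stand-in, not eBPF code and not LinkedIn's implementation: a real agent would keep these counters in an eBPF map inside the kernel. The event fields and the `summarize` function are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class SyscallEvent(NamedTuple):
    """A hypothetical record for one traced syscall."""
    pid: int
    syscall: str
    nbytes: int


def summarize(events: Iterable[SyscallEvent]) -> dict:
    """Fold raw events into per-(pid, syscall) totals.

    Only this compact summary would leave the kernel, rather than
    one record per event.
    """
    calls = Counter()
    total_bytes = Counter()
    for ev in events:
        key = (ev.pid, ev.syscall)
        calls[key] += 1
        total_bytes[key] += ev.nbytes
    return {key: (calls[key], total_bytes[key]) for key in calls}


events = [
    SyscallEvent(101, "sendto", 512),
    SyscallEvent(101, "sendto", 256),
    SyscallEvent(202, "recvfrom", 1024),
]
summary = summarize(events)
for (pid, syscall), (count, nbytes) in summary.items():
    print(pid, syscall, count, nbytes)
```

Three events collapse into two summary rows here. At production scale the same folding turns a very large stream of events into a small table, which is where the low overhead comes from.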

Meta provides another example of eBPF in action. As with LinkedIn, a post from Meta emphasizes the challenge of scale: “Our infrastructure supports thousands of services that handle billions of requests per second.” The goal of the work described in the post, enforcing encryption at scale, was therefore both an obvious improvement and a significant challenge.

The post continues, explaining why the engineering team built a tool called SSLWall, even though that strategy required kernel-level work: “Since we wanted to inspect every connection without needing any changes at the application level, we needed to do some work in the kernel context.”

That, of course, calls for eBPF: “We use eBPF extensively, and it provides all of the capabilities needed for SSLWall to achieve its goals.”

Freedom to innovate for any company running Linux

With eBPF, developers can build and run custom programs from within the Linux kernel, allowing them to add performance boosts, observability features, networking advancements, security features, and more.

This category is harder to capture than the previous two, but that speaks to its potential. With eBPF, the Linux kernel, which has been both essential and inflexible for decades, becomes an opportunity for innovation.

This is why Gobind Johar, a product manager for the Google Kubernetes Engine, and Varun Marupadi, a software engineer for the Google Kubernetes Engine, deem eBPF, in a post about security and visibility improvements, a “revolutionary technology.”


“Over the last few years,” Johar and Marupadi continue, “eBPF has become the standard way to address problems that previously relied on kernel changes or kernel modules.” They add that “eBPF has resulted in the development of a completely new generation of tooling in areas such as networking, security, and application profiling.”

And this is why there’s so much excitement about eBPF and its use cases: It’s a revolutionary technology unto itself, and it’s opened a new generation of tooling. No tooling category, however, will face as much change as soon as Kubernetes monitoring.

Why eBPF will transform Kubernetes monitoring

A step change at the infrastructure level – which eBPF clearly represents – will mean the rise of a new generation of open-source monitoring and observability tools.

The scale of this change is made more significant by the market surrounding it. Gartner, for example, predicts that more than 95% of new digital workloads will be deployed on cloud-native platforms by 2025. The growth in cloud-native platforms is, in effect, a growth in demand for eBPF and eBPF-based tooling.

In her eBook What is eBPF?, Liz Rice, Chief Open Source Officer at Isovalent, explains further, writing, “Companies are moving their software to the cloud, and they want observability tools, networking tools, and security tools to help them do this. eBPF is a great platform for people to create these tools on and that makes it a very disruptive infrastructure technology.”

eBPF follows a shift to microservices and distributed networks—a shift that has benefited many companies but has made it difficult for those companies to observe their systems until now. Until eBPF, problems could take months to figure out. With the way eBPF improves observability, companies can much more quickly diagnose, localize, and fix issues.

Toke Høiland-Jørgensen, a Senior Principal Kernel Engineer at Red Hat, explains, “For people who are not kernel developers, the Linux kernel is like a black box. eBPF opens that black box and allows you to gain information on how the system is working that you couldn’t get before.”

As we said in the introduction to this piece, the way eBPF improves observability is especially useful to companies trying to manage Kubernetes monitoring, and more useful still to companies that don’t have the resources to expand their DevOps staff.

With Anteon, for example, users only need to run an open-source eBPF agent as a DaemonSet on their Kubernetes cluster to get a service map, metrics dashboard, and distributed tracing – all of which they can use to monitor their clusters and detect bottlenecks.

With this approach, DevOps teams don’t need to instrument code, restart services, or use sidecars, which minimizes overhead and simplifies deployment. This makes Kubernetes monitoring tools accessible to small teams with few resources.
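To make the DaemonSet approach concrete, here is a minimal sketch of what a manifest for a generic eBPF agent might look like, built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). This is not Anteon's actual manifest: the agent name, image, and namespace are placeholders, and the `hostPID` and `privileged` settings are assumptions based on what kernel-level agents commonly need.

```python
import json


def build_agent_daemonset(name: str, image: str, namespace: str = "monitoring") -> dict:
    """Return a DaemonSet manifest that runs one agent pod on every node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Assumption: the agent needs to see host processes.
                    "hostPID": True,
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Assumption: loading eBPF programs needs elevated privileges.
                            "securityContext": {"privileged": True},
                        }
                    ],
                },
            },
        },
    }


# Placeholder name and image, for illustration only.
manifest = build_agent_daemonset("ebpf-agent", "example.com/ebpf-agent:latest")
print(json.dumps(manifest, indent=2))
```

Because a DaemonSet schedules one pod per node, the agent covers every workload on the cluster without any change to the application pods themselves, which is why no instrumentation, restarts, or sidecars are needed.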

Supported by eBPF, products like Anteon can enhance Kubernetes by providing users with real-time insights directly from the kernel. This is a fundamentally better way to detect and monitor performance issues and security threats because users can capture much more granular data without altering application code.

eBPF Kubernetes monitoring completes the cloud puzzle

The cloud arose decades ago, but it was never perfected.

In the years since, we’ve seen the rise of new architectural approaches and management frameworks, such as microservices and Kubernetes, but complexity has always outpaced many teams running cloud-native environments.

With eBPF, the industry slides another piece into the cloud puzzle, finally giving us the ability to observe cloud-based workloads, and with Anteon, Kubernetes monitoring becomes accessible to companies with as few as one DevOps engineer.

