Uncovering a Bug in Kyverno
[Image: error output reading “failed to mutate existing resource…”]
Observing Odd Behavior
During due diligence checks in a test Kubernetes (k8s) environment, I encountered unexpected behavior when analyzing UpdateRequest (UR) resources connected to a MutateExisting policy. When the policy was applied in isolation, it worked as expected; however, when placed in a CI/CD pipeline of quick verification checks, where resources were stood up and torn down methodically, Kyverno appeared to get “stuck.”
Infinitely Flipping States
I uncovered an infinite loop of Pending and Failed UR states. Immediately, I suspected some sort of race condition, whereby the quick deletion of the K8s resource obstructed Kyverno’s ability to process its mutation.
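Watching the UR objects directly makes the flipping easy to see. A quick sketch with kubectl (the .status.state field path is an assumption on my part, and the exact columns depend on your Kyverno version):

```bash
# Watch UpdateRequests in the kyverno namespace; the state flips between
# Pending and Failed as Kyverno keeps retrying the mutation.
kubectl get updaterequests -n kyverno --watch

# Or pull just the reported states.
kubectl get updaterequests -n kyverno \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state
```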
The Policy
The policy related to the URs was created to ensure that Ingress resources had the requisite annotations at all times, regardless of when the resource was created.
Mutating Resources with JMESPath
While the policy I developed is not necessarily unique, theoretically any MutateExisting policy applied in a similar manner should cause the infinite loop bug.
Kyverno is an incredibly powerful tool, and with JMESPath you can accomplish a great deal. The policy in question aligned with annotations found in the AWS Load Balancer Controller docs.
The full demo code can be found in my GitHub Bug Report.
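As a rough, hypothetical sketch only (not the verbatim demo code; the policy name, annotation keys, and values below are chosen for illustration), a MutateExisting rule of this shape conveys the idea:

```yaml
# Hypothetical sketch of a mutate-existing policy; not the verbatim demo code.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: ensure-alb-annotations
spec:
  rules:
    - name: add-alb-annotations
      match:
        any:
          - resources:
              kinds:
                - Ingress
      mutate:
        # The presence of `targets` is what makes this a mutate-existing rule,
        # which Kyverno processes asynchronously via UpdateRequests.
        targets:
          - apiVersion: networking.k8s.io/v1
            kind: Ingress
            name: "{{ request.object.metadata.name }}"
            namespace: "{{ request.object.metadata.namespace }}"
        patchStrategicMerge:
          metadata:
            annotations:
              # Example annotations from the AWS Load Balancer Controller docs.
              alb.ingress.kubernetes.io/scheme: internet-facing
              alb.ingress.kubernetes.io/target-type: ip
```

Note that mutate-existing rules also require granting Kyverno’s background controller permission to update the target resources.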
Scripting and Breaking Things
Using a simple bash script against a k3s cluster, I was able to replicate the issue at a high level.
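The original script is in the bug report as well; the sketch below (hypothetical names, and it assumes a policy like the one above is installed) captures the create-then-delete pattern that triggered the loop:

```bash
#!/usr/bin/env bash
# Rough reproduction sketch, not the original script from the bug report:
# repeatedly stand up and tear down a namespace containing an Ingress,
# then look at the UpdateRequests Kyverno leaves behind.
set -euo pipefail

for i in $(seq 1 10); do
  ns="kyverno-bug-${i}"
  kubectl create namespace "${ns}"

  cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
  namespace: ${ns}
spec:
  rules:
    - host: demo-${i}.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo
                port:
                  number: 80
EOF

  # Delete the namespace almost immediately, before Kyverno has finished
  # processing the UpdateRequest generated for the mutation.
  kubectl delete namespace "${ns}" --wait=false
done

# The leftover URs flip between Pending and Failed.
kubectl get updaterequests -n kyverno
```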
Security Concerns
These issues led me to a thought experiment around resource abuse. More specifically, if a threat actor or a naive engineer were to perform testing, creating and deleting resources, they could theoretically cause a buildup of URs infinitely looping in the background of a cluster, effectively executing a form of Denial of Service by spamming the cluster with objects that consume resources.
Without adequate monitoring, this could go unnoticed until ResourceQuotas, if they exist, stop new resources from getting created. Not to mention the ever-dreaded cloud spend consumed by a potentially uncontrolled deployment of infinitely spinning resources.
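If you do want a hard backstop, object-count quotas work for custom resources too. A sketch (the quota name and limit are arbitrary, and it assumes the URs accumulate in the kyverno namespace):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cap-updaterequests
  namespace: kyverno
spec:
  hard:
    # Object-count quota for the UpdateRequest custom resource;
    # the limit is illustrative only.
    count/updaterequests.kyverno.io: "500"
```

Keep in mind that once a quota like this is hit, legitimate background mutations are blocked as well, so it is more of a guardrail than a fix.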
In all, workarounds were required, and alerts targeting these objects became the middle ground for dealing with this bug until a fix could be applied. This made me wonder: to what extent is a company ethically bound to at least publish a disclaimer when it is unclear whether a bug has been fixed? I’d imagine this varies depending on the size of a team and its capacity. Admittedly, this is a CNCF incubating project, so that is definitely something to keep in mind.
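On the monitoring side, the alert condition can be as simple as periodically counting URs stuck in non-terminal states (again assuming the state is surfaced at .status.state):

```bash
# Count UpdateRequests reporting Pending or Failed and feed the number into
# whatever alerting pipeline you already have.
kubectl get updaterequests -n kyverno \
  -o jsonpath='{range .items[*]}{.status.state}{"\n"}{end}' \
  | grep -cE 'Pending|Failed' || true
```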
Six months later, with other engineers hopping on the thread to share that they too were impacted by this issue, Kyverno still struggled with handling objects that were deleted along with their accompanying K8s namespaces.
A Path Forward
Fast-forward eight months and a few minor releases, and the bug was still present. With the passage of time, a fresh revival of the original bug report was presented, sharing that the bug was in fact not corrected. In response, the Kyverno team appeared to implement TTL functionality on UR resources.
While I have not directly contributed to the fix in question, I found the evolving nature of this recurring bug fascinating. For me, it spoke to the complexity of building applications in Kubernetes, and the frustrating nature of bugs that refuse to die. We’ve all been there, though in my case on a much smaller scale and stage. Nevertheless, I learned a great deal about the inner workings of Kyverno and Kubernetes from a bug I discovered so many moons ago.