name: agenda
class: middle, center

# AI SRE Summit 2026
## Komodor Academy

???
This is the intro slide. Welcome to AI SRE Summit. Talk to the audience, fill time.

---

# Schedule and Introductions

| Time          | Topic            | Description                                  |
|---------------|------------------|---------------------------------------------:|
| 00:00 - 00:10 | Introduction     | Overview of the workshop and objectives      |
| 00:10 - 00:30 | Komodor Overview | Why Komodor?                                 |
| 00:30 - 00:45 | Account Setup    | Setting up your Komodor account              |
| 00:45 - 01:00 | Hands-On #1      | Uh oh, what happened to ledgerwriter?        |
| 01:00 - 01:15 | Hands-On #2      | Customers not seeing new version of frontend |
| 01:15 - 01:30 | Hands-On #3      | userservice and accounts-db no longer talking |
| 01:30 - 01:45 | Hands-On #4      | What the heck is an OOMKill?                 |
| 01:45 - 02:00 | Hands-On #5      | I promise this is the correct version        |
| 02:00 - ???   | Conclusion       | Thanks for Attending! Q&A                    |

---

# Introduction
Why Komodor?
A brief history of
the universe
Komodor
--- class: middle, center ## Komodor Academy Workshops
Accessing your Lab Instance
--- ## Logging into Komodor
Retrieve the credentials provided by your instructor
Navigate to
app.komodor.com
--- ## Enter your information in the lab's Single Sign On Portal  --- ## Observe Bank of Anthos configured for your lab  --- ## The frontend service is accessible
Navigate to
https://$USERNAME.komodor-workshop.com
The application is fully functional. Log in and test some transactions.
--- ## Validate the application's frontend is accessible
Bank of Anthos is a fully functional web application with a microservices architecture on Kubernetes. You are the SRE.
Your customers use Bank of Anthos to manage their checking accounts.
--- class: middle, center ## Komodor Academy Workshops
Hands-On Exercises
--- class: middle, center ## Komodor Academy Workshops
Hands-On Exercise 1: What happened to ledgerwriter?
--- ## Hands-On Exercise 1: What happened to ledgerwriter?
Scenarios in the workshop are controlled by Jobs in your sandbox
Navigate to
Jobs
in the left navigation menu
Select
01-ledgerwriter
--- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd
Select the
Run now
button
This triggers the scenario to load
inside the deployment
Wait 1-2 minutes for the ledger-db service to come back online
--- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd
Navigate back to
Services
ledgerwriter
is now showing red, as an unhealthy service
Select the card to enter the service's
Events Timeline
--- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd  ??? Explain all the parts of the Events Timeline and why it's important. Have the attendee select "Investigate" from the events timeline to begin the Root Cause Analysis --- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd
Investigate
triggers a new Klaudia Root Cause Analysis
Who, What, Where, When, Why and How
Select the tab on the right side to open the Knowledge Graph
--- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd
Klaudia is a team of SRE experts at your beck and call
Hover over each card to learn what it investigated
Close the drawer, scroll down for the Suggested Remediation
??? Here you'll explain Klaudia's Agentic AI platform --- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd
The suggested remediation generates the
kubectl
command to run
Select Run, then confirm again with Klaudia to restart the deployment.
Close the drawer to validate the deployment is healthy
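
For reference, restarting a Deployment by hand is usually a single `kubectl rollout restart`. The sketch below only builds that command as an argument list so it can be inspected before anyone runs it. The helper name and the `default` namespace are assumptions for illustration, and the exact command Klaudia generates in your lab may differ.

```python
import shlex


def rollout_restart_command(deployment: str, namespace: str = "default") -> list[str]:
    """Build (but do not run) a kubectl command that restarts a Deployment."""
    # `kubectl rollout restart` triggers a rolling replacement of the Deployment's pods.
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


if __name__ == "__main__":
    cmd = rollout_restart_command("ledgerwriter")
    # Print a shell-safe string that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to hand to `subprocess.run` later without shell quoting problems.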
??? Here you'll explain Klaudia's Agentic AI platform --- ## Hands-On Exercise 1: What happened to ledgerwriter? Cont'd  --- class: middle, center ## Komodor Academy Workshops
Hands-On Exercise 2: Customers not seeing new version of frontend.
--- ## Hands-On Exercise 2: Customers not seeing new version of frontend
Navigate back to Jobs, select
02-frontend
As before, run the job in the Job overview
After several minutes, check the status of the deployment
--- ## Hands-On Exercise 2: Customers not seeing new version of frontend Cont'd  ??? The important thing to note here is the Failed Deploy --- ## Hands-On Exercise 2: Customers not seeing new version of frontend Cont'd  ??? Klaudia isn't just for Availability Events. In fact, this service is still up and running, but the deployment failed. Running Klaudia on a healthy service is possible, and it encourages SREs to be proactive about the software they support. --- ## Hands-On Exercise 2: Customers not seeing new version of frontend Cont'd
The Deployment's pod spec pins its pods to a specific node
This node doesn't exist in the current Kubernetes fleet
What does Klaudia suggest we do?
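
To see why a selector that names a missing node blocks the rollout, here is a minimal Python model of node selector matching. It is a simplified sketch, not the real scheduler: the node names, labels, and selector value are made up for illustration and are not the ones used in the lab.

```python
def node_matches(node_labels: dict[str, str], node_selector: dict[str, str]) -> bool:
    """A node is eligible only if it carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def schedulable_nodes(nodes: dict[str, dict[str, str]], node_selector: dict[str, str]) -> list[str]:
    """Return the names of the nodes whose labels satisfy the selector."""
    return [name for name, labels in nodes.items() if node_matches(labels, node_selector)]


if __name__ == "__main__":
    # Hypothetical fleet and selector, for illustration only.
    nodes = {
        "node-a": {"kubernetes.io/hostname": "node-a", "pool": "general"},
        "node-b": {"kubernetes.io/hostname": "node-b", "pool": "general"},
    }
    selector = {"kubernetes.io/hostname": "node-z"}  # no node carries this label value
    print(schedulable_nodes(nodes, selector))  # [] means the new pods have nowhere to run
```

With no eligible node, the new pods stay Pending while the old ReplicaSet keeps serving traffic, which matches what the timeline shows: a failed deploy on a service that is still up.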
--- ## Hands-On Exercise 2: Customers not seeing new version of frontend Cont'd
Klaudia suggests rolling back to the previous revision
As before, select Run to roll back, then check the events timeline
Bonus points: do you know the name of the node selector? (Hint: 1980s)
--- class: middle, center ## Komodor Academy Workshops
Hands-On Exercise 3: userservice and accounts-db no longer talking
--- ## Hands-On Exercise 3: userservice and accounts-db no longer talking
Navigate back to Jobs, select
03-accounts-db
As before, run the job in the Job overview
After several minutes, check the status of the deployment
--- ## Hands-On Exercise 3: userservice and accounts-db no longer talking  - The `userservice` is unhealthy - We implemented a `NetworkPolicy` - Why did that cause the service to be unhealthy? ??? Teaching about the importance of readiness probes. This is a Kubernetes best practice. --- ## Hands-On Exercise 3: userservice and accounts-db no longer talking  ??? Teaching about the importance of readiness probes. This is a Kubernetes best practice. --- ## Hands-On Exercise 3: userservice and accounts-db no longer talking  ??? Teaching about the importance of readiness probes. This is a Kubernetes best practice. --- ## Hands-On Exercise 3: userservice and accounts-db no longer talking  - Select the `Availability issue` to begin automatic Root Cause Analysis - Klaudia observes all changes to the service, including the network policy - Klaudia presents the RCA in 1-2 minutes, with a suggested fix ??? Teaching about the importance of readiness probes. This is a Kubernetes best practice. --- ## Hands-On Exercise 3: userservice and accounts-db no longer talking
Klaudia notes the exact time of the Network Policy implementation
The failing readiness probe is what revealed the impact of the Network Policy
The userservice continues to be down until we correct the error here. What's the fix?
--- ## Hands-On Exercise 3: userservice and accounts-db no longer talking
Klaudia doesn't suggest simply rolling back. She infers that userservice needs to talk to accounts-db
The readiness probe is a significant contextual clue on why rolling back isn't the right move
We can update the policy here on the cluster. Eventually we'd check this into source control.
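
The reason a patch beats a rollback can be shown with a small Python model of ingress rules. This is a deliberately simplified sketch that ignores namespaces and ports. It assumes the policy selects the `accounts-db` pods and that pods are labelled `app: userservice` and `app: accounts-db`; the policy in your lab may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class IngressPolicy:
    """Simplified NetworkPolicy: ingress rules with pod selectors only."""
    pod_selector: dict[str, str]  # pods the policy applies to
    allowed_from: list[dict[str, str]] = field(default_factory=list)  # allowed peers


def labels_match(labels: dict[str, str], selector: dict[str, str]) -> bool:
    """An empty selector matches every pod, as in Kubernetes."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policy: IngressPolicy, source: dict[str, str], dest: dict[str, str]) -> bool:
    # A pod that the policy does not select is unaffected by it.
    if not labels_match(dest, policy.pod_selector):
        return True
    # Once selected, only peers matching an allow rule get through.
    return any(labels_match(source, peer) for peer in policy.allowed_from)


if __name__ == "__main__":
    userservice = {"app": "userservice"}  # assumed labels
    accounts_db = {"app": "accounts-db"}
    broken = IngressPolicy(pod_selector=accounts_db)  # selects the pod, allows nobody
    patched = IngressPolicy(pod_selector=accounts_db, allowed_from=[userservice])
    print(ingress_allowed(broken, userservice, accounts_db))   # False
    print(ingress_allowed(patched, userservice, accounts_db))  # True
```

Patching keeps the policy's protection in place and adds only the one path the application needs, whereas a rollback would remove the policy entirely.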
??? Knowledge Base could help point Klaudia to suggest opening a PR in the future. --- ## Hands-On Exercise 3: userservice and accounts-db no longer talking  After the NetworkPolicy was patched, the readiness probe succeeds on the next pod restart. The service regains its healthy status. ??? Teaching about the importance of readiness probes. This is a Kubernetes best practice. --- class: middle, center ## Komodor Academy Workshops
Hands-On Exercise 4: What the heck is an OOMKill?
--- ## Hands-On Exercise 4: What the heck is an OOMKill?
Navigate back to Jobs, select
04-contacts
As before, run the job in the Job overview
After several minutes, navigate to Services to see the failed deployment
--- ## Hands-On Exercise 4: What the heck is an OOMKill? Cont'd  The service appears healthy, but the deployment has failed. The previous pod will remain online, but it's important for us to understand why the contacts service deployment is failing. Navigate into the events timeline, select the `Deploy` event, and start Klaudia RCA. ??? OOMKill notes --- ## Hands-On Exercise 4: What the heck is an OOMKill?
The contacts service had its memory limit reduced from
128Mi
to
64Mi
A container killed for exceeding its memory limit exits with code
137
The deployment failed, but the previous pods are still running. Resiliency!
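
Exit code 137 follows the common convention of 128 plus the signal number, and signal 9 is SIGKILL. The short Python sketch below decodes a code that way. It assumes a POSIX system, and the function name is illustrative. Note that 137 only says the container was killed; the `OOMKilled` reason on the container's last state is what confirms memory was the cause.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "exited normally" if code == 0 else f"exited with error code {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))  # 137 - 128 = 9, which is SIGKILL
```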
--- ## Hands-On Exercise 4: What the heck is an OOMKill? Cont'd
A
request
tells the scheduler how much memory or CPU to reserve for a container. Actual usage can go higher, up to the limit if one is set.
A
limit
is the hard cap on CPU or memory that the kubelet and kernel enforce for the container
If a container hits its CPU limit, it is throttled; if it exceeds its memory limit, the container is OOMKilled
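
A rough Python model of the memory side of this rule is sketched below. It is not how the kubelet or kernel is implemented, and it parses only the `Ki`, `Mi`, and `Gi` suffixes. The 100Mi usage figure and the 32Mi request are made-up values; only the 128Mi and 64Mi limits come from this exercise.

```python
# Binary suffixes used by Kubernetes memory quantities (case-sensitive).
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '64Mi' to bytes (Ki, Mi, Gi, or a plain integer)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def memory_outcome(wanted: str, request: str, limit: str) -> str:
    """Describe what happens when a container tries to use `wanted` memory."""
    want, req, lim = (parse_memory(q) for q in (wanted, request, limit))
    if want > lim:
        return "OOMKilled"  # memory cannot be throttled, so the container is killed
    if want > req:
        return "running above its request"
    return "running within its request"


if __name__ == "__main__":
    # Hypothetical workload that needs about 100Mi with a 32Mi request.
    print(memory_outcome("100Mi", "32Mi", "128Mi"))  # running above its request
    print(memory_outcome("100Mi", "32Mi", "64Mi"))   # OOMKilled
```

The same workload that ran comfortably under a 128Mi limit is killed once the limit drops to 64Mi, even though the request never changed.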
??? OOMKill notes --- ## Hands-On Exercise 4: What the heck is an OOMKill? Cont'd
The contacts service had its memory limit reduced from
128Mi
to
64Mi
A container killed for exceeding its memory limit exits with code
137
The deployment failed, but the previous pods are still running. Resiliency!
??? OOMKill notes --- ## Hands-On Exercise 4: What the heck is an OOMKill? Cont'd  ??? OOMKill notes --- class: middle, center ## Komodor Academy Workshops
Hands-On Exercise 5: I promise this is the correct version
--- ## Hands-On Exercise 5: I promise this is the correct version
Navigate back to Jobs, select
05-ledgerwriter
As before, run the job in the Job overview
After several minutes, check the status of the deployment under Services
--- ## Hands-On Exercise 5: I promise this is the correct version Cont'd  --- ## Hands-On Exercise 5: I promise this is the correct version
Our release pipeline's image tag was incorrect.
Klaudia lets us know that a registry-wide check confirms this is a tagging problem
Our previous container is still running the previous tag, so we are safe to roll back
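
The idea behind a tag check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Klaudia's implementation: the set of published tags is hard-coded here (a real check would query the registry), the image names and tags are invented, and digest references such as `name@sha256:...` are not handled.

```python
def split_image(image: str) -> tuple[str, str]:
    """Split 'repo/name:tag' into (repository, tag); the tag defaults to 'latest'."""
    # Only look for ':' after the last '/', so a registry port is not mistaken for a tag.
    last_slash = image.rfind("/")
    colon = image.find(":", last_slash + 1)
    if colon == -1:
        return image, "latest"
    return image[:colon], image[colon + 1:]


def tag_exists(image: str, published_tags: set[str]) -> bool:
    """Return True if the image's tag is among the tags the registry has published."""
    _, tag = split_image(image)
    return tag in published_tags


if __name__ == "__main__":
    # Hypothetical tags and image reference, for illustration only.
    published = {"v1.0.0", "v1.0.1"}
    requested = "registry.example.com:5000/ledgerwriter:v1.0.11"
    print(split_image(requested))
    print(tag_exists(requested, published))  # False: pulling this tag would fail
```

A tag that was never published cannot be pulled from any node, which is why the failure shows up as `ImagePullBackOff` rather than as a crash inside the container.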
--- ## Hands-On Exercise 5: I promise this is the correct version
Klaudia will roll back the deployment to the version that is currently running
This will stop the
ImagePullBackOff
error
Select
Run
to initiate the rollback sequence in Kubernetes
--- ## Hands-On Exercise 5: I promise this is the correct version Cont'd  --- ## Komodor Academy Workshops
Thank You!