Over the past 6 months, I've watched the DevOps AI landscape explode from a handful of experimental tools to an overwhelming ecosystem of 500+ products, agents, and frameworks.
Every week there's a new "AI-powered" monitoring tool, a "ChatGPT for Kubernetes," or an "autonomous incident response agent." Most are vaporware. Some are genuinely transformative. Figuring out which is which takes hours of research you don't have.
So I did it for you.
Awesome DevOps AI is a curated list of 218 AI tools, agents, and MCP servers that actually work in production DevOps environments. Every tool has been evaluated, categorized, and documented. No hype, no sponsored placements, no affiliate links: just the tools that matter.
If you're using AI for infrastructure, this is your new homepage.
Why This List Exists
The problem with AI tooling in 2026 is discoverability, not availability.
There are AI tools for:
- Writing Terraform (12+ tools)
- Kubernetes troubleshooting (15+ tools)
- Incident response (20+ tools)
- Cost optimization (10+ tools)
- Log analysis (18+ tools)
But they're scattered across GitHub, Product Hunt, vendor marketing sites, and random blog posts. You either stumble across them by accident or you don't know they exist.
Awesome DevOps AI solves this. One list. 20 categories. 218 tools. Updated monthly.
What's Inside
AI Coding Agents for Infrastructure (15 tools)
The first category covers AI coding assistants that understand infrastructure code:
- Claude Code - Anthropic's agent that excels at large-scale Terraform refactoring and multi-file Kubernetes manifests
- Codex - OpenAI's autonomous agent with cloud sandbox execution, strong at IaC from natural language
- Cursor - AI-first IDE with inline Terraform/YAML completions
- GitHub Copilot - Integrated everywhere, best for inline completions
- Cline - Autonomous agent for VS Code that runs terminal commands and edits files
Plus 10 more including Devin, Continue, Amazon Q Developer, Sourcegraph Cody, and Windsurf.
Use case: You describe infrastructure in English ("create an EKS cluster with Karpenter and ALB controller"), the agent writes the Terraform + Kubernetes manifests.
AI-Powered Kubernetes (11 tools)
Tools specifically for Kubernetes cluster management and troubleshooting:
- K8sGPT - CNCF Sandbox project that scans clusters for issues and explains them in plain English
- HolmesGPT - Agentic troubleshooting combining observability telemetry with LLM reasoning
- Robusta - Monitoring platform with AI root cause analysis
- Komodor - AI-driven troubleshooting with change tracking and automated remediation
Plus kubectl-ai (Google's natural language kubectl plugin), KAITO (LLM inference operator), and more.
Use case: A pod is stuck in ImagePullBackOff. K8sGPT analyzes it and says: "ImagePullBackOff: Container image 'nginx:1.25' not found. Check image name and registry credentials."
AI-Powered Terraform and IaC (11 tools)
AI tools that enhance Infrastructure as Code workflows:
- Pulumi AI - Generates IaC programs from natural language across AWS/Azure/GCP/Kubernetes
- Brainboard - Visual Terraform designer with AI architecture generation from diagrams
- Firefly - Detects drift, generates Terraform from existing resources, manages IaC coverage gaps
- Infracost - Cloud cost estimates in PRs for 1,100+ resources
Plus AWS Terraform MCP Server, Spacelift AI, Env0, and HashiCorp's Terraform Copilot prompts.
Use case: You draw an architecture diagram. Brainboard generates the Terraform. Infracost estimates the monthly cost ($2,400/month). You review and apply.
AI Incident Response (12 tools)
AI systems that detect, investigate, and remediate production incidents:
- HolmesGPT - CNCF Sandbox agentic AI for automated incident investigation
- IncidentFox - Open-source AI SRE platform for hypothesis formation and fix suggestions
- Rootly - AI-powered incident management with automated timelines and AI-generated postmortems
- Shoreline - Converts runbooks into automated remediation executing across fleets
Plus PagerDuty AIOps, Moogsoft, Tracecat, FireHydrant, and BigPanda.
Use case: API latency spikes. HolmesGPT correlates metrics, logs, and traces. Identifies a database connection pool leak. Suggests scaling the pool. Shoreline auto-remediates.
AI Monitoring and Observability (14 tools)
AI-enhanced monitoring, alerting, and observability platforms:
- Datadog Bits AI - Natural language metric queries, root cause analysis, automated investigation
- Grafana AI - SRE agent for root cause analysis, adaptive telemetry for cost reduction
- Dynatrace Davis AI - Causal AI for automated root cause, impact assessment, predictive detection
- Metoro Guardian - Combines telemetry + code analysis for accurate RCA and auto-generated fix PRs
Plus New Relic AI, Splunk AI, Coralogix, Chronosphere, Groundcover, and Sumo Logic.
Use case: Memory leak alert fires. Grafana AI analyzes metrics, identifies the leaking service, generates a flamegraph, and suggests code changes.
AI Security Scanning (15 tools)
AI-powered security for infrastructure, containers, and supply chain:
- Snyk - DeepCode AI engine scans code, containers, IaC, and AI-generated code in real-time
- Wiz - Unifies vulnerability findings with cloud context to prioritize exploitable risks
- Checkov - Static analysis for IaC scanning Terraform, CloudFormation, K8s, Helm
- Trivy - Comprehensive open-source scanner for containers, IaC, Kubernetes, code
Plus Prisma Cloud, Orca Security, Lacework, Aqua, GitGuardian, Semgrep, and Falco.
Use case: PR with Terraform changes. Checkov scans and finds: "S3 bucket without encryption, security group with 0.0.0.0/0 ingress." Snyk suggests fixes inline.
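To give a feel for what a check like the security-group finding involves, here is a minimal Python sketch. It is not how Checkov works internally. It only walks a plan exported with terraform show -json and flags aws_security_group resources whose inline ingress rules allow 0.0.0.0/0 or ::/0. The function name open_ingress_findings and the sample plan are made up for illustration.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_ingress_findings(plan: dict) -> list[str]:
    """Return addresses of security groups whose ingress is open to the world."""
    findings = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_security_group":
            continue
        # "after" holds the planned attribute values; it is null for deletions.
        after = (change.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            cidrs = set(rule.get("cidr_blocks") or [])
            cidrs |= set(rule.get("ipv6_cidr_blocks") or [])
            if cidrs & OPEN_CIDRS:
                findings.append(change["address"])
                break  # one finding per security group is enough
    return findings


# Hand-written sample shaped like a fragment of `terraform show -json` output.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_security_group.web",
            "type": "aws_security_group",
            "change": {"after": {"ingress": [
                {"from_port": 22, "to_port": 22, "protocol": "tcp",
                 "cidr_blocks": ["0.0.0.0/0"]},
            ]}},
        },
        {
            "address": "aws_security_group.db",
            "type": "aws_security_group",
            "change": {"after": {"ingress": [
                {"from_port": 5432, "to_port": 5432, "protocol": "tcp",
                 "cidr_blocks": ["10.0.0.0/16"]},
            ]}},
        },
    ]
}

print(open_ingress_findings(sample_plan))  # ['aws_security_group.web']
```

This sketch only looks at inline ingress blocks. Rules defined as separate resources would need their own check, which is one reason to rely on a dedicated scanner for real reviews.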
AI Cost Optimization (10 tools)
AI for cloud cost management, FinOps, and resource optimization:
- CAST AI - AI-powered Kubernetes cost optimization with automated rightsizing and spot instance management
- Kubecost - Real-time K8s cost monitoring by service, deployment, namespace, container
- Turbonomic - IBM AI that continuously optimizes compute, storage, network allocation
- Vantage - Cloud cost transparency with AI recommendations across AWS, Azure, GCP, K8s, Datadog
Plus nOps, Spot by NetApp, CloudZero, Anodot, Finout, and OpenCost.
Use case: $12k monthly AWS bill. CAST AI finds overprovisioned pods, switches to spot instances, rightsizes resources. New bill: $4,800/month (60% savings).
MCP Servers for DevOps (12 servers)
Model Context Protocol servers that give AI assistants direct access to DevOps tools:
- AWS MCP Servers - Official suite covering Terraform, CDK, CloudFormation, Lambda, S3, CloudWatch, ECS
- GitHub MCP Server - Official server for repos, issues, PRs, Actions, code search
- Kubernetes MCP Server - kubectl operations, pod management, cluster introspection
- Docker MCP Gateway - Docker-maintained server for container management and Compose workflows
Plus Terraform MCP Server (HashiCorp), Cloudflare MCP, Atlassian MCP (Jira/Confluence), Slack MCP, and Sentry MCP.
Use case: You ask Claude: "What pods are failing in production?" The Kubernetes MCP Server runs kubectl get pods --all-namespaces -o wide and returns the results. Claude analyzes and suggests fixes.
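One way to picture the step between the question and the kubectl call is a small translation layer that only ever builds read-only commands. The sketch below is an assumption about how such a layer could look, not the implementation of any particular Kubernetes MCP server. The names READ_ONLY_VERBS and kubectl_argv are hypothetical.

```python
READ_ONLY_VERBS = {"get", "describe"}


def kubectl_argv(verb, resource, *, namespace=None, all_namespaces=False, output=None):
    """Build an argument list for a read-only kubectl call, or raise ValueError."""
    if verb not in READ_ONLY_VERBS:
        raise ValueError(f"verb {verb!r} is not on the read-only allowlist")
    if output and verb != "get":
        raise ValueError("an output format is only supported with 'get' here")
    argv = ["kubectl", verb, resource]
    if all_namespaces:
        argv.append("--all-namespaces")
    elif namespace:
        argv += ["-n", namespace]
    if output:
        argv += ["-o", output]
    return argv


# The command from the use case above, built as an argument list.
print(" ".join(kubectl_argv("get", "pods", all_namespaces=True, output="wide")))
# kubectl get pods --all-namespaces -o wide
```

Returning a list rather than a shell string means the result can be handed to subprocess.run without shell interpolation. Rejecting verbs such as delete keeps the assistant from changing cluster state by accident.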
AI-Powered CI/CD (10 tools)
AI tools that enhance continuous integration and delivery:
- PR-Agent - Auto-describes, reviews, improves, and generates tests for PRs
- GitLab Duo - AI across the entire DevSecOps platform with code suggestions, root cause analysis, vulnerability resolution
- Harness AIDA - AI Development Assistant for intelligent pipeline creation and failure analysis
- Trunk - AI-powered code quality checks, merge queues, flaky test management
Plus Mergify, CircleCI AI, Dagger, ArgoCD, Tekton, and Codefresh.
Use case: PR submitted. PR-Agent auto-generates description, reviews for security issues, suggests improvements, and generates missing unit tests, all before human review.
AI Log Analysis and Debugging (9 tools)
AI tools for log analysis, pattern detection, and debugging:
- LogAI - Salesforce's open-source toolkit with ML for anomaly detection, clustering, summarization
- VectorLog - AI log analysis with natural language querying and pattern recognition
- Loki - Log aggregation paired with Grafana AI for intelligent querying
Plus Elasticsearch with ES|QL, Axiom, OpenObserve, Vector, Fluentd, and Logstash.
Use case: 500k error logs. LogAI clusters them into 12 patterns. Top pattern: "Database connection timeout" (78% of errors). AI suggests increasing connection pool.
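The clustering step in that use case can be approximated with very little code. The sketch below does not use LogAI's API. It uses a much simpler technique, template masking with the Python standard library: variable parts of each line such as numbers and hex values are replaced with placeholders, and identical templates are counted. The function names and sample logs are invented for illustration.

```python
import re
from collections import Counter

_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUM = re.compile(r"\d+")


def template(line: str) -> str:
    """Mask variable tokens so lines that differ only in values collapse together."""
    line = _HEX.sub("<HEX>", line)
    return _NUM.sub("<N>", line)


def top_patterns(lines, n=3):
    """Return (template, count, percent of all lines) for the n most common templates."""
    counts = Counter(template(line) for line in lines)
    total = sum(counts.values())
    return [(pattern, count, 100 * count / total)
            for pattern, count in counts.most_common(n)]


sample_logs = [
    "Database connection timeout after 3000 ms (conn 17)",
    "Database connection timeout after 3000 ms (conn 42)",
    "Database connection timeout after 5000 ms (conn 8)",
    "Failed to parse request id=1234",
]

for pattern, count, percent in top_patterns(sample_logs):
    print(f"{percent:5.1f}%  {count:>3}  {pattern}")
```

On the four sample lines, the timeout template accounts for 3 of 4 entries (75%). Real toolkits add smarter parsing and anomaly scoring on top, but the core idea of grouping by template is the same.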
Categories I Didn't Even List Above
The repository also covers:
- AI Agent Frameworks (8 tools) - LangChain, CrewAI, AutoGen, etc.
- AI for Platform Engineering (6 tools) - Port, Backstage, Kratix
- AI for Database Operations (5 tools) - EverSQL, Metis, pganalyze
- AI for Networking (4 tools) - Cilium, Calico Enterprise
- AI for Chaos Engineering (3 tools) - Gremlin, LitmusChaos
- AI for GitOps (5 tools) - Fleet, Flux, Rancher
- System Prompt Templates (repositories of CLAUDE.md templates)
- Learning Resources (tutorials, courses, certifications)
- Community and Newsletters (where to stay updated)
Total: 218 tools across 20 categories
Why "Awesome"?
The list follows the Awesome List guidelines, which means:
- Curated, not comprehensive - Every tool was evaluated for quality and usefulness
- No dead links - Automated link checking ensures everything works
- No sponsorships - Tools are included because they're good, not because someone paid
- Community-driven - Anyone can submit PRs to add or update tools
- Actively maintained - Updated monthly with new tools and category changes
The repository has:
- Awesome List badge
- Automated link checking (GitHub Actions)
- Awesome lint validation
- GitHub Pages site: hammadhaqqani.github.io/awesome-devops-ai/
Real Use Cases from Production
Here's how I've used tools from this list in the past 3 months:
Use Case 1: Kubernetes Troubleshooting
Problem: 15 pods failing to start. Support tickets piling up.
Tools used:
- K8sGPT - Scanned cluster, identified 3 root causes:
  - 8 pods: ImagePullBackOff (typo in image tag)
  - 5 pods: OOMKilled (memory limits too low)
  - 2 pods: ConfigMap missing (deleted by accident)
- Robusta - Auto-created tickets with context for each issue
- kubectl-ai - "Fix all ImagePullBackOff issues in namespace prod" → Generated and applied patches
Time saved: 45 minutes → 5 minutes
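The triage in this use case, sorting failing pods into a few root causes, can be sketched against the JSON that kubectl get pods -o json returns. This is not how K8sGPT works. It is a minimal illustration that reads the standard containerStatuses fields (ready, state.waiting.reason, lastState.terminated.reason). The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def failure_reason(container_status: dict):
    """Pick the most specific reason available for a container that is not ready."""
    last = (container_status.get("lastState") or {}).get("terminated") or {}
    if last.get("reason") and last["reason"] != "Completed":
        return last["reason"]  # e.g. OOMKilled behind a CrashLoopBackOff
    waiting = (container_status.get("state") or {}).get("waiting") or {}
    return waiting.get("reason")  # e.g. ImagePullBackOff


def group_by_reason(pod_list: dict) -> dict:
    """Map each failure reason to the namespace/name of the affected pods."""
    groups = defaultdict(list)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for status in (pod.get("status") or {}).get("containerStatuses") or []:
            if status.get("ready"):
                continue  # healthy container, nothing to report
            reason = failure_reason(status)
            if reason:
                groups[reason].append(f'{meta.get("namespace")}/{meta.get("name")}')
                break  # count each pod once
    return dict(groups)


# Hand-written sample shaped like `kubectl get pods -o json` output.
sample = {"items": [
    {"metadata": {"namespace": "prod", "name": "api-1"},
     "status": {"containerStatuses": [
         {"ready": False, "state": {"waiting": {"reason": "ImagePullBackOff"}}}]}},
    {"metadata": {"namespace": "prod", "name": "worker-1"},
     "status": {"containerStatuses": [
         {"ready": False,
          "state": {"waiting": {"reason": "CrashLoopBackOff"}},
          "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
    {"metadata": {"namespace": "prod", "name": "web-1"},
     "status": {"containerStatuses": [
         {"ready": True, "state": {"running": {}}}]}},
]}

print(group_by_reason(sample))
```

Preferring the last termination reason over the waiting reason is a deliberate choice here: a pod in CrashLoopBackOff that was previously OOMKilled is more usefully filed under OOMKilled. A pod that is simply still starting (ContainerCreating) would also show up, so a real tool would filter further.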
Use Case 2: Terraform Refactoring
Problem: 200-file Terraform monorepo. Need to extract VPC module. Manual refactoring would take days.
Tools used:
- Claude Code - Analyzed entire repo, identified VPC resources across 47 files
- Claude Code - Extracted to module, updated all references, generated README
- Infracost - Verified no cost changes from refactoring
- Checkov - Scanned for security issues introduced during refactoring
Time saved: 3 days → 4 hours
Use Case 3: Incident Response
Problem: API latency spiked from 50ms to 2,500ms. Customers complaining.
Tools used:
- HolmesGPT - Correlated metrics, logs, traces. Hypothesis: "Database connection pool exhausted"
- Datadog Bits AI - Confirmed database connections at 100/100
- Shoreline - Auto-remediated: increased connection pool to 200, restarted pods
- Rootly - Generated postmortem draft with timeline and root cause
Time to resolve: 3 minutes (fully automated)
Use Case 4: Cost Optimization
Problem: Monthly AWS bill: $18k. CFO wants 30% reduction.
Tools used:
- CAST AI - Analyzed Kubernetes cluster, found massive overprovisioning
- Kubecost - Identified top 10 expensive services
- Vantage - Found $4k in unused RDS snapshots, abandoned EBS volumes
- CAST AI - Implemented recommendations (rightsizing, spot instances, autoscaling)
Result: Monthly bill reduced to $11,200 (38% savings, exceeded target)
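The percentage is easy to verify. Here is a small Python sketch of the arithmetic, using the figures from this use case and from the CAST AI example earlier in the post. The helper name savings_percent is just for illustration.

```python
def savings_percent(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


# Use Case 4: $18,000 -> $11,200 per month, against a 30% target.
reduction = savings_percent(18_000, 11_200)
print(round(reduction, 1))   # 37.8, which rounds to the 38% quoted above
print(reduction >= 30)       # True, so the target was met

# Earlier CAST AI example: $12,000 -> $4,800 per month.
print(savings_percent(12_000, 4_800))  # 60.0
```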
How to Use This List
Strategy 1: Start with Your Biggest Pain Point
Find the category that matches your problem:
- Pods crashing → AI-Powered Kubernetes
- Terraform taking too long → AI Coding Agents
- Incidents taking hours to resolve → AI Incident Response
- Cloud bill too high → AI Cost Optimization
- Security vulnerabilities → AI Security Scanning
Pick 2-3 tools from that category. Test them. Keep what works.
Strategy 2: Stack Multiple Tools
Best results come from combining tools:
Example stack for incident response:
Alert fires (PagerDuty)
↓
HolmesGPT investigates
↓
Datadog Bits AI correlates metrics
↓
Shoreline auto-remediates
↓
Rootly generates postmortem
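Structurally, a stack like this is a chain of stages that each enrich a shared incident record. The Python sketch below shows that shape only. Every stage is a placeholder with hard-coded or trivial logic, and none of them calls a real product API. All names and the sample metrics are assumptions made for the example.

```python
from typing import Callable

Incident = dict
Stage = Callable[[Incident], Incident]


def investigate(incident: Incident) -> Incident:
    # Placeholder for an investigation step that proposes a hypothesis.
    return {**incident, "hypothesis": "database connection pool exhausted"}


def correlate(incident: Incident) -> Incident:
    # Placeholder check: confirm the hypothesis only if the pool is saturated.
    metrics = incident["metrics"]
    confirmed = metrics["db_connections"] >= metrics["db_pool_size"]
    return {**incident, "confirmed": confirmed}


def remediate(incident: Incident) -> Incident:
    # Only act automatically on a confirmed hypothesis; otherwise hand off.
    if not incident.get("confirmed"):
        return {**incident, "action": "escalate to on-call"}
    return {**incident, "action": "increase pool size and restart pods"}


def postmortem(incident: Incident) -> Incident:
    summary = f'{incident["title"]}: {incident["hypothesis"]} -> {incident["action"]}'
    return {**incident, "postmortem": summary}


def run_pipeline(incident: Incident, stages: list[Stage]) -> Incident:
    """Pass the incident through each stage in order."""
    for stage in stages:
        incident = stage(incident)
    return incident


alert = {"title": "API latency spike",
         "metrics": {"db_connections": 100, "db_pool_size": 100}}
result = run_pipeline(alert, [investigate, correlate, remediate, postmortem])
print(result["postmortem"])
```

The useful property of this shape is the gate in remediate: automation acts only when an independent signal confirms the hypothesis, and otherwise escalates to a human.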
Strategy 3: Use MCP Servers for AI Assistants
If you use Claude, ChatGPT, or Cursor, install MCP servers to give them superpowers:
- AWS MCP Server - Claude can read CloudFormation templates, query Lambda logs
- Kubernetes MCP Server - Claude can run kubectl commands, analyze pods
- GitHub MCP Server - Claude can search code, create issues, review PRs
- Terraform MCP Server - Claude can search modules, check documentation
Result: Your AI assistant becomes an actual DevOps engineer.
Contributing
This is a community project. Open an issue or submit a PR if you:
- Know a tool that's missing
- Found a broken link
- Want to add a category
- Have a better description
Guidelines:
- Tool must be production-ready (no vaporware)
- Must have documentation
- Must be actively maintained (updated in last 6 months)
- No affiliate links or sponsorships
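The "actively maintained" guideline is the easiest one to check mechanically. As a rough sketch, assuming "6 months" is treated as 183 days and that the date of a tool's last update is already known, the check could look like this. The names MAX_AGE and is_actively_maintained are hypothetical and not part of the repository's tooling.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=183)  # assumption: "6 months" approximated as 183 days


def is_actively_maintained(last_update: date, today: date) -> bool:
    """True if the last update falls within the maintenance window."""
    return today - last_update <= MAX_AGE


print(is_actively_maintained(date(2026, 1, 15), date(2026, 3, 1)))  # True
print(is_actively_maintained(date(2025, 6, 1), date(2026, 3, 1)))   # False
```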
Stay Updated
The DevOps AI landscape changes weekly. New tools launch. Old tools shut down. Categories emerge.
Ways to stay current:
- Star the repo - Get notified of updates: github.com/hammadhaqqani/awesome-devops-ai
- Watch releases - GitHub notifications when major updates happen
- Subscribe to my newsletter - Monthly DevOps AI roundup (coming soon)
- Follow on X - @hammadhaqqani for real-time updates
Tools I Built (Also on the List)
I practice what I preach. Here are my contributions to the DevOps AI ecosystem:
Free Browser-Based Tools (71 total)
All free, no signup, run entirely in your browser:
AI & LLM Tools:
- Token Counter - Estimate costs before API calls (23 models)
- Prompt Injection Scanner - Detect 18 injection attack patterns
- Context Window Visualizer - See how prompts fill model contexts
- Fine-Tuning Data Formatter - Convert CSV/JSON to JSONL for training
DevOps Tools:
- YAML Validator - Validate K8s manifests, GitHub Actions
- Docker Run to Compose - Convert commands to docker-compose.yml
- Cron Expression Builder - Visual builder with presets
- CIDR Calculator - Network calculations
Open Source Projects
- Claude Code DevOps Toolkit - CLAUDE.md templates, prompts, automation scripts
- Kubernetes GitOps Blueprint - Production K8s with ArgoCD, Helm, Kustomize
- Terraform AWS Self-Healing Infra - Auto-remediation with CloudWatch + Lambda
- DevOps Interview Handbook - Comprehensive interview prep
All listed in the Awesome DevOps AI repository.
The Future of DevOps AI
Three trends I'm tracking:
1. Fully Autonomous SRE Agents
We're moving from "AI-assisted" to "AI-autonomous":
- Today: AI suggests fixes, you apply them
- 2027: AI detects, investigates, remediates, and documents, all without human intervention
Tools like HolmesGPT, Shoreline, and IncidentFox are early versions of this.
2. MCP Servers Everywhere
Model Context Protocol is becoming the standard for AI-tool integration:
- Every DevOps tool will have an MCP server
- AI assistants will orchestrate entire workflows across tools
- No more copying outputs between terminals
AWS, GitHub, Docker, HashiCorp, and Cloudflare already have official MCP servers.
3. AI-First Infrastructure
New infrastructure will be designed for AI from day one:
- Kubernetes operators that self-tune based on AI recommendations
- Terraform modules that auto-optimize based on cost/performance models
- CI/CD pipelines that self-heal based on failure patterns
This is already happening with CAST AI for Kubernetes and Harness AIDA for CI/CD.
What to Do Next
- Star the Awesome DevOps AI repository
- Pick your biggest pain point (K8s, Terraform, incidents, cost, security)
- Try 2-3 tools from the relevant category
- Share feedback - Open an issue or PR if you find something useful
- Spread the word - Help other engineers discover these tools
The faster we adopt AI in DevOps, the less time we spend firefighting and the more time we spend building.
Repository: github.com/hammadhaqqani/awesome-devops-ai
Website: hammadhaqqani.github.io/awesome-devops-ai/
Further Reading:
- How to Use AI Coding Assistants for Infrastructure
- Prompt Injection Attacks: How to Prevent Them
- Fine-Tuning vs RAG: When to Use Each
- LLM Context Windows Explained
Questions about any of these tools? Email me at phaqqani@gmail.com or find me on LinkedIn.
The DevOps AI revolution is happening now. This list is your map.