March 11, 2026 · AI + Cloud · 13 min read

I Curated 219 AI Tools for DevOps Engineers (And You Can Use Them All for Free)

Tags: DevOps, AI, Automation, Tools, Kubernetes, Terraform, CI/CD, Infrastructure, SRE, Platform Engineering, Open Source

Over the past 6 months, I've watched the DevOps AI landscape explode from a handful of experimental tools to an overwhelming ecosystem of 500+ products, agents, and frameworks.

Every week there's a new "AI-powered" monitoring tool, a "ChatGPT for Kubernetes," or an "autonomous incident response agent." Most are vaporware. Some are genuinely transformative. Figuring out which is which takes hours of research you don't have.

So I did it for you.

Awesome DevOps AI is a curated list of 219 AI tools, agents, and MCP servers that actually work in production DevOps environments. Every tool has been evaluated, categorized, and documented. No hype, no sponsored placements, no affiliate links, just the tools that matter.

If you're using AI for infrastructure, this is your new homepage.

Star the repo here

Why This List Exists

The problem with AI tooling in 2026 is discoverability, not availability.

There are AI tools for:

Writing Terraform (12+ tools)
Kubernetes troubleshooting (15+ tools)
Incident response (20+ tools)
Cost optimization (10+ tools)
Log analysis (18+ tools)

But they're scattered across GitHub, Product Hunt, vendor marketing sites, and random blog posts. You either stumble across them by accident or you don't know they exist.

Awesome DevOps AI solves this. One list. 20 categories. 219 tools. Updated monthly.

What's Inside

AI Coding Agents for Infrastructure (15 tools)

The first category covers AI coding assistants that understand infrastructure code:

Claude Code - Anthropic's agent that excels at large-scale Terraform refactoring and multi-file Kubernetes manifests
Codex - OpenAI's autonomous agent with cloud sandbox execution, strong at IaC from natural language
Cursor - AI-first IDE with inline Terraform/YAML completions
GitHub Copilot - Integrated everywhere, best for inline completions
Cline - Autonomous agent for VS Code that runs terminal commands and edits files

Plus 10 more including Devin, Continue, Amazon Q Developer, Sourcegraph Cody, and Windsurf.

Use case: You describe infrastructure in English ("create an EKS cluster with Karpenter and ALB controller"), the agent writes the Terraform + Kubernetes manifests.

AI-Powered Kubernetes (11 tools)

Tools specifically for Kubernetes cluster management and troubleshooting:

K8sGPT - CNCF Sandbox project that scans clusters for issues and explains them in plain English
HolmesGPT - Agentic troubleshooting combining observability telemetry with LLM reasoning
Robusta - Monitoring platform with AI root cause analysis
Komodor - AI-driven troubleshooting with change tracking and automated remediation

Plus kubectl-ai (Google's natural language kubectl plugin), KAITO (LLM inference operator), and more.

Use case: A pod won't start. K8sGPT analyzes it and says: "ImagePullBackOff: Container image 'nginx:1.25' not found. Check image name and registry credentials."

AI-Powered Terraform and IaC (11 tools)

AI tools that enhance Infrastructure as Code workflows:

Pulumi AI - Generates IaC programs from natural language across AWS/Azure/GCP/Kubernetes
Brainboard - Visual Terraform designer with AI architecture generation from diagrams
Firefly - Detects drift, generates Terraform from existing resources, manages IaC coverage gaps
Infracost - Cloud cost estimates in PRs for 1,100+ resources

Plus AWS Terraform MCP Server, Spacelift AI, Env0, and HashiCorp's Terraform Copilot prompts.

Use case: You draw an architecture diagram. Brainboard generates the Terraform. Infracost estimates the monthly cost ($2,400/month). You review and apply.

AI Incident Response (12 tools)

AI systems that detect, investigate, and remediate production incidents:

HolmesGPT - CNCF Sandbox agentic AI for automated incident investigation
IncidentFox - Open-source AI SRE platform for hypothesis formation and fix suggestions
Rootly - AI-powered incident management with automated timelines and AI-generated postmortems
Shoreline - Converts runbooks into automated remediation executing across fleets

Plus PagerDuty AIOps, Moogsoft, Tracecat, FireHydrant, and BigPanda.

Use case: API latency spikes. HolmesGPT correlates metrics, logs, and traces. Identifies a database connection pool leak. Suggests scaling the pool. Shoreline auto-remediates.

AI Monitoring and Observability (14 tools)

AI-enhanced monitoring, alerting, and observability platforms:

Datadog Bits AI - Natural language metric queries, root cause analysis, automated investigation
Grafana AI - SRE agent for root cause analysis, adaptive telemetry for cost reduction
Dynatrace Davis AI - Causal AI for automated root cause, impact assessment, predictive detection
Metoro Guardian - Combines telemetry + code analysis for accurate RCA and auto-generated fix PRs

Plus New Relic AI, Splunk AI, Coralogix, Chronosphere, Groundcover, and Sumo Logic.

Use case: Memory leak alert fires. Grafana AI analyzes metrics, identifies the leaking service, generates a flamegraph, and suggests code changes.

AI Security Scanning (15 tools)

AI-powered security for infrastructure, containers, and supply chain:

Snyk - DeepCode AI engine scans code, containers, IaC, and AI-generated code in real-time
Wiz - Unifies vulnerability findings with cloud context to prioritize exploitable risks
Checkov - Static analysis for IaC scanning Terraform, CloudFormation, K8s, Helm
Trivy - Comprehensive open-source scanner for containers, IaC, Kubernetes, code

Plus Prisma Cloud, Orca Security, Lacework, Aqua, GitGuardian, Semgrep, and Falco.

Use case: PR with Terraform changes. Checkov scans and finds: "S3 bucket without encryption, security group with 0.0.0.0/0 ingress." Snyk suggests fixes inline.
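
The kind of check in this use case can be approximated in a few lines. This is a minimal sketch, not Checkov's implementation: the resource dictionaries, their field names (`type`, `name`, `encrypted`, `ingress_cidrs`), and the `find_issues` helper are hypothetical stand-ins for parsed Terraform resources.

```python
from typing import Dict, List

# Hypothetical, simplified view of parsed Terraform resources.
# Real plan output has a different and much richer shape.
Resource = Dict[str, object]


def find_issues(resources: List[Resource]) -> List[str]:
    """Return readable findings for two common misconfigurations."""
    findings = []
    for res in resources:
        name = res.get("name", "<unnamed>")
        # Flag buckets that do not declare encryption.
        if res.get("type") == "s3_bucket" and not res.get("encrypted", False):
            findings.append(f"{name}: S3 bucket without encryption")
        # Flag security groups that accept traffic from anywhere.
        if res.get("type") == "security_group":
            if "0.0.0.0/0" in res.get("ingress_cidrs", []):
                findings.append(f"{name}: security group with 0.0.0.0/0 ingress")
    return findings


if __name__ == "__main__":
    sample = [
        {"type": "s3_bucket", "name": "logs", "encrypted": False},
        {"type": "security_group", "name": "web", "ingress_cidrs": ["0.0.0.0/0"]},
    ]
    for finding in find_issues(sample):
        print(finding)
```

A real scanner ships hundreds of such rules and understands provider-specific defaults, which is the reason to reach for the tool rather than hand-rolled checks.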

AI Cost Optimization (10 tools)

AI for cloud cost management, FinOps, and resource optimization:

CAST AI - AI-powered Kubernetes cost optimization with automated rightsizing and spot instance management
Kubecost - Real-time K8s cost monitoring by service, deployment, namespace, container
Turbonomic - IBM AI that continuously optimizes compute, storage, network allocation
Vantage - Cloud cost transparency with AI recommendations across AWS, Azure, GCP, K8s, Datadog

Plus nOps, Spot by NetApp, CloudZero, Anodot, Finout, and OpenCost.

Use case: $12k monthly AWS bill. CAST AI finds overprovisioned pods, switches to spot instances, rightsizes resources. New bill: $4,800/month (60% savings).

MCP Servers for DevOps (12 servers)

Model Context Protocol servers that give AI assistants direct access to DevOps tools:

AWS MCP Servers - Official suite covering Terraform, CDK, CloudFormation, Lambda, S3, CloudWatch, ECS
GitHub MCP Server - Official server for repos, issues, PRs, Actions, code search
Kubernetes MCP Server - kubectl operations, pod management, cluster introspection
Docker MCP Gateway - Docker-maintained server for container management and Compose workflows

Plus Terraform MCP Server (HashiCorp), Cloudflare MCP, Atlassian MCP (Jira/Confluence), Slack MCP, and Sentry MCP.

Use case: You ask Claude: "What pods are failing in production?" The Kubernetes MCP Server runs kubectl get pods --all-namespaces -o wide and returns the results. Claude analyzes and suggests fixes.
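
To make the "which pods are failing" step concrete, here is a rough sketch of filtering pod data once it has been fetched as JSON (for example with `kubectl get pods --all-namespaces -o json`). It is not how any particular MCP server works. The `failing_pods` helper and the `BAD_REASONS` set are assumptions for illustration; the field names follow the Kubernetes Pod status schema.

```python
from typing import Dict, List

# Waiting reasons that usually mean a pod needs attention.
# This set is illustrative, not exhaustive.
BAD_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def failing_pods(pod_list: dict) -> List[Dict[str, str]]:
    """Pick out pods with a container waiting for a known-bad reason."""
    failing = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in BAD_REASONS:
                failing.append({
                    "namespace": meta.get("namespace", "default"),
                    "name": meta.get("name", "<unknown>"),
                    "reason": reason,
                })
                break  # one finding per pod is enough for a summary
    return failing


if __name__ == "__main__":
    # Hand-written sample in the shape of a Kubernetes PodList.
    sample = {
        "items": [
            {
                "metadata": {"name": "api-1", "namespace": "prod"},
                "status": {"containerStatuses": [
                    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                ]},
            },
            {
                "metadata": {"name": "web-1", "namespace": "prod"},
                "status": {"containerStatuses": [{"state": {"running": {}}}]},
            },
        ]
    }
    for pod in failing_pods(sample):
        print(f"{pod['namespace']}/{pod['name']}: {pod['reason']}")
```

The assistant's value is in what happens after this filter: reading events and logs for each flagged pod and proposing a fix.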

AI-Powered CI/CD (10 tools)

AI tools that enhance continuous integration and delivery:

PR-Agent - Auto-describes, reviews, improves, and generates tests for PRs
GitLab Duo - AI across the entire DevSecOps platform with code suggestions, root cause analysis, vulnerability resolution
Harness AIDA - AI Development Assistant for intelligent pipeline creation and failure analysis
Trunk - AI-powered code quality checks, merge queues, flaky test management

Plus Mergify, CircleCI AI, Dagger, ArgoCD, Tekton, and Codefresh.

Use case: PR submitted. PR-Agent auto-generates description, reviews for security issues, suggests improvements, and generates missing unit tests, all before human review.

AI Log Analysis and Debugging (9 tools)

AI tools for log analysis, pattern detection, debugging:

LogAI - Salesforce's open-source toolkit with ML for anomaly detection, clustering, summarization
VectorLog - AI log analysis with natural language querying and pattern recognition
Loki - Log aggregation paired with Grafana AI for intelligent querying

Plus Elasticsearch with ES|QL, Axiom, OpenObserve, Vector, Fluentd, Logstash.

Use case: 500k error logs. LogAI clusters them into 12 patterns. Top pattern: "Database connection timeout" (78% of errors). AI suggests increasing connection pool.
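
The clustering idea can be illustrated with a much simpler stand-in than the ML methods LogAI uses. The sketch below masks numbers so that similar lines collapse into one template, then counts the templates. The `template` and `top_patterns` helpers are hypothetical names, and this is not LogAI's algorithm.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

_NUMBER = re.compile(r"\d+")


def template(line: str) -> str:
    """Collapse a log line into a pattern by masking numeric values."""
    return _NUMBER.sub("<n>", line.strip().lower())


def top_patterns(lines: Iterable[str], k: int = 3) -> List[Tuple[str, int, float]]:
    """Return the k most common patterns as (pattern, count, share of total)."""
    counts = Counter(template(line) for line in lines if line.strip())
    total = sum(counts.values())
    return [(pattern, n, n / total) for pattern, n in counts.most_common(k)]


if __name__ == "__main__":
    logs = [
        "Database connection timeout after 30s",
        "Database connection timeout after 45s",
        "Disk full on node 7",
    ]
    for pattern, n, share in top_patterns(logs):
        print(f"{share:.0%}  {n:>3}  {pattern}")
```

Number masking only groups lines that differ in numeric fields. Real log clustering also has to handle IDs, hostnames, and free-text variation, which is where the ML-based toolkits earn their keep.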

Categories I Didn't Even List Above

The repository also covers:

AI Agent Frameworks (8 tools) - LangChain, CrewAI, AutoGen, etc.
AI for Platform Engineering (6 tools) - Port, Backstage, Kratix
AI for Database Operations (5 tools) - EverSQL, Metis, pganalyze
AI for Networking (4 tools) - Cilium, Calico Enterprise
AI for Chaos Engineering (3 tools) - Gremlin, LitmusChaos
AI for GitOps (5 tools) - Fleet, Flux, Rancher
System Prompt Templates (repositories of CLAUDE.md templates)
Learning Resources (tutorials, courses, certifications)
Community and Newsletters (where to stay updated)

Total: 219 tools across 20 categories

Why "Awesome"?

The list follows the Awesome List guidelines, which means:

Curated, not comprehensive - Every tool was evaluated for quality and usefulness
No dead links - Automated link checking ensures everything works
No sponsorships - Tools are included because they're good, not because someone paid
Community-driven - Anyone can submit PRs to add or update tools
Actively maintained - Updated monthly with new tools and category changes

The repository has:

Awesome List badge
Automated link checking (GitHub Actions)
Awesome lint validation
GitHub Pages site: hammadhaqqani.github.io/awesome-devops-ai/

Real Use Cases from Production

Here's how I've used tools from this list in the past 3 months:

Use Case 1: Kubernetes Troubleshooting

Problem: 15 pods failing to start or stay running. Support tickets piling up.

Tools used:

K8sGPT - Scanned cluster, identified 3 root causes:

8 pods: ImagePullBackOff (typo in image tag)
5 pods: OOMKilled (memory limits too low)
2 pods: ConfigMap missing (deleted by accident)

Robusta - Auto-created tickets with context for each issue
kubectl-ai - "Fix all ImagePullBackOff issues in namespace prod" → Generated and applied patches

Time to resolve: 45 minutes → 5 minutes

Use Case 2: Terraform Refactoring

Problem: 200-file Terraform monorepo. Need to extract VPC module. Manual refactoring would take days.

Tools used:

Claude Code - Analyzed entire repo, identified VPC resources across 47 files
Claude Code - Extracted to module, updated all references, generated README
Infracost - Verified no cost changes from refactoring
Checkov - Scanned for security issues introduced during refactoring

Time spent: 3 days → 4 hours

Use Case 3: Incident Response

Problem: API latency spiked from 50ms to 2,500ms. Customers complaining.

Tools used:

HolmesGPT - Correlated metrics, logs, traces. Hypothesis: "Database connection pool exhausted"
Datadog Bits AI - Confirmed database connections at 100/100
Shoreline - Auto-remediated: increased connection pool to 200, restarted pods
Rootly - Generated postmortem draft with timeline and root cause

Time to resolve: 3 minutes (fully automated)

Use Case 4: Cost Optimization

Problem: Monthly AWS bill: $18k. CFO wants 30% reduction.

Tools used:

CAST AI - Analyzed Kubernetes cluster, found massive overprovisioning
Kubecost - Identified top 10 expensive services
Vantage - Found $4k in unused RDS snapshots, abandoned EBS volumes
CAST AI - Implemented recommendations (rightsizing, spot instances, autoscaling)

Result: Monthly bill reduced to $11,200 (38% savings, exceeded target)
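
The savings figure is easy to check. A small sketch, using the bill amounts from this use case; the `savings_percent` helper name is just for illustration:

```python
def savings_percent(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


if __name__ == "__main__":
    before, after, target = 18_000, 11_200, 30
    pct = savings_percent(before, after)
    print(f"Saved ${before - after:,} per month ({pct:.1f}%)")
    print("Target met" if pct >= target else "Target missed")
```

With $18,000 before and $11,200 after, the reduction is $6,800, or about 37.8%, which rounds to the 38% quoted above.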

How to Use This List

Strategy 1: Start with Your Biggest Pain Point

Find the category that matches your problem:

Pods crashing → AI-Powered Kubernetes
Terraform taking too long → AI Coding Agents
Incidents taking hours to resolve → AI Incident Response
Cloud bill too high → AI Cost Optimization
Security vulnerabilities → AI Security Scanning

Pick 2-3 tools from that category. Test them. Keep what works.

Strategy 2: Stack Multiple Tools

Best results come from combining tools:

Example stack for incident response:

Alert fires (PagerDuty)
 ↓
HolmesGPT investigates
 ↓
Datadog Bits AI correlates metrics
 ↓
Shoreline auto-remediates
 ↓
Rootly generates postmortem
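
The stack above is essentially a pipeline in which each stage enriches a shared incident record. The sketch below shows that shape only. Every function is a hypothetical placeholder that returns canned data, and none of them call the real PagerDuty, HolmesGPT, Datadog, Shoreline, or Rootly APIs.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Stage = Callable[[Context], Context]


# Placeholder stages: each returns a copy of the record with one new field.
def investigate(ctx: Context) -> Context:
    return {**ctx, "hypothesis": "database connection pool exhausted"}


def correlate(ctx: Context) -> Context:
    return {**ctx, "evidence": "connections at 100/100"}


def remediate(ctx: Context) -> Context:
    return {**ctx, "action": "raised pool size, restarted pods"}


def write_postmortem(ctx: Context) -> Context:
    summary = f"{ctx['alert']}: {ctx['hypothesis']} ({ctx['evidence']})"
    return {**ctx, "postmortem": summary}


def run_pipeline(alert: Context, stages: List[Stage]) -> Context:
    """Pass the incident record through each stage and log the order."""
    ctx: Context = {**alert, "trail": []}
    for stage in stages:
        ctx = stage(ctx)
        ctx["trail"].append(stage.__name__)
    return ctx


if __name__ == "__main__":
    result = run_pipeline(
        {"alert": "API latency high"},
        [investigate, correlate, remediate, write_postmortem],
    )
    print(" -> ".join(result["trail"]))
    print(result["postmortem"])
```

Keeping each stage as a plain function that takes and returns the record makes it straightforward to swap one tool for another without touching the rest of the chain.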

Strategy 3: Use MCP Servers for AI Assistants

If you use Claude, ChatGPT, or Cursor, install MCP servers to give them superpowers:

AWS MCP Server - Claude can read CloudFormation templates, query Lambda logs
Kubernetes MCP Server - Claude can run kubectl commands, analyze pods
GitHub MCP Server - Claude can search code, create issues, review PRs
Terraform MCP Server - Claude can search modules, check documentation

Result: Your AI assistant becomes an actual DevOps engineer.

Contributing

This is a community project. If you:

Know a tool that's missing
Found a broken link
Want to add a category
Have a better description

Open a PR or create an issue.

Guidelines:

Tool must be production-ready (no vaporware)
Must have documentation
Must be actively maintained (updated in last 6 months)
No affiliate links or sponsorships

Stay Updated

The DevOps AI landscape changes weekly. New tools launch. Old tools shut down. Categories emerge.

Ways to stay current:

Star the repo - Get notified of updates: github.com/hammadhaqqani/awesome-devops-ai
Watch releases - GitHub notifications when major updates happen
Subscribe to my newsletter - Monthly DevOps AI roundup at hammadhaqqani.com
Follow on X - @hammadhaqqani for real-time updates

Tools I Built (Also on the List)

I practice what I preach. Here are my contributions to the DevOps AI ecosystem:

Free Browser-Based Tools (71 total)

All free, no signup, run entirely in your browser:

AI & LLM Tools:

Token Counter - Estimate costs before API calls (25 models)
Prompt Injection Scanner - Detect 18 injection attack patterns
Context Window Visualizer - See how prompts fill model contexts
Fine-Tuning Data Formatter - Convert CSV (or manually-edited rows) to provider-specific JSONL for training

DevOps Tools:

YAML Validator - Validate K8s manifests, GitHub Actions
Docker Run to Compose - Convert commands to docker-compose.yml
Cron Expression Builder - Visual builder with presets
CIDR Calculator - Network calculations

See all 86 tools

Open Source Projects

Claude Code DevOps Toolkit - CLAUDE.md templates, prompts, automation scripts
Kubernetes GitOps Blueprint - Production K8s with ArgoCD, Helm, Kustomize
Terraform AWS Self-Healing Infra - Auto-remediation with CloudWatch + Lambda
DevOps Interview Handbook - Comprehensive interview prep

All listed in the Awesome DevOps AI repository.

The Future of DevOps AI

Three trends I'm tracking:

1. Fully Autonomous SRE Agents

We're moving from "AI-assisted" to "AI-autonomous":

Today: AI suggests fixes, you apply them
2027: AI detects, investigates, remediates, documents, all without human intervention

Tools like HolmesGPT, Shoreline, and IncidentFox are early versions of this.

2. MCP Servers Everywhere

Model Context Protocol is becoming the standard for AI-tool integration:

Every DevOps tool will have an MCP server
AI assistants will orchestrate entire workflows across tools
No more copying outputs between terminals

AWS, GitHub, Docker, HashiCorp, and Cloudflare already have official MCP servers.

3. AI-First Infrastructure

New infrastructure will be designed for AI from day one:

Kubernetes operators that self-tune based on AI recommendations
Terraform modules that auto-optimize based on cost/performance models
CI/CD pipelines that self-heal based on failure patterns

This is already happening with CAST AI for Kubernetes and Harness AIDA for CI/CD.

What to Do Next

Star the Awesome DevOps AI repository
Pick your biggest pain point (K8s, Terraform, incidents, cost, security)
Try 2-3 tools from the relevant category
Share feedback - Open an issue or PR if you find something useful
Spread the word - Help other engineers discover these tools

The faster we adopt AI in DevOps, the less time we spend firefighting and the more time we spend building.

Repository: github.com/hammadhaqqani/awesome-devops-ai

Website: hammadhaqqani.github.io/awesome-devops-ai/

Questions about any of these tools? Email me at phaqqani@gmail.com or find me on LinkedIn.

The DevOps AI revolution is happening now. This list is your map.


March 11, 2026AI + Cloud13 min read

ICurated219AIToolsforDevOpsEngineers(AndYouCanUseThemAllforFree)

DevOpsAIAutomationToolsKubernetesTerraformCI/CDInfrastructureSREPlatform EngineeringOpen Source

Over the past 6 months, I've watched the DevOps AI landscape explode from a handful of experimental tools to an overwhelming ecosystem of 500+ products, agents, and frameworks.

So I did it for you.

If you're using AI for infrastructure, this is your new homepage.

Star the repo here

Why This List Exists

The problem with AI tooling in 2026 is discoverability, not availability.

There are AI tools for:

Writing Terraform (12+ tools)
Kubernetes troubleshooting (15+ tools)
Incident response (20+ tools)
Cost optimization (10+ tools)
Log analysis (18+ tools)

But they're scattered across GitHub, Product Hunt, vendor marketing sites, and random blog posts. You either stumble across them by accident or you don't know they exist.

Awesome DevOps AI solves this. One list. 20 categories. 219 tools. Updated monthly.

What's Inside

AI Coding Agents for Infrastructure (15 tools)

The first category covers AI coding assistants that understand infrastructure code:

Claude Code - Anthropic's agent that excels at large-scale Terraform refactoring and multi-file Kubernetes manifests
Codex - OpenAI's autonomous agent with cloud sandbox execution, strong at IaC from natural language
Cursor - AI-first IDE with inline Terraform/YAML completions
GitHub Copilot - Integrated everywhere, best for inline completions
Cline - Autonomous agent for VS Code that runs terminal commands and edits files

Plus 10 more including Devin, Continue, Amazon Q Developer, Sourcegraph Cody, and Windsurf.

Use case: You describe infrastructure in English ("create an EKS cluster with Karpenter and ALB controller"), the agent writes the Terraform + Kubernetes manifests.

AI-Powered Kubernetes (11 tools)

Tools specifically for Kubernetes cluster management and troubleshooting:

K8sGPT - CNCF Sandbox project that scans clusters for issues and explains them in plain English
HolmesGPT - Agentic troubleshooting combining observability telemetry with LLM reasoning
Robusta - Monitoring platform with AI root cause analysis
Komodor - AI-driven troubleshooting with change tracking and automated remediation

Plus kubectl-ai (Google's natural language kubectl plugin), KAITO (LLM inference operator), and more.

Use case: A pod is CrashLoopBackOff. K8sGPT analyzes it and says: "ImagePullBackOff: Container image 'nginx:1.25' not found. Check image name and registry credentials."

AI-Powered Terraform and IaC (11 tools)

AI tools that enhance Infrastructure as Code workflows:

Pulumi AI - Generates IaC programs from natural language across AWS/Azure/GCP/Kubernetes
Brainboard - Visual Terraform designer with AI architecture generation from diagrams
Firefly - Detects drift, generates Terraform from existing resources, manages IaC coverage gaps
Infracost - Cloud cost estimates in PRs for 1,100+ resources

Plus AWS Terraform MCP Server, Spacelift AI, Env0, and HashiCorp's Terraform Copilot prompts.

Use case: You draw an architecture diagram. Brainboard generates the Terraform. Infracost estimates the monthly cost ($2,400/month). You review and apply.

AI Incident Response (12 tools)

AI systems that detect, investigate, and remediate production incidents:

HolmesGPT - CNCF Sandbox agentic AI for automated incident investigation
IncidentFox - Open-source AI SRE platform for hypothesis formation and fix suggestions
Rootly - AI-powered incident management with automated timelines and AI-generated postmortems
Shoreline - Converts runbooks into automated remediation executing across fleets

Plus PagerDuty AIOps, Moogsoft, Tracecat, FireHydrant, and BigPanda.

Use case: API latency spikes. HolmesGPT correlates metrics, logs, and traces. Identifies a database connection pool leak. Suggests scaling the pool. Shoreline auto-remediates.

AI Monitoring and Observability (14 tools)

AI-enhanced monitoring, alerting, and observability platforms:

Datadog Bits AI - Natural language metric queries, root cause analysis, automated investigation
Grafana AI - SRE agent for root cause analysis, adaptive telemetry for cost reduction
Dynatrace Davis AI - Causal AI for automated root cause, impact assessment, predictive detection
Metoro Guardian - Combines telemetry + code analysis for accurate RCA and auto-generated fix PRs

Plus New Relic AI, Splunk AI, Coralogix, Chronosphere, Groundcover, and Sumo Logic.

Use case: Memory leak alert fires. Grafana AI analyzes metrics, identifies the leaking service, generates a flamegraph, and suggests code changes.

AI Security Scanning (15 tools)

AI-powered security for infrastructure, containers, and supply chain:

Snyk - DeepCode AI engine scans code, containers, IaC, and AI-generated code in real-time
Wiz - Unifies vulnerability findings with cloud context to prioritize exploitable risks
Checkov - Static analysis for IaC scanning Terraform, CloudFormation, K8s, Helm
Trivy - Comprehensive open-source scanner for containers, IaC, Kubernetes, code

Plus Prisma Cloud, Orca Security, Lacework, Aqua, GitGuardian, Semgrep, and Falco.

Use case: PR with Terraform changes. Checkov scans and finds: "S3 bucket without encryption, security group with 0.0.0.0/0 ingress." Snyk suggests fixes inline.

AI Cost Optimization (10 tools)

AI for cloud cost management, FinOps, and resource optimization:

CAST AI - AI-powered Kubernetes cost optimization with automated rightsizing and spot instance management
Kubecost - Real-time K8s cost monitoring by service, deployment, namespace, container
Turbonomic - IBM AI that continuously optimizes compute, storage, network allocation
Vantage - Cloud cost transparency with AI recommendations across AWS, Azure, GCP, K8s, Datadog

Plus nOps, Spot by NetApp, CloudZero, Anodot, Finout, and OpenCost.

Use case:$12k monthly AWS bill. CAST AI finds overprovisioned pods, switches to spot instances, rightsizes resources. New bill: $4,800/month (60% savings).

MCP Servers for DevOps (12 servers)

Model Context Protocol servers that give AI assistants direct access to DevOps tools:

AWS MCP Servers - Official suite covering Terraform, CDK, CloudFormation, Lambda, S3, CloudWatch, ECS
GitHub MCP Server - Official server for repos, issues, PRs, Actions, code search
Kubernetes MCP Server - kubectl operations, pod management, cluster introspection
Docker MCP Gateway - Docker-maintained server for container management and Compose workflows

Plus Terraform MCP Server (HashiCorp), Cloudflare MCP, Atlassian MCP (Jira/Confluence), Slack MCP, and Sentry MCP.

AI-Powered CI/CD (10 tools)

AI tools that enhance continuous integration and delivery:

PR-Agent - Auto-describes, reviews, improves, and generates tests for PRs
GitLab Duo - AI across the entire DevSecOps platform with code suggestions, root cause analysis, vulnerability resolution
Harness AIDA - AI Development Assistant for intelligent pipeline creation and failure analysis
Trunk - AI-powered code quality checks, merge queues, flaky test management

Plus Mergify, CircleCI AI, Dagger, ArgoCD, Tekton, and Codefresh.

Use case: PR submitted. PR-Agent auto-generates description, reviews for security issues, suggests improvements, and generates missing unit tests, all before human review.

AI Log Analysis and Debugging (9 tools)

AI tools for log analysis, pattern detection, debugging:

LogAI - Salesforce's open-source toolkit with ML for anomaly detection, clustering, summarization
VectorLog - AI log analysis with natural language querying and pattern recognition
Loki - Log aggregation paired with Grafana AI for intelligent querying

Plus Elasticsearch with ES|QL, Axiom, OpenObserve, Vector, Fluentd, Logstash.

Use case: 500k error logs. LogAI clusters them into 12 patterns. Top pattern: "Database connection timeout" (78% of errors). AI suggests increasing connection pool.

Categories I Didn't Even List Above

The repository also covers:

AI Agent Frameworks (8 tools) - LangChain, CrewAI, AutoGen, etc.
AI for Platform Engineering (6 tools) - Port, Backstage, Kratix
AI for Database Operations (5 tools) - EverSQL, Metis, pganalyze
AI for Networking (4 tools) - Cilium, Calico Enterprise
AI for Chaos Engineering (3 tools) - Gremlin, LitmusChaos
AI for GitOps (5 tools) - Fleet, Flux, Rancher
System Prompt Templates (repositories of CLAUDE.md templates)
Learning Resources (tutorials, courses, certifications)
Community and Newsletters (where to stay updated)

Total: 219 tools across 20 categories

Why "Awesome"?

The list follows the Awesome List guidelines, which means:

Curated, not comprehensive - Every tool was evaluated for quality and usefulness
No dead links - Automated link checking ensures everything works
No sponsorships - Tools are included because they're good, not because someone paid
Community-driven - Anyone can submit PRs to add or update tools
Actively maintained - Updated monthly with new tools and category changes

The repository has:

Awesome List badge
Automated link checking (GitHub Actions)
Awesome lint validation
GitHub Pages site: hammadhaqqani.github.io/awesome-devops-ai/

Real Use Cases from Production

Here's how I've used tools from this list in the past 3 months:

Use Case 1: Kubernetes Troubleshooting

Problem: 15 pods stuck in CrashLoopBackOff. Support tickets piling up.

Tools used:

K8sGPT - Scanned cluster, identified 3 root causes:

8 pods: ImagePullBackOff (typo in image tag)
5 pods: OOMKilled (memory limits too low)
2 pods: ConfigMap missing (deleted by accident)

Robusta - Auto-created tickets with context for each issue
kubectl-ai - "Fix all ImagePullBackOff issues in namespace prod" → Generated and applied patches

Time saved: 45 minutes → 5 minutes

Use Case 2: Terraform Refactoring

Problem: 200-file Terraform monorepo. Need to extract VPC module. Manual refactoring would take days.

Tools used:

Claude Code - Analyzed entire repo, identified VPC resources across 47 files
Claude Code - Extracted to module, updated all references, generated README
Infracost - Verified no cost changes from refactoring
Checkov - Scanned for security issues introduced during refactoring

Time saved: 3 days → 4 hours

Use Case 3: Incident Response

Problem: API latency spiked from 50ms to 2,500ms. Customers complaining.

Tools used:

HolmesGPT - Correlated metrics, logs, traces. Hypothesis: "Database connection pool exhausted"
Datadog Bits AI - Confirmed database connections at 100/100
Shoreline - Auto-remediated: increased connection pool to 200, restarted pods
Rootly - Generated postmortem draft with timeline and root cause

Time to resolve: 3 minutes (fully automated)

Use Case 4: Cost Optimization

Problem: Monthly AWS bill: $18k. CFO wants 30% reduction.

Tools used:

CAST AI - Analyzed Kubernetes cluster, found massive overprovisioning
Kubecost - Identified top 10 expensive services
Vantage - Found $4k in unused RDS snapshots, abandoned EBS volumes
CAST AI - Implemented recommendations (rightsizing, spot instances, autoscaling)

Result: Monthly bill reduced to $11,200 (38% savings, exceeded target)

How to Use This List

Strategy 1: Start with Your Biggest Pain Point

Find the category that matches your problem:

Pods crashing → AI-Powered Kubernetes
Terraform taking too long → AI Coding Agents
Incidents taking hours to resolve → AI Incident Response
Cloud bill too high → AI Cost Optimization
Security vulnerabilities → AI Security Scanning

Pick 2-3 tools from that category. Test them. Keep what works.

Strategy 2: Stack Multiple Tools

Best results come from combining tools:

Example stack for incident response:

Alert fires (PagerDuty)
 ↓
HolmesGPT investigates
 ↓
Datadog Bits AI correlates metrics
 ↓
Shoreline auto-remediates
 ↓
Rootly generates postmortem

Strategy 3: Use MCP Servers for AI Assistants

If you use Claude, ChatGPT, or Cursor, install MCP servers to give them superpowers:

AWS MCP Server - Claude can read CloudFormation templates, query Lambda logs
Kubernetes MCP Server - Claude can run kubectl commands, analyze pods
GitHub MCP Server - Claude can search code, create issues, review PRs
Terraform MCP Server - Claude can search modules, check documentation

Result: Your AI assistant becomes an actual DevOps engineer.

Contributing

This is a community project. If you:

Know a tool that's missing
Found a broken link
Want to add a category
Have a better description

Open a PR or create an issue

Guidelines:

Tool must be production-ready (no vaporware)
Must have documentation
Must be actively maintained (updated in last 6 months)
No affiliate links or sponsorships

Stay Updated

The DevOps AI landscape changes weekly. New tools launch. Old tools shut down. Categories emerge.

Ways to stay current:

Star the repo - Get notified of updates: github.com/hammadhaqqani/awesome-devops-ai
Watch releases - GitHub notifications when major updates happen
Subscribe to my newsletter - Monthly DevOps AI roundup at hammadhaqqani.com
Follow on X - @hammadhaqqani for real-time updates

Tools I Built (Also on the List)

I practice what I preach. Here are my contributions to the DevOps AI ecosystem:

Free Browser-Based Tools (71 total)

All free, no signup, run entirely in your browser:

AI & LLM Tools:

Token Counter - Estimate costs before API calls (25 models)
Prompt Injection Scanner - Detect 18 injection attack patterns
Context Window Visualizer - See how prompts fill model contexts
Fine-Tuning Data Formatter - Convert CSV (or manually-edited rows) to provider-specific JSONL for training

DevOps Tools:

YAML Validator - Validate K8s manifests, GitHub Actions
Docker Run to Compose - Convert commands to docker-compose.yml
Cron Expression Builder - Visual builder with presets
CIDR Calculator - Network calculations

See all 86 tools

Open Source Projects

Claude Code DevOps Toolkit - CLAUDE.md templates, prompts, automation scripts
Kubernetes GitOps Blueprint - Production K8s with ArgoCD, Helm, Kustomize
Terraform AWS Self-Healing Infra - Auto-remediation with CloudWatch + Lambda
DevOps Interview Handbook - Comprehensive interview prep

All listed in the Awesome DevOps AI repository.

The Future of DevOps AI

Three trends I'm tracking:

1. Fully Autonomous SRE Agents

We're moving from "AI-assisted" to "AI-autonomous":

Today: AI suggests fixes, you apply them
2027: AI detects, investigates, remediates, documents, all without human intervention

Tools like HolmesGPT, Shoreline, and IncidentFox are early versions of this.

2. MCP Servers Everywhere

Model Context Protocol is becoming the standard for AI-tool integration:

Every DevOps tool will have an MCP server
AI assistants will orchestrate entire workflows across tools
No more copying outputs between terminals

AWS, GitHub, Docker, HashiCorp, and Cloudflare already have official MCP servers.

3. AI-First Infrastructure

New infrastructure will be designed for AI from day one:

Kubernetes operators that self-tune based on AI recommendations
Terraform modules that auto-optimize based on cost/performance models
CI/CD pipelines that self-heal based on failure patterns

This is already happening with CAST AI for Kubernetes and Harness AIDA for CI/CD.
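
As a rough illustration of "self-heal based on failure patterns", the simplest version is a rule that retries failures that look transient and fails fast on everything else. This is a hand-written heuristic under assumed patterns, not how CAST AI or Harness AIDA work; `TRANSIENT_PATTERNS` and `decide` are names invented for this sketch.

```python
import re

# Hypothetical patterns for failures that usually pass on a re-run.
TRANSIENT_PATTERNS = [
    re.compile(r"connection reset by peer", re.IGNORECASE),
    re.compile(r"timed? ?out", re.IGNORECASE),  # "timeout", "timed out"
    re.compile(r"429 too many requests", re.IGNORECASE),
]


def decide(log: str, attempt: int, max_retries: int = 2) -> str:
    """Return "retry" for transient failures within budget, else "fail"."""
    transient = any(p.search(log) for p in TRANSIENT_PATTERNS)
    if transient and attempt < max_retries:
        return "retry"
    return "fail"


print(decide("npm ERR! network timeout", attempt=0))
print(decide("AssertionError: expected 2, got 3", attempt=0))
print(decide("read: connection reset by peer", attempt=2))
```

An AI-driven pipeline would learn these patterns from build history instead of hard-coding them, but the decision it feeds is the same.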

What to Do Next

Star the Awesome DevOps AI repository
Pick your biggest pain point (K8s, Terraform, incidents, cost, security)
Try 2-3 tools from the relevant category
Share feedback - Open an issue or PR if you find something useful
Spread the word - Help other engineers discover these tools

The faster we adopt AI in DevOps, the less time we spend firefighting and the more time we spend building.

Repository: github.com/hammadhaqqani/awesome-devops-ai

Website: hammadhaqqani.github.io/awesome-devops-ai/

Questions about any of these tools? Email me at phaqqani@gmail.com or find me on LinkedIn.

The DevOps AI revolution is happening now. This list is your map.
