You've probably tried using ChatGPT or Claude to write Terraform. Maybe you pasted an error message and got back confident-looking code that made things worse. Or you asked it to "create an AWS VPC" and got a configuration that would work perfectly in a tutorial but fail spectacularly in production.
I've been using AI coding assistants for infrastructure work every day for the past 18 months. I've seen what works, what fails catastrophically, and what separates productive AI-assisted infrastructure engineering from expensive disasters.
This is the guide I wish existed when I started. No hype. No "AI will replace DevOps" nonsense. Just the practical patterns that let you ship infrastructure faster without breaking things.
The Problem: Why Most DevOps Engineers Give Up on AI Assistants
Three things happen when engineers first try AI for infrastructure code:
1. The code looks perfect but has invisible landmines
You ask Claude to write an EKS cluster configuration. It returns beautiful Terraform with detailed comments. You apply it. The cluster creates successfully. Three weeks later your pods can't talk to each other because the networking configuration assumed a default VPC that doesn't exist in production.
2. The AI confidently suggests deprecated patterns
ChatGPT recommends using AWS Classic Load Balancers when ALBs have been the standard for years. Or suggests Terraform syntax from version 0.11 when you're on 1.7. The code technically works but sets you back months in maintainability.
3. Zero understanding of your environment
You ask for a Kubernetes deployment manifest. The AI has no idea you're running on GKE with Workload Identity, need specific labels for your service mesh, or have naming conventions for every resource. The output is generic and useless.
Most engineers try it once, get burned, and go back to doing everything manually.
The problem isn't that AI assistants are bad at infrastructure code. The problem is that infrastructure code has invisible requirements that documentation-trained models can't infer from a vague prompt.
What Works: The Three-Layer Strategy
After 18 months of daily use across AWS, Azure, Terraform, Kubernetes, CloudFormation, and CDK, here's what actually works:
Layer 1: Context Transfer (The 80% Solution)
Before asking AI for anything infrastructure-related, you need to transfer three pieces of context:
Your existing patterns
Instead of:
"Write Terraform for an S3 bucket"
Do this:
"Write Terraform for an S3 bucket following the pattern in this existing module:
[paste your team's existing S3 module]
Match the same:
- Naming convention (project-environment-purpose)
- Tagging structure (Owner, Environment, CostCenter, ManagedBy)
- Encryption defaults (aws:kms with our standard key)
- Lifecycle rules (transition to IA after 30 days)
- Versioning and replication config
"
Your constraints
AI models default to "best practices" from the internet. Your production environment has constraints that trump best practices:
"Write a Kubernetes deployment manifest for our API service.
Constraints:
- Must use our base image (ecr.io/company/base-api:v2.3)
- Resource limits: 500m CPU, 1Gi memory (hard limit, finance approval required to exceed)
- Must have readiness and liveness probes (SRE requirement)
- Service mesh injection: enabled (label: istio-injection=enabled)
- Must use our naming convention: {team}-{service}-{environment}
- Anti-affinity rules: spread across 3 AZs
- Secret management: External Secrets Operator pulling from AWS Secrets Manager
"
Your current state
If you're modifying existing infrastructure, the AI needs to know what's already there:
"I need to add an egress rule to this security group:
[paste current terraform state or aws ec2 describe-security-groups output]
Add egress to 10.50.0.0/16 on port 5432 for PostgreSQL.
Important:
- Don't modify existing rules
- Follow the description pattern: 'Allow {protocol} to {destination} for {purpose}'
- Keep the rules in alphabetical order by description
"
This takes 60 seconds of setup but prevents hours of rework.
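If you do this setup several times a day, the three context pieces can be assembled mechanically. A minimal Python sketch (the helper name and section headers are my own illustration, not a standard API):

```python
def build_context_prompt(task, patterns="", constraints="", current_state=""):
    """Assemble an infrastructure prompt from the three context pieces.

    Empty pieces are omitted so the prompt stays as short as possible.
    """
    sections = [task]
    if patterns:
        sections.append("Existing patterns to match:\n" + patterns)
    if constraints:
        sections.append("Constraints:\n" + constraints)
    if current_state:
        sections.append("Current state (do not modify unrelated resources):\n"
                        + current_state)
    return "\n\n".join(sections)

prompt = build_context_prompt(
    "Write Terraform for an S3 bucket.",
    patterns="Naming: project-environment-purpose\nTags: Owner, Environment, CostCenter",
    constraints="Encryption: aws:kms with our standard key",
)
print(prompt)
```

Pipe in your module file and plan output instead of typing them; the structure is what matters, not the helper.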
Layer 2: Validation Before Apply (The Safety Net)
Never run AI-generated infrastructure code without these three checks:
1. Plan Review (Human + AI)
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
Then ask the AI to review its own output:
"Review this Terraform plan for issues:
[paste plan.json]
Check for:
- Unintended resource deletions or replacements
- Changes to production databases or stateful resources
- Security group rules that are too permissive
- Missing required tags
- Resources without explicit dependencies that might cause race conditions
"
I've caught 3 production-breaking changes in the past month this way. The AI is surprisingly good at spotting its own mistakes when shown the plan.
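Part of that review can be mechanized before the AI ever sees the plan. Terraform's JSON plan lists each resource change with an `actions` array (`["delete"]` for a destroy, `["delete", "create"]` for a replacement), so a short script can flag the dangerous ones. A hedged Python sketch:

```python
import json

def flag_risky_changes(plan_json):
    """Return (address, kind) pairs for planned destroys and replacements.

    plan_json is the dict parsed from `terraform show -json plan.tfplan`.
    """
    risky = []
    for change in plan_json.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            kind = "replace" if "create" in actions else "destroy"
            risky.append((change["address"], kind))
    return risky

# Tiny stand-in shaped like Terraform's real JSON plan output.
sample = {
    "resource_changes": [
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["update"]}},
    ]
}
print(flag_risky_changes(sample))  # [('aws_db_instance.main', 'replace')]
```

Run it in CI and fail the pipeline if anything stateful shows up; save the AI review for the judgment calls the script can't make.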
2. Compliance Scanning
Use tools like Checkov, tfsec, or Terrascan on AI-generated code:
checkov -f main.tf --framework terraform
Most AI models don't know your organization's compliance requirements. Automated scanning catches:
- Unencrypted storage
- Public S3 buckets
- Security groups with 0.0.0.0/0 ingress
- Missing backup configurations
- Overly permissive IAM policies
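To make the open-ingress check concrete, here is a hedged Python sketch of the kind of rule these scanners encode (real tools like Checkov do this far more thoroughly; the input shape mirrors `aws ec2 describe-security-groups` output):

```python
def open_ingress_rules(security_group):
    """Return the ports of ingress rules open to the whole internet."""
    flagged = []
    for rule in security_group.get("IpPermissions", []):
        ranges = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        ranges += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        # 0.0.0.0/0 (IPv4) or ::/0 (IPv6) means "from anywhere"
        if "0.0.0.0/0" in ranges or "::/0" in ranges:
            flagged.append(rule.get("FromPort"))
    return flagged

sg = {"IpPermissions": [
    {"FromPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
    {"FromPort": 5432, "IpRanges": [{"CidrIp": "10.50.0.0/16"}], "Ipv6Ranges": []},
]}
print(open_ingress_rules(sg))  # [443] -- only the world-open rule is flagged
```

Don't reimplement scanners yourself; the point is that the rules are simple enough to automate, so there's no excuse to skip them.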
3. Targeted Questions
After getting code from an AI, ask follow-up questions about edge cases:
"This Terraform creates an RDS instance. What happens if:
1. The instance fails and needs to be replaced?
2. We need to restore from a snapshot?
3. We want to upgrade the engine version?
4. We need to change the instance class?
Will we lose data? Are there manual steps?"
This exposes assumptions the AI made that don't match your requirements.
Layer 3: Incremental Rollout (The Undo Button)
AI-generated infrastructure changes should follow the same rollout discipline as application code:
Development → Staging → Production
Apply AI-generated Terraform to dev first. Let it run for 24 hours. Check CloudWatch metrics, logs, costs. Only then promote to staging.
Feature Flags for Infrastructure
Use Terraform workspaces or separate state files:
resource "aws_instance" "app" {
  instance_type = var.use_ai_generated_config ? "t3.large" : "t3.medium"
  # AI suggested t3.large for better performance
}
You can toggle back instantly if something breaks.
Version Control Everything
Every AI-generated change goes through a PR with:
- Before/after plan output
- Explanation of what changed and why
- Screenshot of AI conversation (for context)
- Checkov scan results
This creates an audit trail and lets your team review before merging.
Prompt Patterns That Actually Work
Here are the prompt structures I use daily:
Pattern 1: The "Existing Module Extension"
I have this Terraform module for creating ECS services:
[paste module]
I need to extend it to support:
- Blue/green deployments
- Circuit breaker configuration
- CloudWatch Container Insights
Extend the module maintaining:
- Same variable naming convention (use_underscores)
- Same output structure (prefix with module name)
- Backward compatibility (new features off by default)
Show me:
1. Updated variables.tf
2. Updated main.tf with new resources
3. Updated outputs.tf
4. Example usage in a root module
Pattern 2: The "Debug This Error"
I'm getting this Terraform error:
[paste full error output]
Here's the relevant configuration:
[paste terraform code]
Here's the current state:
[paste terraform state show output for affected resource]
What's wrong and how do I fix it without destroying existing resources?
Pattern 3: The "Compliance Review"
Review this Terraform configuration for security and compliance issues:
[paste code]
Our requirements:
- All storage must be encrypted at rest
- No public internet access except through approved load balancers
- All resources must have Owner, Environment, and CostCenter tags
- IAM policies follow least privilege
- Secrets must use AWS Secrets Manager, never hardcoded
Flag anything that violates these rules and suggest fixes.
Pattern 4: The "Migration Plan"
I need to migrate this infrastructure from [old pattern] to [new pattern]:
Current:
[paste current terraform]
Target:
[describe desired end state]
Constraints:
- Zero downtime required
- Database cannot be recreated (contains production data)
- Must maintain existing DNS records
- Budget: can't exceed $500/month increase
Give me a step-by-step migration plan with:
1. What to create first
2. What to migrate
3. What to destroy last
4. Rollback plan if something fails
Real Examples: What I Built This Week
Example 1: EKS Cluster with AI Assistance
What I needed: Production-ready EKS cluster with VPC, IAM roles, node groups, and add-ons.
What I did:
- Context dump: Pasted our existing VPC module, naming conventions, and tagging requirements
- Specific ask: "Create EKS cluster Terraform following these patterns"
- Iteration: Asked AI to add Karpenter, AWS Load Balancer Controller, and metrics-server
- Validation: Ran checkov, caught missing encryption on EBS volumes, asked AI to fix
- Testing: Applied to dev, tested pod networking, service exposure, autoscaling
- Result: Deployed to production in 3 days instead of 2 weeks
Time saved: 8-10 days
Confidence level: High (because I validated every layer)
Example 2: Kubernetes YAML Debugging
What I needed: Fix a CrashLoopBackOff on a deployment that worked locally but failed in production.
What I did:
Prompt: "This deployment is CrashLoopBackOff in production but works locally:
[paste deployment YAML]
Pod logs show:
[paste kubectl logs output]
Environment differences:
- Production uses Workload Identity for AWS access
- Production has Istio sidecar injection
- Production has strict PodSecurityPolicy
What's likely wrong?"
AI Response: "The container is trying to access AWS Secrets Manager but the ServiceAccount doesn't have the correct annotation for Workload Identity..."
Fixed in 5 minutes. Would have taken 30-45 minutes of trial and error.
Example 3: Terraform State Surgery
What I needed: A resource was manually modified in AWS console, now Terraform wants to destroy and recreate it.
Prompt:
"This RDS instance was manually modified. Terraform plan shows:
[paste plan output showing replacement]
I need to update the Terraform state to match current AWS state without destroying the database. Walk me through the process."
AI gave me:
1. Pull current state: terraform state pull > backup.tfstate
2. Get actual AWS values: aws rds describe-db-instances ...
3. Remove the stale entry from state: terraform state rm aws_db_instance.main
4. Re-import the live resource: terraform import ...
5. Update the code to match the imported attributes
Saved the database from accidental destruction.
Common Mistakes (And How to Avoid Them)
Mistake 1: Trusting Default Suggestions
Problem: AI suggests internet-facing load balancer when you wanted internal.
Fix: Always specify "internal" or "internet-facing" explicitly in your prompt. Don't assume the AI will infer from context.
Mistake 2: Not Checking for Breaking Changes
Problem: AI refactors your Terraform and accidentally changes resource names, causing Terraform to want to destroy and recreate.
Fix: Before accepting any refactoring, run terraform plan and look for -/+ symbols (replace). Ask AI: "Will this force resource replacement?"
Mistake 3: Applying Directly to Production
Problem: You're in a rush, AI code looks good, you apply straight to prod. It breaks.
Fix: No exceptions. Always apply to dev first. Set up CI/CD that enforces this.
Mistake 4: Not Version Controlling the Conversation
Problem: Three months later you need to know why a particular configuration was chosen. You have no record.
Fix: Save the AI conversation as docs/decisions/YYYY-MM-DD-why-we-chose-X.md in your repo.
Mistake 5: Treating AI Like a Magic Button
Problem: You ask AI to "set up full AWS infrastructure for my app" and expect it to work perfectly.
Fix: Break large requests into small, testable pieces. Build incrementally. Validate each layer.
Tool Recommendations
For Writing Terraform
Best: Claude Opus 4.6 (better at understanding complex state and dependencies)
Alternative: GPT-4.1 Turbo (faster for simple modules)
Free option: Claude Sonnet 4.5 (surprisingly good)
For Kubernetes Manifests
Best: Claude Sonnet 4.5 (handles YAML indentation correctly)
Alternative: Gemini 2.5 Pro (good at Helm charts)
For Debugging
Best: Claude Opus 4.6 (best at reasoning through complex error chains)
For Learning/Explaining
Best: GPT-4.1 (clearest explanations)
Use my Token Counter to estimate costs before sending large Terraform files to expensive models. Claude Sonnet often works for 80% of tasks at 1/10th the cost of Opus.
Advanced: Teaching AI Your Environment
If you use AI regularly for infrastructure work, create a "context file" you paste at the start of every conversation:
# Our Infrastructure Context
## Naming Convention
{project}-{environment}-{resource}-{purpose}
Example: payments-prod-rds-primary
## Tagging Requirements (all resources)
- Owner: team-name
- Environment: dev | staging | prod
- CostCenter: 4-digit code
- ManagedBy: terraform | cloudformation | manual
## AWS Defaults
- Region: us-west-2 (primary), us-east-1 (DR)
- VPC CIDR: 10.100.0.0/16 (prod), 10.101.0.0/16 (staging)
- Encryption: Always enabled, use aws/kms or custom KMS keys
- Versioning: Enabled on all S3 buckets
- Lifecycle: S3 IA after 30 days, Glacier after 90 days
## Kubernetes Standards
- Base images: only from company ECR
- Resource limits: required on all deployments
- Probes: both liveness and readiness required
- Service mesh: Istio with automatic sidecar injection
- Secrets: External Secrets Operator + AWS Secrets Manager
## Security Requirements
- No 0.0.0.0/0 ingress except load balancers
- IAM: least privilege, no *:* permissions
- Secrets: never in code, always in Secrets Manager
- Public access: must be approved by security team
## Terraform Standards
- Version: >= 1.5
- State: S3 backend with DynamoDB locking
- Modules: internal registry only
- Variables: always have descriptions and validation
- Outputs: prefix with module name
Save this file, paste it at the start of infrastructure conversations. The AI will follow your patterns instead of generic best practices.
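A context file like this can double as an automated check on whatever the AI produces. A hedged Python sketch that validates names and tags against the conventions above (the regex and tag set mirror this example file, not any standard):

```python
import re

# {project}-{environment}-{resource}-{purpose}, per the context file above
NAME_PATTERN = re.compile(r"^[a-z0-9]+-(dev|staging|prod)-[a-z0-9]+-[a-z0-9]+$")
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter", "ManagedBy"}

def check_resource(name, tags):
    """Return a list of convention violations for one resource."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(
            f"name '{name}' does not match "
            "{project}-{environment}-{resource}-{purpose}")
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))
    return problems

print(check_resource("payments-prod-rds-primary",
                     {"Owner": "payments", "Environment": "prod",
                      "CostCenter": "1234", "ManagedBy": "terraform"}))  # []
```

Feed it the names and tags from your plan JSON and you catch convention drift before review, whether a human or an AI wrote the code.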
The Future: Where This Is Going
Three things I'm seeing emerge:
1. AI-Powered Policy as Code
Instead of manually writing Sentinel or OPA policies, describe your requirements in English and let AI generate the policy code. I'm doing this now for AWS Service Control Policies.
2. Autonomous Infrastructure Refactoring
AI that can analyze your entire Terraform codebase, suggest improvements, and generate PRs. We're close to this. Claude Code is already capable of refactoring entire modules.
3. Self-Healing Infrastructure
AI agents that detect drift, generate the fix, test in a sandbox, and auto-apply if safe. This is what I'm building now with Claude + CloudWatch + Lambda.
The gap between "AI wrote it" and "production-ready" is shrinking fast.
Checklist: Before You Use AI for Infrastructure
- Have you provided your existing patterns and conventions?
- Have you specified all constraints explicitly?
- Will you run terraform plan before applying?
- Will you run compliance scanning (Checkov/tfsec)?
- Are you applying to dev first, not production?
- Have you tested rollback?
- Have you documented why you made this change?
- Will this change be reviewed by another human?
If you answered no to any of these, stop. Fix that first.
What to Do Next
If you're new to using AI for infrastructure:
- Start small: Use AI to explain existing Terraform you inherited
- Build templates: Ask AI to generate module templates following your conventions
- Debug faster: Paste errors and let AI suggest fixes
- Learn patterns: Ask AI to critique your code and suggest improvements
If you're already using AI:
- Create your context file (see Advanced section above)
- Set up validation pipelines (Checkov + automated plan review)
- Document your prompts (what works, what doesn't)
- Train your team (share the patterns that work)
The goal isn't to let AI write all your infrastructure code unsupervised. The goal is to move faster while maintaining the same level of quality and safety.
Tools I Built for This
I got tired of switching between tools, so I built a free suite that helps with AI-assisted infrastructure work:
- Token Counter: Estimate costs before pasting large Terraform files into expensive models
- AI Output Parser: Extract clean JSON or YAML from AI responses buried in markdown
- YAML Validator: Validate Kubernetes manifests AI generates before applying
- Diff Checker: Compare AI-suggested changes side-by-side with your existing code
- Prompt Eval Suite: Score your infrastructure prompts against 12 best practices
- JSON to TypeScript: Convert AWS API responses to TypeScript for CDK projects
All free, no signup, run entirely in your browser.
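The output-parsing problem in particular is small enough to sketch: AI replies usually bury the useful JSON or YAML inside a markdown fence. A minimal Python extractor (illustrative only, not the tool above; backticks are spelled \x60 so the snippet nests cleanly in this article):

```python
import re

# Matches ``` fences with an optional json/yaml language tag.
FENCE = re.compile("\x60{3}(?:json|yaml|yml)?[ \t]*\n(.*?)\x60{3}", re.DOTALL)

def extract_code_blocks(ai_response):
    """Pull the contents of fenced code blocks out of a markdown AI reply."""
    return [m.strip() for m in FENCE.findall(ai_response)]

reply = ("Here is your manifest:\n"
         "\x60\x60\x60yaml\nkind: Deployment\n\x60\x60\x60\n"
         "Let me know if you need anything else!")
print(extract_code_blocks(reply))  # ['kind: Deployment']
```

Pipe the result straight into a YAML validator instead of hand-copying from the chat window.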
AI coding assistants won't replace DevOps engineers. But DevOps engineers using AI assistants effectively will replace those who don't.
The difference is knowing what to ask, how to validate, and when to push back.
Further Reading:
- Building Self-Healing Infrastructure with Claude Code
- Prompt Engineering for Cloud Engineers
- Automating AWS Operations Controls with AI
- The Definitive Guide to Setting Up Claude Code
Questions? Email me at phaqqani@gmail.com or find me on LinkedIn.